[M3devel] This disgusting TEXT business

Mon Dec 22 01:07:39 CET 2008

On 23/12/2008, at 12:54 AM, hendrik at topoi.pooq.com wrote:

> On Mon, Dec 22, 2008 at 06:07:21AM +0900, Darko wrote:
>> The way to handle this would be to require an iterate function on each
>> text representation which successively returned the next character in
>> a string as a UNICHAR with defined characteristics. That would allow
>> for comparison and transfer of strings.
>>
>> The way I see it, TEXT should be revealed to be a subtype of the
>> "TextObj" object that has methods that basically mirror the Text
>> interface with some additions. The existing text representation code
>> is adapted to this object (fairly simple) and new representations
>> subclass that object. A new function is added to text interface that
>> allows the user to specify the default representation used when
>> creating new strings. This would be the typecode of the object to
>> allocate, required to be a subtype of TextObj of course. Some TextObj
>> methods would be required to be implemented by the representation but
>> most could be handled by the ancestor. Many of the methods could be
>> implemented by the encoding for more efficient handling, for instance
>> when the text being compared or concatenated is the same representation.
>>
>> UNICHAR = <enumeration containing all Unicode characters>;
>>
>> TextObj = OBJECT METHODS
>>
>>  length(t: T): CARDINAL; (* required to be implemented by the
>> encoding *)
>
> How is T related to TextObj?  Does this return the number of characters
> or the number of bytes occupied by the string?

Sorry, T = TEXT. It returns the logical number of characters according
to some definition, which I imagine would be Unicode NFC (Normalization
Form C, Canonical Composition).
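
As a rough illustration of why a logical length differs from a byte count,
here is a minimal sketch in Python rather than Modula-3. It is not part of
the proposal: it assumes NFC is the chosen definition of a logical
character and UTF-8 is the raw encoding, and the name logical_length is
hypothetical.

```python
import unicodedata


def logical_length(s: str) -> int:
    """Count code points after NFC normalisation (the assumed definition)."""
    return len(unicodedata.normalize("NFC", s))


# 'e' followed by COMBINING ACUTE ACCENT composes to a single character.
decomposed = "e\u0301"
print(logical_length(decomposed))         # logical characters under NFC
print(len(decomposed.encode("utf-8")))    # bytes of raw UTF-8 data
```

Under these assumptions the same string has a logical length of 1 and a
raw length of 3 bytes, which is the distinction between length() and the
length argument of setData().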

>
>
>>  empty(t: T): BOOLEAN; (* required to be implemented by the encoding
>> *)
>>  hasWideChars(t: T): BOOLEAN; (* required - meaning containing
>> anything other than CHAR values *)
>
> Does this require a scan of the string to determine whether any wide
> characters are actually present, or does it just indicate whether the
> data representation used in this string is capable of handling wide
> characters?

No to both. It doesn't require a scan; you can cache this flag. It's
there so you can possibly be more efficient when you know that you only
have to deal with CHAR values.
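
One way the flag could be cached, sketched in Python rather than
Modula-3. The class Utf8Text and its fields are hypothetical, and the
sketch assumes CHAR covers code points 0..255, so anything above that
counts as wide.

```python
class Utf8Text:
    """Hypothetical representation: raw UTF-8 bytes plus a cached flag."""

    def __init__(self, s: str):
        self.data = s.encode("utf-8")
        # Computed once at creation time; assumes CHAR holds 0..255.
        self._wide = any(ord(c) > 0xFF for c in s)

    def has_wide_chars(self) -> bool:
        # No scan here: just return the cached answer.
        return self._wide


print(Utf8Text("plain").has_wide_chars())
print(Utf8Text("price: \u20ac5").has_wide_chars())
```

The cost of the scan is paid when the text is built, where the data is
being visited anyway, so later callers can pick a CHAR-only fast path
cheaply.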

>
>
>>  next(VAR index: INTEGER; VAR: char: UNICHAR; seek: CARDINAL := 1):
>> BOOLEAN; (* required - start iterating with index=0, returns the next
>> logical character in char and returns true, or false if the char
>> doesn't exist. Index otherwise meaningless and private. Seek allows
>> skipping forward a number of characters. *)
>
> Do we need a colon after the second VAR?

No, it's a typo.

> This seems intended for an implementation where index is a byte offset.
> Is it possible to copy index so as to start a new iteration at a
> saved point?

Yes.
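
A minimal sketch of such an iterator, in Python rather than Modula-3,
assuming a UTF-8 representation in which the index happens to be a byte
offset. The function next_char is hypothetical, it omits the seek
parameter, and it assumes the data is valid UTF-8 with the index on a
character boundary.

```python
from typing import Optional, Tuple


def next_char(data: bytes, index: int) -> Optional[Tuple[int, int]]:
    """Return (code point, new index), or None when no character is left."""
    if index >= len(data):
        return None
    lead = data[index]
    # The lead byte determines how many bytes the character occupies.
    if lead < 0x80:
        size = 1
    elif lead < 0xE0:
        size = 2
    elif lead < 0xF0:
        size = 3
    else:
        size = 4
    ch = data[index:index + size].decode("utf-8")
    return ord(ch), index + size


data = "a\u00e9\u20ac".encode("utf-8")
first = next_char(data, 0)
saved = first[1]              # a copy of the index is a resumable position

# Start a new iteration from the saved point.
rest = []
step = next_char(data, saved)
while step is not None:
    rest.append(step[0])
    step = next_char(data, step[1])
print([hex(cp) for cp in rest])
```

Because the index is a plain value, copying it is enough to resume later;
callers still treat it as opaque, since another representation might use
something other than a byte offset.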

>
>
>>  getData(): ADDRESS; (* required - get the raw data, only valid
>> while on the stack *)
>
> What stack?

The execution stack. The data might be a REFANY, so its address could
change if no reference to it appeared on the stack.

>
>
>>  setData(adr: ADDRESS; length: CARDINAL); (* required - set the data
>> and length of the encoded data *)
>
> length in bytes or characters?

Bytes. It's the length of the raw data and has no logical
interpretation.

>
>
>>
>>  (* the remaining methods are optional for the encoding
>> implementation *)
>>  equal(t, u: T): BOOLEAN;
>>  compare(t1, t2: T): [-1..1];
>>  cat(t, u: T): T;
>>  sub(t: T; start: CARDINAL; length: CARDINAL := LAST(CARDINAL)): T;
>>  hash(t: T): Word.T;
>>  getChar(t: T; i: CARDINAL): CHAR;
>>  getWideChar(t: T; i: CARDINAL): WIDECHAR;
>>  getUniChar(t: T; i: CARDINAL): UNICHAR;
>
> I'd really like functions that provide access to the underlying UTF8
> encoding on a byte-by-byte basis.  Often that's the most efficient way,
> and the simplest, to process UTF8-encoded data, and UTF8 was
> designed with this in mind.

If you know the text is encoded as UTF-8 then you can get the raw data  
as above.
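
For example, once the raw bytes are in hand and known to be UTF-8, a
byte-by-byte pass needs no decoding at all. The sketch below is Python
rather than Modula-3, and count_code_points is a hypothetical helper that
assumes valid UTF-8 input.

```python
def count_code_points(raw: bytes) -> int:
    """Count code points by skipping continuation bytes (10xxxxxx)."""
    return sum(1 for b in raw if (b & 0xC0) != 0x80)


raw = "a\u00e9\u20ac".encode("utf-8")   # 1-, 2- and 3-byte characters
print(len(raw), count_code_points(raw))
```

This only works when the caller already knows the encoding, which is why
it sits outside the representation-independent methods.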

>
>
>>  setChars(VAR a: ARRAY OF CHAR; t: T; start: CARDINAL := 0);
>>  setWideChars(VAR a: ARRAY OF WIDECHAR; t: T; start: CARDINAL := 0);
>>  setUniChars(VAR a: ARRAY OF UNICHAR; t: T; start: CARDINAL := 0);
>>  fromChar(ch: CHAR): T;
>>  fromWideChar(ch: WIDECHAR): T;
>>  fromUniChar(ch: UNICHAR): T;
>>  fromChars(READONLY a: ARRAY OF CHAR): T;
>>  fromWideChars(READONLY a: ARRAY OF WIDECHAR): T;
>>  fromUniChars(READONLY a: ARRAY OF UNICHAR): T;
>>  findChar(t: T; c: CHAR; start := 0): INTEGER;
>>  findWideChar(t: T; c: WIDECHAR; start := 0): INTEGER;
>>  findUniChar(t: T; c: UNICHAR; start := 0): INTEGER;
>>  findCharR(t: T; c: CHAR; start := LAST(INTEGER)): INTEGER;
>>  findWideCharR(t: T; c: WIDECHAR; start := LAST(INTEGER)): INTEGER;
>>  findUniCharR(t: T; c: UNICHAR; start := LAST(INTEGER)): INTEGER;
>> END;
>>
>> Additionally the Text and a couple of other interfaces (eg Rd, Wr)
>> would need to be expanded to handle UniChar.
>>
>
> While we are at it, let's include byte access as well.  Treating bytes
> as integers between 0 and 255 would work.

Can't see the point since you can ORD a char value to get the same.

>
>
> -- hendrik
>
> Note: when I'm processing UTF8 data, I'm often parsing it and inserting
> other tokens (which are not UTF8 characters) into the data stream as
> well.  It is convenient to use negative numbers for this, clearly
> distinguished from Unicode codepoints, which are positive.  So making
> characters available as positive integers would work cleanly with this.
>
>