[M3devel] This disgusting TEXT business

hendrik at topoi.pooq.com hendrik at topoi.pooq.com
Mon Dec 22 16:54:38 CET 2008


On Mon, Dec 22, 2008 at 06:07:21AM +0900, Darko wrote:
> The way to handle this would be to require an iterate function on each  
> text representation which successively returned the next character in  
> a string as a UNICHAR with defined characteristics. That would allow  
> for comparison and transfer of strings.
> 
> The way I see it, TEXT should be revealed to be a subtype of the  
> "TextObj" object that has methods that basically mirror the Text  
> interface with some additions. The existing text representation code  
> is adapted to this object (fairly simple) and new representations  
> subclass that object. A new function is added to text interface that  
> allows the user to specify the default representation used when  
> creating new strings. This would be the typecode of the object to  
> allocate, required to be a subtype of TextObj of course. Some TextObj  
> methods would be required to be implemented by the representation but  
> most could be handled by the ancestor. Many of the methods could be  
> implemented by the encoding for more efficient handling, for instance  
> when the text being compared or concatenated is the same representation.
> 
> UNICHAR = <enumeration containing all Unicode characters>;
> 
> TextObj = OBJECT METHODS
> 
>   length(t: T): CARDINAL; (* required to be implemented by the  
> encoding *)

How is T related to TextObj?  Does this return the number of characters 
or the number of bytes occupied by the string?
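
To make the characters-versus-bytes question concrete, here is a minimal 
sketch.  It is written in Python rather than Modula-3 only so that it can 
be run as-is; the function name `lengths` is made up for illustration and 
is not part of the proposed interface.

```python
def lengths(s: str) -> tuple[int, int]:
    """Return (number of code points, number of bytes in UTF-8)."""
    # len() on a Python str counts code points; len() on bytes counts bytes.
    return len(s), len(s.encode("utf-8"))


# "naive" with i-diaeresis (U+00EF), a space, and the euro sign (U+20AC).
sample = "na\u00efve \u20ac"
chars, octets = lengths(sample)
print(chars, octets)
```

For any text containing non-ASCII characters the two numbers differ, so a 
`length` method has to say which one it reports.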

>   empty(t: T): BOOLEAN; (* required to be implemented by the encoding  
> *)
>   hasWideChars(t: T): BOOLEAN; (* required - meaning containing  
> anything other than CHAR values *)

Does this require a scan of the string to determine whether any wide 
characters are actually present, or does it just indicate whether the 
data representation used in this string is capable of handling wide 
characters?

>   next(VAR index: INTEGER; VAR: char: UNICHAR; seek: CARDINAL := 1):  
> BOOLEAN; (* required - start iterating with index=0, returns the next  
> logical character in char and returns true, or false if the char  
> doesn't exist. Index otherwise meaningless and private. Seek allows  
> skipping forward a number of characters. *)

Do we need a colon after the second VAR?
This seems intended for an implementation where index is a byte offset.
Is it possible to copy index so as to start a new iteration at a 
saved point?
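
The sketch below shows one way an iterator of this kind could work if the 
index really is a byte offset into UTF8 data.  It is an illustration under 
that assumption, not the proposed implementation: it is in Python rather 
than Modula-3 so it can be run directly, the name `next_char` is 
hypothetical, and the input is assumed to be well-formed UTF8 (no 
validation is done).

```python
def next_char(data: bytes, index: int):
    """Decode the code point starting at byte offset `index`.

    Returns (code_point, new_index), or None when index is at the end.
    Assumes `data` is well-formed UTF-8 and `index` is on a character
    boundary.
    """
    if index >= len(data):
        return None
    first = data[index]
    # The lead byte tells us how many bytes the character occupies.
    if first < 0x80:
        size, cp = 1, first
    elif first < 0xE0:
        size, cp = 2, first & 0x1F
    elif first < 0xF0:
        size, cp = 3, first & 0x0F
    else:
        size, cp = 4, first & 0x07
    # Each continuation byte contributes its low six bits.
    for b in data[index + 1:index + size]:
        cp = (cp << 6) | (b & 0x3F)
    return cp, index + size


data = "a\u00e9\u20ac".encode("utf-8")
cp, i = next_char(data, 0)       # first character
saved = i                        # the index is a plain integer, so it copies
cp, i = next_char(data, i)       # second character
again, _ = next_char(data, saved)  # restart from the saved point
```

With a byte-offset index, saving and resuming is just copying an integer.  
If the index is instead opaque and private to the representation, the 
interface would need to state whether a copied value remains valid.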

>   getData(): ADDRESS; (* required - get the raw data, only valid  
> while on the stack *)

What stack?

>   setData(adr: ADDRESS; length: CARDINAL); (* required - set the data  
> and length of the encoded data *)

length in bytes or characters?

> 
>   (* the remaining methods are optional for the encoding  
> implementation *)
>   equal(t, u: T): BOOLEAN;
>   compare(t1, t2: T): [-1..1];
>   cat(t, u: T): T;
>   sub(t: T; start: CARDINAL;length: CARDINAL := LAST(CARDINAL)): T;
>   hash(t: T): Word.T;
>   getChar(t: T; i: CARDINAL): CHAR;
>   getWideChar(t: T; i: CARDINAL): WIDECHAR;
>   getUniChar(t: T; i: CARDINAL): UNICHAR;

I'd really like functions that provide access to the underlying UTF8 
encoding on a byte-by-byte basis.  That's often the simplest and most 
efficient way to process UTF8-encoded data, and UTF8 was designed with 
this in mind.
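
As an illustration of why byte-level access is convenient, the sketch 
below relies on two properties of UTF8: continuation bytes always have the 
bit pattern 10xxxxxx, and bytes below 128 never occur inside a multi-byte 
sequence.  It is in Python rather than Modula-3 so it can be run as-is, 
and the function names are made up for this example.

```python
def count_code_points(data: bytes) -> int:
    """Count characters by counting every byte that is not a continuation."""
    return sum(1 for b in data if (b & 0xC0) != 0x80)


def find_ascii(data: bytes, ch: int, start: int = 0) -> int:
    """Return the byte offset of ASCII byte `ch` at or after `start`, or -1.

    A plain byte comparison is safe because an ASCII value can never be
    part of a multi-byte UTF-8 sequence.
    """
    for i in range(start, len(data)):
        if data[i] == ch:
            return i
    return -1


text = "\u20ac,x".encode("utf-8")   # euro sign, comma, 'x'
print(count_code_points(text), find_ascii(text, ord(",")))
```

Neither function decodes a single character, and iterating over the bytes 
yields integers between 0 and 255.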

>   setChars(VAR a: ARRAY OF CHAR; t: T; start: CARDINAL := 0);
>   setWideChars(VAR a: ARRAY OF WIDECHAR; t: T; start: CARDINAL := 0);
>   setUniChars(VAR a: ARRAY OF UNICHAR; t: T; start: CARDINAL := 0);
>   fromChar(ch: CHAR): T;
>   fromWideChar(ch: WIDECHAR): T;
>   fromUniChar(ch: UNICHAR): T;
>   fromChars(READONLY a: ARRAY OF CHAR): T;
>   fromWideChars(READONLY a: ARRAY OF WIDECHAR): T;
>   fromUniChars(READONLY a: ARRAY OF UNICHAR): T;
>   findChar(t: T; c: CHAR; start := 0): INTEGER;
>   findWideChar(t: T; c: WIDECHAR; start := 0): INTEGER;
>   findUniChar(t: T; c: UNICHAR; start := 0): INTEGER;
>   findCharR(t: T; c: CHAR; start := LAST(INTEGER)): INTEGER;
>   findWideCharR(t: T; c: WIDECHAR; start := LAST(INTEGER)): INTEGER;
>   findUniCharR(t: T; c: UNICHAR; start := LAST(INTEGER)): INTEGER;
> END;
> 
> Additionally the Text and a couple of other interfaces (eg Rd, Wr)  
> would need to be expanded to handle UniChar.
> 

While we are at it, let's include byte access as well.  Treating bytes 
as integers between 0 and 255 would work.

-- hendrik

Note: when I'm processing UTF8 data, I'm often parsing it and inserting 
other tokens (which are not UTF8 characters) into the data stream as 
well.  It is convenient to use negative numbers for these tokens, which 
clearly distinguishes them from Unicode code points, which are never 
negative.  So making characters available as non-negative integers would 
work cleanly with this.
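
A minimal sketch of the mixed stream described in this note, assuming 
characters are delivered as non-negative integers.  It is in Python rather 
than Modula-3 so it can be run directly; the token `END_OF_LINE` and the 
function names are hypothetical, chosen only for the example.

```python
END_OF_LINE = -1   # hypothetical parser token; any negative value would do


def tokenize(text: str) -> list[int]:
    """Turn text into a stream of code points with newlines replaced
    by a negative token."""
    out = []
    for ch in text:
        if ch == "\n":
            out.append(END_OF_LINE)
        else:
            out.append(ord(ch))   # code points are always >= 0
    return out


def is_token(value: int) -> bool:
    """Tokens are negative, so a sign test separates them from characters."""
    return value < 0


stream = tokenize("a\n\u20ac")
print(stream, [is_token(v) for v in stream])
```

Because no code point is negative, the two kinds of value can share one 
integer stream without any extra tagging.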




