[M3devel] This disgusting TEXT business
Darko
darko at darko.org
Mon Dec 22 01:07:39 CET 2008
On 23/12/2008, at 12:54 AM, hendrik at topoi.pooq.com wrote:
> On Mon, Dec 22, 2008 at 06:07:21AM +0900, Darko wrote:
>> The way to handle this would be to require an iterate function on each
>> text representation which successively returned the next character in
>> a string as a UNICHAR with defined characteristics. That would allow
>> for comparison and transfer of strings.
>>
>> The way I see it, TEXT should be revealed to be a subtype of the
>> "TextObj" object that has methods that basically mirror the Text
>> interface with some additions. The existing text representation code
>> is adapted to this object (fairly simple) and new representations
>> subclass that object. A new function is added to the Text interface that
>> allows the user to specify the default representation used when
>> creating new strings. This would be the typecode of the object to
>> allocate, required to be a subtype of TextObj of course. Some TextObj
>> methods would be required to be implemented by the representation but
>> most could be handled by the ancestor. Many of the methods could be
>> implemented by the encoding for more efficient handling, for instance
>> when the text being compared or concatenated is the same
>> representation.
>>
>> UNICHAR = <enumeration containing all Unicode characters>;
>>
>> TextObj = OBJECT METHODS
>>
>> length(t: T): CARDINAL; (* required to be implemented by the
>> encoding *)
>
> How is T related to TextObj? Does this return the number of characters
> or the number of bytes occupied by the string?
Sorry, T = TEXT. It returns the logical number of characters according
to some definition, which I imagine would be Unicode NFC (Normalization
Form C, i.e. canonical composition).
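
As a rough illustration of the difference between a logical length and a
byte length (sketched in Python rather than Modula-3 so it can be run
as-is; logical_length is a hypothetical name, and counting code points
after NFC normalisation is only one possible definition of "logical
character"):

```python
import unicodedata


def logical_length(s: str) -> int:
    """Count code points after normalising to NFC (one possible definition)."""
    return len(unicodedata.normalize("NFC", s))


decomposed = "e\u0301"  # 'e' followed by COMBINING ACUTE ACCENT
composed = "\u00e9"     # the precomposed form of the same character

# Both spellings have the same logical length under NFC,
# even though their UTF-8 encodings differ in size.
same_logical_length = logical_length(decomposed) == logical_length(composed)
decomposed_bytes = len(decomposed.encode("utf-8"))
composed_bytes = len(composed.encode("utf-8"))
```

The point is only that length() answers a different question from the
byte count passed to setData().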
>
>
>> empty(t: T): BOOLEAN; (* required to be implemented by the encoding
>> *)
>> hasWideChars(t: T): BOOLEAN; (* required - meaning containing
>> anything other than CHAR values *)
>
> Does this require a scan of the string to determine whether any wide
> characters are actually present, or does it just indicate whether the
> data representation used in this string is capable of handling wide
> characters?
Neither. It doesn't require a scan, because the flag can be cached. It's
there so you can take a more efficient path when you know that you only
have to deal with CHAR values.
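
A minimal sketch of the caching idea, in Python rather than Modula-3 so
it can be run as-is. CachedText and its methods are hypothetical names,
and the sketch assumes CHAR covers code points 0..255:

```python
class CachedText:
    """Toy text object that remembers whether it holds any non-CHAR values."""

    def __init__(self, chars: str):
        self._chars = chars
        # One scan when the text is created; assumes CHAR means 0..255.
        self._has_wide = any(ord(c) > 0xFF for c in chars)

    def has_wide_chars(self) -> bool:
        # Constant time: returns the cached flag, no scan.
        return self._has_wide

    def cat(self, other: "CachedText") -> "CachedText":
        # The flag of a concatenation follows from the flags of its parts,
        # so the result is built without rescanning either operand.
        result = CachedText.__new__(CachedText)
        result._chars = self._chars + other._chars
        result._has_wide = self._has_wide or other._has_wide
        return result
```

Operations such as cat can carry the flag forward from their operands, so
the cost of the scan is paid at most once per freshly created text.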
>
>
>> next(VAR index: INTEGER; VAR: char: UNICHAR; seek: CARDINAL := 1):
>> BOOLEAN; (* required - start iterating with index=0, returns the next
>> logical character in char and returns true, or false if the char
>> doesn't exist. Index otherwise meaningless and private. Seek allows
>> skipping forward a number of characters. *)
>
> Do we need a colon after the second VAR?
No, it's a typo.
> This seems intended for an implementation where index is a byte offset.
> Is it possible to copy index so as to start a new iteration at a
> saved point?
Yes.
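
To make that concrete, here is a hedged sketch of such an iterator over a
UTF-8 representation, in Python rather than Modula-3 so it can be run
as-is. next_char is a hypothetical stand-in for the next method. It
assumes well-formed UTF-8, returns the new index instead of using VAR
parameters, and treats seek as "return the seek-th character from here",
which is only one reading of the description above:

```python
from typing import Optional, Tuple


def next_char(data: bytes, index: int, seek: int = 1) -> Optional[Tuple[int, int]]:
    """Return (new_index, code_point), or None if the character doesn't exist.

    index is a byte offset, but callers should treat it as opaque.
    Assumes data is well-formed UTF-8.
    """
    if seek < 1:
        raise ValueError("seek must be at least 1")
    code = 0
    for _ in range(seek):
        if index >= len(data):
            return None
        lead = data[index]
        if lead < 0x80:        # 1-byte sequence (ASCII)
            size, code = 1, lead
        elif lead < 0xE0:      # 2-byte sequence
            size, code = 2, lead & 0x1F
        elif lead < 0xF0:      # 3-byte sequence
            size, code = 3, lead & 0x0F
        else:                  # 4-byte sequence
            size, code = 4, lead & 0x07
        if index + size > len(data):
            return None        # truncated sequence at the end of the data
        for b in data[index + 1:index + size]:
            code = (code << 6) | (b & 0x3F)  # fold in 6 bits per continuation byte
        index += size
    return index, code


data = "a\u00e9\u20acb".encode("utf-8")
saved, _ = next_char(data, 0)    # remember the position after 'a'
first_pass = next_char(data, saved)
second_pass = next_char(data, saved)  # restarting from the saved index works
```

Because the index is a plain value, saving it is just a copy, and
restarting from the copy yields the same character again.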
>
>
>> getData(): ADDRESS; (* required - get the raw data, only valid
>> while on the stack *)
>
> What stack?
The execution stack. It might be a REFANY, so the address could change if
the reference didn't appear on the stack.
>
>
>> setData(adr: ADDRESS; length: CARDINAL); (* required - set the data
>> and length of the encoded data *)
>
> length in bytes or characters?
Bytes. It's the length of the raw data and has no logical
interpretation.
>
>
>>
>> (* the remaining methods are optional for the encoding
>> implementation *)
>> equal(t, u: T): BOOLEAN;
>> compare(t1, t2: T): [-1..1];
>> cat(t, u: T): T;
>> sub(t: T; start: CARDINAL; length: CARDINAL := LAST(CARDINAL)): T;
>> hash(t: T): Word.T;
>> getChar(t: T; i: CARDINAL): CHAR;
>> getWideChar(t: T; i: CARDINAL): WIDECHAR;
>> getUniChar(t: T; i: CARDINAL): UNICHAR;
>
> I'd really like functions that provide access to the underlying UTF8
> encoding on a byte-by-byte basis. Often that's the most efficient and
> simplest way to process UTF8-encoded data, and UTF8 was designed with
> this in mind.
If you know the text is encoded as UTF-8 then you can get the raw data
with getData, as above.
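
For what it's worth, a small sketch of the kind of byte-by-byte
processing that works directly on raw UTF-8 data (Python rather than
Modula-3 so it can be run as-is; both function names are hypothetical
and the input is assumed to be well-formed UTF-8):

```python
def count_code_points(raw: bytes) -> int:
    """Count code points by skipping continuation bytes (10xxxxxx)."""
    return sum(1 for b in raw if (b & 0xC0) != 0x80)


def find_ascii(raw: bytes, ch: str, start: int = 0) -> int:
    """Return the byte offset of an ASCII character, or -1 if absent.

    Safe on UTF-8 without decoding: bytes below 0x80 never occur inside
    a multi-byte sequence.
    """
    return raw.find(ch.encode("ascii"), start)


raw = "na\u00efve,caf\u00e9".encode("utf-8")
comma_offset = find_ascii(raw, ",")
```

Note that count_code_points counts code points, not characters under any
normalisation, so it need not agree with length().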
>
>
>> setChars(VAR a: ARRAY OF CHAR; t: T; start: CARDINAL := 0);
>> setWideChars(VAR a: ARRAY OF WIDECHAR; t: T; start: CARDINAL := 0);
>> setUniChars(VAR a: ARRAY OF UNICHAR; t: T; start: CARDINAL := 0);
>> fromChar(ch: CHAR): T;
>> fromWideChar(ch: WIDECHAR): T;
>> fromUniChar(ch: UNICHAR): T;
>> fromChars(READONLY a: ARRAY OF CHAR): T;
>> fromWideChars(READONLY a: ARRAY OF WIDECHAR): T;
>> fromUniChars(READONLY a: ARRAY OF UNICHAR): T;
>> findChar(t: T; c: CHAR; start := 0): INTEGER;
>> findWideChar(t: T; c: WIDECHAR; start := 0): INTEGER;
>> findUniChar(t: T; c: UNICHAR; start := 0): INTEGER;
>> findCharR(t: T; c: CHAR; start := LAST(INTEGER)): INTEGER;
>> findWideCharR(t: T; c: WIDECHAR; start := LAST(INTEGER)): INTEGER;
>> findUniCharR(t: T; c: UNICHAR; start := LAST(INTEGER)): INTEGER;
>> END;
>>
>> Additionally the Text and a couple of other interfaces (eg Rd, Wr)
>> would need to be expanded to handle UniChar.
>>
>
> While we are at it, let's include byte access as well. Treating bytes
> as integers between 0 and 255 would work.
I can't see the point, since you can apply ORD to a CHAR value to get the
same thing.
>
>
> -- hendrik
>
> Note: when I'm processing UTF8 data, I'm often parsing it and inserting
> other tokens (which are not UTF8 characters) into the data stream as
> well. It is convenient to use negative numbers for this, clearly
> distinguished from Unicode codepoints, which are positive. So making
> characters available as positive integers would work cleanly with
> this.
>
>