[M3devel] This disgusting TEXT business

Darko darko at darko.org
Mon Dec 22 01:14:16 CET 2008


A couple of corrections:

- T = TEXT and TextObj <: T so it should read TextObj = T OBJECT
- the cat method is required for the encoding implementation (helps to  
actually be able to create strings)
- the CARDINAL in the next method should be INTEGER to enable  
searching backwards (findCharR) and should probably be named 'iter'.
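
To make the iteration idea from the quoted message below more concrete, here is a rough sketch in Python rather than Modula-3. The class names `Latin1Text` and `Utf8Text`, the function `text_equal`, and the convention of returning a `(new_index, char)` pair instead of using VAR parameters are all assumptions made for illustration; none of them come from the proposal itself. It only shows how two different representations could be compared through a shared `next` operation with an opaque index.

```python
class Latin1Text:
    """Hypothetical representation: one byte per character (ISO 8859-1)."""

    def __init__(self, data: bytes):
        self.data = data

    def next(self, index):
        # The index is an opaque cursor; here it happens to be a byte offset.
        if index >= len(self.data):
            return None
        return index + 1, chr(self.data[index])


class Utf8Text:
    """Hypothetical representation: variable-width UTF-8."""

    def __init__(self, data: bytes):
        self.data = data

    def next(self, index):
        if index >= len(self.data):
            return None
        first = self.data[index]
        # Sequence width from the lead byte (assumes well-formed UTF-8).
        if first < 0x80:
            width = 1
        elif first < 0xE0:
            width = 2
        elif first < 0xF0:
            width = 3
        else:
            width = 4
        ch = self.data[index:index + width].decode("utf-8")
        return index + width, ch


def text_equal(t, u):
    """Compare two texts character by character, whatever their storage."""
    i = j = 0  # iteration starts with index 0 on both sides
    while True:
        a = t.next(i)
        b = u.next(j)
        if a is None or b is None:
            # Equal only if both texts ended at the same time.
            return a is None and b is None
        (i, ca), (j, cb) = a, b
        if ca != cb:
            return False
```

With this, `text_equal(Latin1Text("für".encode("latin-1")), Utf8Text("für".encode("utf-8")))` compares equal even though the underlying bytes differ. A representation could still override the comparison with something faster when both operands use the same storage.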

On 22/12/2008, at 6:07 AM, Darko wrote:

> The way to handle this would be to require an iterate function on  
> each text representation which successively returned the next  
> character in a string as a UNICHAR with defined characteristics.  
> That would allow for comparison and transfer of strings.
>
> The way I see it, TEXT should be revealed to be a subtype of the  
> "TextObj" object that has methods that basically mirror the Text  
> interface with some additions. The existing text representation code  
> is adapted to this object (fairly simple) and new representations  
> subclass that object. A new function is added to text interface that  
> allows the user to specify the default representation used when  
> creating new strings. This would be the typecode of the object to  
> allocate, required to be a subtype of TextObj of course. Some  
> TextObj methods would be required to be implemented by the  
> representation but most could be handled by the ancestor. Many of  
> the methods could be implemented by the encoding for more efficient  
> handling, for instance when the text being compared or concatenated  
> is the same representation.
>
> UNICHAR = <enumeration containing all Unicode characters>;
>
> TextObj = OBJECT METHODS
>
>  length(t: T): CARDINAL; (* required to be implemented by the  
> encoding *)
>  empty(t: T): BOOLEAN; (* required to be implemented by the encoding  
> *)
>  hasWideChars(t: T): BOOLEAN; (* required - meaning containing  
> anything other than CHAR values *)
>  next(VAR index: INTEGER; VAR char: UNICHAR; seek: CARDINAL := 1):  
> BOOLEAN; (* required - start iterating with index=0, returns the  
> next logical character in char and returns true, or false if the  
> char doesn't exist. Index otherwise meaningless and private. Seek  
> allows skipping forward a number of characters. *)
>  getData(): ADDRESS; (* required - get the raw data, only valid  
> while on the stack *)
>  setData(adr: ADDRESS; length: CARDINAL); (* required - set the data  
> and length of the encoded data *)
>
>  (* the remaining methods are optional for the encoding  
> implementation *)
>  equal(t, u: T): BOOLEAN;
>  compare(t1, t2: T): [-1..1];
>  cat(t, u: T): T;
>  sub(t: T; start: CARDINAL; length: CARDINAL := LAST(CARDINAL)): T;
>  hash(t: T): Word.T;
>  getChar(t: T; i: CARDINAL): CHAR;
>  getWideChar(t: T; i: CARDINAL): WIDECHAR;
>  getUniChar(t: T; i: CARDINAL): UNICHAR;
>  setChars(VAR a: ARRAY OF CHAR; t: T; start: CARDINAL := 0);
>  setWideChars(VAR a: ARRAY OF WIDECHAR; t: T; start: CARDINAL := 0);
>  setUniChars(VAR a: ARRAY OF UNICHAR; t: T; start: CARDINAL := 0);
>  fromChar(ch: CHAR): T;
>  fromWideChar(ch: WIDECHAR): T;
>  fromUniChar(ch: UNICHAR): T;
>  fromChars(READONLY a: ARRAY OF CHAR): T;
>  fromWideChars(READONLY a: ARRAY OF WIDECHAR): T;
>  fromUniChars(READONLY a: ARRAY OF UNICHAR): T;
>  findChar(t: T; c: CHAR; start := 0): INTEGER;
>  findWideChar(t: T; c: WIDECHAR; start := 0): INTEGER;
>  findUniChar(t: T; c: UNICHAR; start := 0): INTEGER;
>  findCharR(t: T; c: CHAR; start := LAST(INTEGER)): INTEGER;
>  findWideCharR(t: T; c: WIDECHAR; start := LAST(INTEGER)): INTEGER;
>  findUniCharR(t: T; c: UNICHAR; start := LAST(INTEGER)): INTEGER;
> END;
>
> Additionally the Text and a couple of other interfaces (eg Rd, Wr)  
> would need to be expanded to handle UniChar.
>
>
>
> On 22/12/2008, at 12:40 AM, Stefan Sperling wrote:
>
>> On Sun, Dec 21, 2008 at 08:08:57AM +0900, Darko wrote:
>>> The right way to do this, IMNSHO is to not assume any particular
>>> representation of TEXT values and create an implementation interface
>>> that allows implementations of multiple text representations, much  
>>> like
>>> Rd and Wr don't make many assumptions about how data is actually  
>>> stored
>>> or retrieved.
>>
>> Such an interface may be needed for UTF-8 alone already, anyway,
>> because within UTF-8 there is in some cases more than one way
>> to store what amounts to the same data to a human user.
>>
>> In Subversion, from the beginning everyone agreed that the internal
>> encoding for all strings would be UTF-8. Most Subversion APIs expect
>> data in UTF-8. Strings (e.g. filenames) in the repository are stored
>> in UTF-8, etc. Great! Will work in all countries! Right?
>>
>> Yes, but not on all operating systems if you're not careful!
>> It did not occur to anyone at the time that there are characters
>> which in Unicode have more than one representation (sequence of
>> code points), and therefore more than one UTF-8 byte stream. For
>> example, a u with umlaut can be encoded as two code points or as
>> a single code point:
>>
>> 2 code points: [u | the previous character has an umlaut ]
>> (U+0075 U+0308, three bytes in UTF-8.)
>> This is called "normal form decomposed" (NFD).
>>
>> 1 code point: [u umlaut] (i.e. ü if you can see this on your terminal :)
>> (U+00FC, two bytes in UTF-8.)
>> This is called "normal form composed" (NFC).
>>
>> If you want to be portable, as CM3 and Subversion both try to be,
>> you have to consider that some operating systems may return your
>> filenames in a different encoding than you stored them in:
>>
>> --------
>>            Accepts   Gives back
>> MacOS X       *      NFD(*)
>> Linux         *      <input>
>> Windows       *      <input>
>> Others        ?      ?
>>
>>
>> *) There are some remarks to be made regarding full or partial
>> NFD here, but the essential thing is: If you send in NFC, don't
>> expect it back!
>> -------- quoted from:
>> http://svn.collab.net/repos/svn/trunk/notes/unicode-composition-for-filenames
>> which is worth a read for more details if you're interested.
>>
>> In Subversion, this is a real problem for Mac users, because
>> two filenames which only differ in their NFC/NFD encoding
>> look exactly the same to the user (a u umlaut is printed),
>> while the byte streams do not match ("We're sorry, but your
>> file x does not exist in the repository!", where x looks just
>> like a file that is clearly visible in the repository listing :)
>>
>> Subversion's problem now is that there are repositories out
>> there using filenames in either NFC, NFD, or mixed, and there
>> is no good way to reconcile the mess while staying backwards
>> compatible with existing clients, servers, working copies and
>> repositories. So Mac users are told to only use ASCII characters
>> in their filenames to prevent the problem (many users, especially
>> users who are not programmers, who store their photos or their
>> entire home directory or whatever in Subversion, are not happy
>> about this).
>>
>> This problem may not matter as much in case of CM3, but anyone
>> implementing UTF-8 support for CM3 should be aware of this issue
>> and not repeat the mistake the Subversion developers made at the
>> time! With UTF-8, do not rely on a filename to retain its encoding
>> as you passed it to the OS when requesting the filename from the
>> OS again.
>>
>> CM3 should pick either NFD or NFC as internal UTF-8 encoding, for
>> filenames only, or for all strings, whichever makes more sense.
>> And then stick to it, converting input/output as needed.
>>
>> Abstracting this problem away using a nice interface would probably
>> be the cleanest solution.
>>
>> Stefan
>
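
As a small illustration of the NFC/NFD point in Stefan's message, the following Python sketch uses the standard `unicodedata` module. The helper name `same_filename` and the choice of NFC as the comparison form are assumptions made for the example; the thread leaves open which form CM3 should pick.

```python
import unicodedata

composed = "\u00fc"      # ü as a single code point (NFC)
decomposed = "u\u0308"   # u followed by COMBINING DIAERESIS (NFD)


def same_filename(a: str, b: str, form: str = "NFC") -> bool:
    """Compare two names after normalising both to one chosen form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


# The two strings look identical when displayed but differ as data.
print(composed == decomposed)                  # False
print(len(composed.encode("utf-8")),
      len(decomposed.encode("utf-8")))         # 2 3
print(same_filename(composed, decomposed))     # True
```

Normalising at the boundary (on input from the OS and before comparison or storage) is what "pick one form and stick to it" amounts to in practice.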



