[M3devel] This disgusting TEXT business

Darko darko at darko.org
Sun Dec 21 22:07:21 CET 2008


The way to handle this would be to require an iterate function on each  
text representation which successively returned the next character in  
a string as a UNICHAR with defined characteristics. That would allow  
for comparison and transfer of strings.

The way I see it, TEXT should be revealed to be a subtype of the  
"TextObj" object that has methods that basically mirror the Text  
interface with some additions. The existing text representation code  
is adapted to this object (fairly simple) and new representations  
subclass that object. A new function is added to the Text interface that
allows the user to specify the default representation used when
creating new strings. This would be the typecode of the object to
allocate, which must of course be a subtype of TextObj. Some TextObj
methods would be required to be implemented by the representation but  
most could be handled by the ancestor. Many of the methods could be  
implemented by the encoding for more efficient handling, for instance  
when the text being compared or concatenated is the same representation.

UNICHAR = <enumeration containing all Unicode characters>;

TextObj = OBJECT METHODS

   length(t: T): CARDINAL;
     (* required to be implemented by the encoding *)
   empty(t: T): BOOLEAN;
     (* required to be implemented by the encoding *)
   hasWideChars(t: T): BOOLEAN;
     (* required - TRUE if t contains anything other than CHAR values *)
   next(VAR index: INTEGER; VAR char: UNICHAR;
        seek: CARDINAL := 1): BOOLEAN;
     (* required - start iterating with index = 0; stores the next
        logical character in char and returns TRUE, or returns FALSE if
        that character doesn't exist. index is otherwise meaningless and
        private to the representation. seek allows skipping forward a
        number of characters. *)
   getData(): ADDRESS;
     (* required - get the raw data, only valid while on the stack *)
   setData(adr: ADDRESS; length: CARDINAL);
     (* required - set the data and length of the encoded data *)

   (* the remaining methods are optional for the encoding implementation *)
   equal(t, u: T): BOOLEAN;
   compare(t1, t2: T): [-1..1];
   cat(t, u: T): T;
   sub(t: T; start: CARDINAL; length: CARDINAL := LAST(CARDINAL)): T;
   hash(t: T): Word.T;
   getChar(t: T; i: CARDINAL): CHAR;
   getWideChar(t: T; i: CARDINAL): WIDECHAR;
   getUniChar(t: T; i: CARDINAL): UNICHAR;
   setChars(VAR a: ARRAY OF CHAR; t: T; start: CARDINAL := 0);
   setWideChars(VAR a: ARRAY OF WIDECHAR; t: T; start: CARDINAL := 0);
   setUniChars(VAR a: ARRAY OF UNICHAR; t: T; start: CARDINAL := 0);
   fromChar(ch: CHAR): T;
   fromWideChar(ch: WIDECHAR): T;
   fromUniChar(ch: UNICHAR): T;
   fromChars(READONLY a: ARRAY OF CHAR): T;
   fromWideChars(READONLY a: ARRAY OF WIDECHAR): T;
   fromUniChars(READONLY a: ARRAY OF UNICHAR): T;
   findChar(t: T; c: CHAR; start := 0): INTEGER;
   findWideChar(t: T; c: WIDECHAR; start := 0): INTEGER;
   findUniChar(t: T; c: UNICHAR; start := 0): INTEGER;
   findCharR(t: T; c: CHAR; start := LAST(INTEGER)): INTEGER;
   findWideCharR(t: T; c: WIDECHAR; start := LAST(INTEGER)): INTEGER;
   findUniCharR(t: T; c: UNICHAR; start := LAST(INTEGER)): INTEGER;
END;

Additionally, the Text interface and a couple of others (e.g. Rd, Wr)
would need to be expanded to handle UNICHAR.
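
To make the division of labour concrete, here is a minimal sketch in
Python (not Modula-3, and not any existing CM3 code). All names in it
(TextObj, Latin1Text, Utf16Text, code_points) are hypothetical; the
generator code_points stands in for the next method above. Each
representation supplies only length and the iterator, the ancestor
builds equal and compare on top of the iterator, and one representation
overrides equal with a fast path for the same-representation case.

```python
from abc import ABC, abstractmethod
from itertools import zip_longest
from typing import Iterator


class TextObj(ABC):
    """Hypothetical ancestor type; a sketch of the idea, not a real API."""

    @abstractmethod
    def length(self) -> int:
        """Required: number of logical characters."""

    @abstractmethod
    def code_points(self) -> Iterator[int]:
        """Required: yield each logical character as a Unicode code point."""

    # Optional methods: generic versions that work across representations.
    def equal(self, other: "TextObj") -> bool:
        missing = object()  # never equal to a code point, so lengths must match
        pairs = zip_longest(self.code_points(), other.code_points(),
                            fillvalue=missing)
        return all(a == b for a, b in pairs)

    def compare(self, other: "TextObj") -> int:
        # -1 as fill value makes a proper prefix sort before the longer text.
        for a, b in zip_longest(self.code_points(), other.code_points(),
                                fillvalue=-1):
            if a != b:
                return -1 if a < b else 1
        return 0


class Latin1Text(TextObj):
    """One byte per character (roughly the CHAR case)."""

    def __init__(self, s: str):
        self._data = s.encode("latin-1")

    def length(self) -> int:
        return len(self._data)

    def code_points(self) -> Iterator[int]:
        return iter(self._data)  # iterating bytes yields integers

    def equal(self, other: TextObj) -> bool:
        if isinstance(other, Latin1Text):  # same representation: compare raw data
            return self._data == other._data
        return super().equal(other)


class Utf16Text(TextObj):
    """16-bit code units; surrogate pairs are combined during iteration."""

    def __init__(self, s: str):
        raw = s.encode("utf-16-le")
        self._units = [int.from_bytes(raw[i:i + 2], "little")
                       for i in range(0, len(raw), 2)]

    def length(self) -> int:
        return sum(1 for _ in self.code_points())

    def code_points(self) -> Iterator[int]:
        units = iter(self._units)
        for u in units:
            if 0xD800 <= u <= 0xDBFF:  # high surrogate: consume the low one too
                low = next(units)
                yield 0x10000 + ((u - 0xD800) << 10) + (low - 0xDC00)
            else:
                yield u
```

With this, Latin1Text("für").equal(Utf16Text("für")) goes through the
generic iterator path, while comparing two Latin1Text values never
decodes anything.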



On 22/12/2008, at 12:40 AM, Stefan Sperling wrote:

> On Sun, Dec 21, 2008 at 08:08:57AM +0900, Darko wrote:
>> The right way to do this, IMNSHO is to not assume any particular
>> representation of TEXT values and create an implementation interface
>> that allows implementations of multiple text representations, much like
>> Rd and Wr don't make many assumptions about how data is actually stored
>> or retrieved.
>
> Such an interface may be needed for UTF-8 alone already, anyway,
> because within UTF-8 there is in some cases more than one way
> to store what amounts to the same data to a human user.
>
> In Subversion, from the beginning everyone agreed that the internal
> encoding for all strings would be UTF-8. Most Subversion APIs expect
> data in UTF-8. Strings (e.g. filenames) in the repository are stored
> in UTF-8, etc. Great! Will work in all countries! Right?
>
> Yes, but not on all operating systems if you're not careful!
> It did not occur to anyone at the time that there are characters
> which in Unicode have more than one representation (sequence of
> code points), and therefore more than one byte sequence in UTF-8.
> For example, a u with umlaut can be encoded as two code points or
> a single code point:
>
>  2 code points: [u | the previous character has an umlaut ]
>  (U+0075 U+0308, 3 bytes in UTF-8)
>  This is called "normal form decomposed".
>
>  1 code point: [u umlaut] (i.e. ü if you can see this on your terminal :)
>  (U+00FC, 2 bytes in UTF-8)
>  This is called "normal form composed".
>
> If you want to be portable, as CM3 and Subversion both try to be,
> you have to consider that some operating systems may return your
> filenames in a different encoding than you stored them in:
>
> --------
>            Accepts    Gives back
> MacOS X       *         NFD(*)
> Linux         *         <input>
> Windows       *         <input>
> Others        ?            ?
>
>
> *) There are some remarks to be made regarding full or partial
>  NFD here, but the essential thing is: If you send in NFC, don't
>  expect it back!
> -------- quoted from:
> http://svn.collab.net/repos/svn/trunk/notes/unicode-composition-for-filenames
> which is worth a read for more details if you're interested.
>
> In Subversion, this is a real problem for Mac users, because
> two filenames which only differ in their NFC/NFD encoding
> look exactly the same to the user (a u umlaut is printed),
> while the byte streams do not match ("We're sorry, but your
> file x does not exist in the repository!", where x looks just
> like a file that is clearly visible in the repository listing :)
>
> Subversion's problem now is that there are repositories out
> there using filenames in either NFC, NFD, or mixed, and there
> is no good way to reconcile the mess while staying backwards
> compatible with existing clients, servers, working copies and
> repositories. So Mac users are told to only use ASCII characters
> in their filenames to prevent the problem (many users, especially
> users who are not programmers, who store their photos or their
> entire home directory or whatever in Subversion, are not happy
> about this).
>
> This problem may not matter as much in case of CM3, but anyone
> implementing UTF-8 support for CM3 should be aware of this issue
> and not repeat the mistake the Subversion developers made at the
> time! With UTF-8, do not rely on a filename to retain its encoding
> as you passed it to the OS when requesting the filename from the
> OS again.
>
> CM3 should pick either NFD or NFC as internal UTF-8 encoding, for
> filenames only, or for all strings, whichever makes more sense.
> And then stick to it, converting input/output as needed.
>
> Abstracting this problem away using a nice interface would probably
> be the cleanest solution.
>
> Stefan
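
The composed/decomposed mismatch described in the quoted message can be
reproduced with a few lines of Python using the standard unicodedata
module. This is only an illustration of the issue, not anything from
Subversion or CM3; the helper name same_filename is made up here, and
normalizing both names to NFC before comparing is one possible policy.

```python
import unicodedata

composed = "\u00fc"     # U+00FC, u with diaeresis as a single code point (NFC)
decomposed = "u\u0308"  # U+0075 followed by U+0308 COMBINING DIAERESIS (NFD)


def same_filename(a: str, b: str, form: str = "NFC") -> bool:
    """Compare two names after normalizing both to the same form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


# The two strings render identically but differ as code points and as bytes.
print(composed == decomposed)                    # False
print(len(composed.encode("utf-8")))             # 2
print(len(decomposed.encode("utf-8")))           # 3
print(same_filename(composed, decomposed))       # True
```

A byte-wise comparison of the UTF-8 data reports the names as
different; picking one normal form and converting at the boundary makes
them compare equal.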



