[M3devel] This disgusting TEXT business
Darko
darko at darko.org
Mon Dec 22 01:14:16 CET 2008
A couple of corrections:
- T = TEXT and TextObj <: T so it should read TextObj = T OBJECT
- the cat method is required for the encoding implementation (helps to
actually be able to create strings)
- the CARDINAL in the next method should be INTEGER to enable
searching backwards (findCharR) and should probably be named 'iter'.
On 22/12/2008, at 6:07 AM, Darko wrote:
> The way to handle this would be to require an iterate function on
> each text representation which successively returned the next
> character in a string as a UNICHAR with defined characteristics.
> That would allow for comparison and transfer of strings.
>
> The way I see it, TEXT should be revealed to be a subtype of the
> "TextObj" object that has methods that basically mirror the Text
> interface with some additions. The existing text representation code
> is adapted to this object (fairly simple) and new representations
> subclass that object. A new function is added to text interface that
> allows the user to specify the default representation used when
> creating new strings. This would be the typecode of the object to
> allocate, required to be a subtype of TextObj of course. Some
> TextObj methods would be required to be implemented by the
> representation but most could be handled by the ancestor. Many of
> the methods could be implemented by the encoding for more efficient
> handling, for instance when the text being compared or concatenated
> is the same representation.
>
> UNICHAR = <enumeration containing all Unicode characters>;
>
> TextObj = OBJECT METHODS
>
> length(t: T): CARDINAL; (* required to be implemented by the encoding *)
> empty(t: T): BOOLEAN; (* required to be implemented by the encoding *)
> hasWideChars(t: T): BOOLEAN; (* required - meaning containing
>   anything other than CHAR values *)
> next(VAR index: INTEGER; VAR char: UNICHAR; seek: CARDINAL := 1): BOOLEAN;
>   (* required - start iterating with index=0; returns TRUE and the
>   next logical character in char, or FALSE if the char doesn't exist.
>   Index is otherwise meaningless and private. Seek allows skipping
>   forward a number of characters. *)
> getData(): ADDRESS; (* required - get the raw data, only valid
>   while on the stack *)
> setData(adr: ADDRESS; length: CARDINAL); (* required - set the data
>   and length of the encoded data *)
>
> (* the remaining methods are optional for the encoding
> implementation *)
> equal(t, u: T): BOOLEAN;
> compare(t1, t2: T): [-1..1];
> cat(t, u: T): T;
> sub(t: T; start: CARDINAL; length: CARDINAL := LAST(CARDINAL)): T;
> hash(t: T): Word.T;
> getChar(t: T; i: CARDINAL): CHAR;
> getWideChar(t: T; i: CARDINAL): WIDECHAR;
> getUniChar(t: T; i: CARDINAL): UNICHAR;
> setChars(VAR a: ARRAY OF CHAR; t: T; start: CARDINAL := 0);
> setWideChars(VAR a: ARRAY OF WIDECHAR; t: T; start: CARDINAL := 0);
> setUniChars(VAR a: ARRAY OF UNICHAR; t: T; start: CARDINAL := 0);
> fromChar(ch: CHAR): T;
> fromWideChar(ch: WIDECHAR): T;
> fromUniChar(ch: UNICHAR): T;
> fromChars(READONLY a: ARRAY OF CHAR): T;
> fromWideChars(READONLY a: ARRAY OF WIDECHAR): T;
> fromUniChars(READONLY a: ARRAY OF UNICHAR): T;
> findChar(t: T; c: CHAR; start := 0): INTEGER;
> findWideChar(t: T; c: WIDECHAR; start := 0): INTEGER;
> findUniChar(t: T; c: UNICHAR; start := 0): INTEGER;
> findCharR(t: T; c: CHAR; start := LAST(INTEGER)): INTEGER;
> findWideCharR(t: T; c: WIDECHAR; start := LAST(INTEGER)): INTEGER;
> findUniCharR(t: T; c: UNICHAR; start := LAST(INTEGER)): INTEGER;
> END;
>
> Additionally the Text and a couple of other interfaces (eg Rd, Wr)
> would need to be expanded to handle UniChar.
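
A minimal sketch of the design quoted above, written in Python rather than Modula-3 purely for brevity. Every name here (code_points, Latin1Text, Utf8Text) is hypothetical and not part of any CM3 interface. It assumes one required per-representation method that yields characters as Unicode code points, generic equal/compare in the ancestor built on that method, and an optional fast path when both texts share a representation.

```python
from abc import ABC, abstractmethod
from itertools import zip_longest
from typing import Iterator


class TextObj(ABC):
    """Hypothetical ancestor: one required method, generic fallbacks."""

    @abstractmethod
    def code_points(self) -> Iterator[int]:
        """Required: yield each logical character as a Unicode code point."""

    def length(self) -> int:
        # Generic fallback: count by iterating.
        return sum(1 for _ in self.code_points())

    def compare(self, other: "TextObj") -> int:
        # Walk both texts in lockstep; -1 pads the shorter one so that
        # a proper prefix sorts first.
        pairs = zip_longest(self.code_points(), other.code_points(), fillvalue=-1)
        for x, y in pairs:
            if x != y:
                return -1 if x < y else 1
        return 0

    def equal(self, other: "TextObj") -> bool:
        return self.compare(other) == 0


class Latin1Text(TextObj):
    """One byte per character; each byte value is the code point."""

    def __init__(self, data: bytes) -> None:
        self.data = data

    def code_points(self) -> Iterator[int]:
        return iter(self.data)

    def length(self) -> int:
        return len(self.data)  # cheaper than iterating


class Utf8Text(TextObj):
    """Variable-width representation; data is assumed to be valid UTF-8."""

    def __init__(self, data: bytes) -> None:
        self.data = data

    def code_points(self) -> Iterator[int]:
        for ch in self.data.decode("utf-8"):
            yield ord(ch)

    def equal(self, other: TextObj) -> bool:
        # Fast path when both sides use the same representation.
        if isinstance(other, Utf8Text):
            return self.data == other.data
        return super().equal(other)


a = Latin1Text("f\u00fcr".encode("latin-1"))
b = Utf8Text("f\u00fcr".encode("utf-8"))
print(a.equal(b), a.compare(b))
```

The two objects hold different bytes but compare equal, because comparison goes through the iteration method and not through the stored data.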
>
>
>
> On 22/12/2008, at 12:40 AM, Stefan Sperling wrote:
>
>> On Sun, Dec 21, 2008 at 08:08:57AM +0900, Darko wrote:
>>> The right way to do this, IMNSHO is to not assume any particular
>>> representation of TEXT values and create an implementation interface
>>> that allows implementations of multiple text representations, much
>>> like Rd and Wr don't make many assumptions about how data is
>>> actually stored or retrieved.
>>
>> Such an interface may be needed for UTF-8 alone already, anyway,
>> because within UTF-8 there is in some cases more than one way
>> to store what amounts to the same data to a human user.
>>
>> In Subversion, from the beginning everyone agreed that the internal
>> encoding for all strings would be UTF-8. Most Subversion APIs expect
>> data in UTF-8. Strings (e.g. filenames) in the repository are stored
>> in UTF-8, etc. Great! Will work in all countries! Right?
>>
>> Yes, but not on all operating systems if you're not careful!
>> It did not occur to anyone at the time that there are characters
>> which have more than one representation (sequence of code points)
>> in UTF-8. For example, a u with umlaut can be encoded as two
>> code points or as a single code point:
>>
>> 2 code points: [u | the previous character has an umlaut ]
>> This is called "normal form decomposed" (NFD).
>>
>> 1 code point: [u umlaut] (i.e. ü if you can see this on your terminal :)
>> This is called "normal form composed" (NFC).
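
The two forms described above can be inspected with Python's standard unicodedata module. This is only an illustration (Python, not Modula-3). It prints the code points and the UTF-8 byte count of each form: the composed form is one code point and the decomposed form is two, which come to 2 and 3 bytes in UTF-8 respectively.

```python
import unicodedata

composed = "\u00fc"      # U+00FC LATIN SMALL LETTER U WITH DIAERESIS
decomposed = "u\u0308"   # "u" followed by U+0308 COMBINING DIAERESIS

for label, s in (("NFC", composed), ("NFD", decomposed)):
    points = " ".join(f"U+{ord(c):04X}" for c in s)
    print(label, points, len(s.encode("utf-8")), "bytes in UTF-8")

# The strings differ as stored, yet each normalizes to the other.
print(composed == decomposed)
print(unicodedata.normalize("NFD", composed) == decomposed)
print(unicodedata.normalize("NFC", decomposed) == composed)
```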
>>
>> If you want to be portable, as CM3 and Subversion both try to be,
>> you have to consider that some operating systems may return your
>> filenames in a different encoding than you stored them in:
>>
>> --------
>>            Accepts   Gives back
>> MacOS X    *         NFD(*)
>> Linux      *         <input>
>> Windows    *         <input>
>> Others     ?         ?
>>
>>
>> *) There are some remarks to be made regarding full or partial
>> NFD here, but the essential thing is: If you send in NFC, don't
>> expect it back!
>> -------- quoted from:
>> http://svn.collab.net/repos/svn/trunk/notes/unicode-composition-for-filenames
>> which is worth a read for more details if you're interested.
>>
>> In Subversion, this is a real problem for Mac users, because
>> two filenames which only differ in their NFC/NFD encoding
>> look exactly the same to the user (a u umlaut is printed),
>> while the byte streams do not match ("We're sorry, but your
>> file x does not exist in the repository!", where x looks just
>> like a file that is clearly visible in the repository listing :)
>>
>> Subversion's problem now is that there are repositories out
>> there using filenames in either NFC, NFD, or mixed, and there
>> is no good way to reconcile the mess while staying backwards
>> compatible with existing clients, servers, working copies and
>> repositories. So Mac users are told to only use ASCII characters
>> in their filenames to prevent the problem (many users, especially
>> users who are not programmers, who store their photos or their
>> entire home directory or whatever in Subversion, are not happy
>> about this).
>>
>> This problem may not matter as much in case of CM3, but anyone
>> implementing UTF-8 support for CM3 should be aware of this issue
>> and not repeat the mistake the Subversion developers made at the
>> time! With UTF-8, do not rely on a filename coming back from the
>> OS in the same normalization form in which you passed it in.
>>
>> CM3 should pick either NFD or NFC as internal UTF-8 encoding, for
>> filenames only, or for all strings, whichever makes more sense.
>> And then stick to it, converting input/output as needed.
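
A rough sketch of "pick one form and convert at the boundary", again in Python for illustration only. The helper names (canonical_name, find_stored) are hypothetical, and NFC as the internal form is an arbitrary choice for the example. Both the name reported by the OS and the stored names are normalized before they are compared.

```python
import unicodedata
from typing import Iterable, Optional


def canonical_name(name: str, form: str = "NFC") -> str:
    """Convert a filename to the single form used internally (assumed NFC)."""
    return unicodedata.normalize(form, name)


def find_stored(reported: str, stored_names: Iterable[str]) -> Optional[str]:
    """Return the stored name matching what the OS reported, if any."""
    wanted = canonical_name(reported)
    for name in stored_names:
        if canonical_name(name) == wanted:
            return name
    return None


stored = ["M\u00fcnchen.txt"]      # stored in composed form
reported = "Mu\u0308nchen.txt"     # handed back in decomposed form

print(reported in stored)                      # naive lookup misses it
print(find_stored(reported, stored) is not None)
```

The naive membership test fails because the code point sequences differ, while the normalized lookup finds the file.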
>>
>> Abstracting this problem away using a nice interface would probably
>> be the cleanest solution.
>>
>> Stefan
>