[M3devel] UTF-8 TEXT

Rodney M. Bates rodney_bates at lcwb.coop
Fri Jul 6 19:54:32 CEST 2012



On 07/02/2012 10:57 AM, Tony Hosking wrote:
>
> On Jul 2, 2012, at 10:50 AM, Rodney Bates wrote:
>
>>
>>
>> -Rodney Bates
>>
>> --- antony.hosking at gmail.com wrote:
>>
>>> From: Antony Hosking <antony.hosking at gmail.com>
>>> To: "Rodney M. Bates" <rodney_bates at lcwb.coop>
>>> Cc: "m3devel at elegosoft.com" <m3devel at elegosoft.com>
>>> Subject: Re: [M3devel] UTF-8 TEXT
>>> Date: Thu, 28 Jun 2012 10:37:36 -0400
>>>
>>> Why not simply say that CHAR is an enumeration representing all of UTF-32?
>>> The current definition merely says that CHAR is an enumeration containing *at least* 256 elements.
>>> We would need to translate the current Latin-1 literals into UTF-32.
>>> And we could simply have a new literal form for Unicode literals.
>>>
>> This is almost what I would propose to do, with a couple of differences:
>>
>> Leave CHAR alone and fix WIDECHAR to handle the entire Unicode space.
>> I am sure there is lots of existing code that depends on the implementation
>> properties: ORD(FIRST(CHAR))=0, ORD(LAST(CHAR))=255, and BYTESIZE(CHAR)=1.
>
> Fair enough. Would we leave the encoding of CHAR as ISO-Latin-1? We’d still need translation from ISO-Latin-1 to UTF-8 wouldn’t we?
>

Yes.  The code points of Unicode and ISO-Latin-1 in the range 128..255 map to
the same characters (as they do in 0..127).  But the physical encoding is different.
ISO-Latin-1 is encoded one byte per character unconditionally.  When Unicode
is encoded in UTF-8, any code point 128 or more uses at least two bytes.
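
To make the difference concrete, here is a small illustration in Python (not
Modula-3, purely for demonstration).  The helper name encoded_lengths is
made up for this sketch; it only uses Python's built-in codecs.

```python
def encoded_lengths(code_point):
    """Return (bytes in ISO-Latin-1, bytes in UTF-8) for one code point <= 255."""
    ch = chr(code_point)
    return len(ch.encode("latin-1")), len(ch.encode("utf-8"))

# Same code point, same character, different physical encoding.
for cp in (0x41, 0x7F, 0x80, 0xE9, 0xFF):
    latin1_len, utf8_len = encoded_lengths(cp)
    print(f"U+{cp:04X}: Latin-1 {latin1_len} byte(s), UTF-8 {utf8_len} byte(s)")
```

In ISO-Latin-1 the single byte value equals the code point; in UTF-8 that holds
only below 128.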

We need translations, but these belong in Wr/Rd and friends, which handle serial
streams.  In in-memory variables, WIDECHAR holds a Unicode code point, ARRAY OF WIDECHAR
would happen to be the same representation as UTF-32, and Text.T would abstract
away the internal representation.
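
As a rough sketch of where the translation would live, here is the idea in
Python rather than Modula-3: a filter that reads bytes sequentially from a
stream and hands out whole code points.  The function name code_points and its
parameters are hypothetical; this is not the Rd/Wr interface, only an analogy
to it.

```python
import codecs
import io

def code_points(byte_stream, encoding="utf-8", chunk_size=4096):
    """Sequentially decode a byte stream, yielding integer code points."""
    decoder = codecs.getincrementaldecoder(encoding)()
    while True:
        chunk = byte_stream.read(chunk_size)
        if not chunk:
            break
        # The decoder buffers incomplete multi-byte sequences between chunks.
        for ch in decoder.decode(chunk):
            yield ord(ch)
    # Flush; raises UnicodeDecodeError if the stream ended mid-character.
    for ch in decoder.decode(b"", final=True):
        yield ord(ch)

source = io.BytesIO("a\u00e9\u20ac\U0001F600".encode("utf-8"))
in_memory = list(code_points(source, chunk_size=1))
```

The resulting list plays the role of ARRAY OF WIDECHAR: one element per code
point, with no trace of the encoding the bytes arrived in.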

>> Then I would define, in the language itself, that WIDECHAR is Unicode, not
>> UTF-32. Thus ORD(LAST(WIDECHAR))=16_10FFFF. Then I would make it an
>> implementation characteristic that BYTESIZE(WIDECHAR)=4.
>
> I note this text from the Wikipedia entry for UTF-32:
>
>     Though a fixed number of bytes per code point appear convenient, it is not as useful as it appears. It makes truncation easier but not significantly so compared to UTF-8 <http://en.wikipedia.org/wiki/UTF-8> and UTF-16 <http://en.wikipedia.org/wiki/UTF-16>. It does not make it faster to find a particular offset in the string, as an "offset" can be measured in the fixed-size code units of any encoding. It does not make calculating the displayed width of a string easier except in limited cases, since even with a “fixed width” font there may be more than one code point per character position (combining marks <http://en.wikipedia.org/wiki/Combining_character>) or more than one character position per code point (for example CJK <http://en.wikipedia.org/wiki/CJK> ideographs). Combining marks mean editors cannot treat one code point as being the same as one unit for editing. Editors that limit themselves to left-to-right languages and precomposed characters
>     <http://en.wikipedia.org/wiki/Precomposed_character> can take advantage of fixed-sized code units, but such editors are unlikely to support non-BMP characters and thus can work equally well with 16-bit UTF-16 <http://en.wikipedia.org/wiki/UTF-16> encoding.
>
>
> Does this argue against WIDECHAR=UTF-32? Would we be better off simply saying WIDECHAR=UTF-16 and leaving things as they are? Yes, it would make the definition of WideCharAt a little odd, because the index would be defined in 16-bit units rather than UTF-16 glyph code-points.
>

No.  Keeping WIDECHAR at only 2^16 values does nothing to get us out of the morass we are
now in where every bit of character-manipulating code has to cope with different encodings
and/or variable-sized encodings.  If we make WIDECHAR capable of holding any Unicode code point,
then we have the possibility of dealing with characters in the same abstractions as we
originally had, and, with only an 8-bit character set, still do.  Specifically, we have a
variable type that holds any character, arrays thereof, and a very general functional style
package of strings thereof.  Library streams can handle encoding transformations, and most
code won't have to worry about them, beyond specifying once what encoding it wants.
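
A minimal sketch of "specify the encoding once", again in Python as a stand-in
and not as a proposal for the actual library interface.  The helper read_text
is hypothetical; io.TextIOWrapper and io.BytesIO are standard Python.

```python
import io

def read_text(raw_bytes, encoding):
    """Decode a whole byte buffer; the encoding is named here and nowhere else."""
    with io.TextIOWrapper(io.BytesIO(raw_bytes), encoding=encoding) as stream:
        return stream.read()

from_latin1 = read_text("caf\u00e9".encode("latin-1"), "latin-1")
from_utf8 = read_text("caf\u00e9".encode("utf-8"), "utf-8")

# Past this point the client code cannot tell which encoding the bytes had.
same_text = from_latin1 == from_utf8
last_char = from_utf8[3]
```

Everything after the read works on characters by index, which is the
abstraction I would like Text to keep.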

Of course, you could still always do low-level stuff like putting one UTF-8 code _unit_ into
a WIDECHAR or CHAR, having arrays or TEXTs thereof, and constantly fiddling with the encoding.
But this should not be required.
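
To show what working in code units looks like, here is a Python sketch (for
illustration only) that splits a string into 16-bit UTF-16 code units, which is
what a 2^16-valued WIDECHAR would have to hold.  The helper name
utf16_code_units is made up for this example.

```python
def utf16_code_units(text):
    """Return the list of 16-bit UTF-16 code units for a string."""
    data = text.encode("utf-16-be")  # big-endian, no byte-order mark
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]

text = "a\U0001F600"              # two code points, one outside the BMP
units = utf16_code_units(text)    # three code units: 'a' plus a surrogate pair
code_points = [ord(ch) for ch in text]
```

Index 1 of the code-unit list is half of a surrogate pair, not a character, so
every client that indexes has to know about the encoding.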

> By the way, if we did change WIDECHAR to an enumeration containing 16_110000 elements then the stored (memory) size of WIDECHAR would be 4 bytes given the current CM3 implementation of enumerations, which chooses a (non-PACKED) stored size of 1/2/4/8 bytes depending on the number of elements.
>

I have thought about making BYTESIZE(WIDECHAR) = 3, but that would at best trade
one group of problems for another.  In particular, applying ORD functions and doing
arithmetic on characters located in arrays (including those hidden inside Text) would
always involve repacking to get things aligned.  I would think we would at least want
to keep WIDECHAR scalars aligned.
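
A rough picture of the trade-off, sketched in Python with made-up helper names
(pack3, get3, pack4, get4).  Python hides alignment, so this only illustrates
the layout: a 3-byte element has to be reassembled from bytes on every access,
while a 4-byte element corresponds to a single aligned word.

```python
import struct

MAX_CODE_POINT = 0x10FFFF  # fits in 21 bits, so 3 bytes are enough

def pack3(code_points):
    """Store each code point in 3 little-endian bytes (unaligned elements)."""
    return b"".join(cp.to_bytes(3, "little") for cp in code_points)

def get3(buf, index):
    """Fetch element `index` by reassembling its 3 bytes."""
    start = 3 * index
    return int.from_bytes(buf[start:start + 3], "little")

def pack4(code_points):
    """Store each code point in a 4-byte little-endian word."""
    return struct.pack("<%dI" % len(code_points), *code_points)

def get4(buf, index):
    """Fetch element `index` as one 32-bit word."""
    return struct.unpack_from("<I", buf, 4 * index)[0]

sample = [0x41, 0xE9, 0x20AC, MAX_CODE_POINT]
packed3 = pack3(sample)   # 12 bytes
packed4 = pack4(sample)   # 16 bytes
```

The 3-byte form saves a quarter of the space at the cost of the repacking
described above.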

>>
>> On Jun 27, 2012, at 10:12 PM, Rodney M. Bates wrote:
>>
>>>
>>>
>>> On 06/27/2012 07:32 PM, Antony Hosking wrote:
>>>> So what do we do about 6-byte UTF-8 code points? They won't fit in WIDECHAR. Surely we should allow accessing a UTF-8 character as a CARDINAL and be done with it?
>>>>
>>>
>>> Absolutely. Except I think a better way is to make WIDECHAR big enough to hold all of
>>> Unicode.
>>>
>>>> Sent from my iPad
>>>>
>>>> On Jun 27, 2012, at 3:20 PM, "Rodney M. Bates" <rodney_bates at lcwb.coop> wrote:
>>>>
>>>>>
>>>>>
>>>>> On 06/26/2012 10:30 PM, Hendrik Boom wrote:
>>>>>> On Tue, Jun 26, 2012 at 04:22:22PM -0400, Coleburn, Randy wrote:
>>>>>>> I seem to recall that Rodney did some work a while back relating to TEXT.
>>>>>>> Rodney, can you weigh in on some of this?
>>>>>>> --Randy Coleburn
>>>>>>>
>>>>>>> From: Dragiša Durić [mailto:dragisha at m3w.org]
>>>>>>> Sent: Tuesday, June 26, 2012 12:46 PM
>>>>>>> To: Jay
>>>>>>> Cc: m3devel
>>>>>>> Subject: EXT Re: [M3devel] AND (., 16_ff). Not serious - or so I hope!
>>>>>>>
>>>>>>> You had idea in other message. Store length!
>>>>>>>
>>>>>>> Another idea - store partial list of indices to character locations. So whatever one does, that list can be used/expanded. Whatever storage issues this makes, they are probably minor as compared to 32bit WIDECHAR for all idea.
>>>>>>
>>>>>> Most of the time, you don't need explicit integer indexes to character
>>>>>> locations. What you do need is an operation that fetches a character
>>>>>> given the string and its index (whatever data structure that index is),
>>>>>> and one that increments the index past that character. As long as you
>>>>>> can save an index and use it later on the same string, that's probably
>>>>>> all you ever need. And with a simple TEXT representation (such as the
>>>>>> obvious array of bytes containing characters of various widths) a byte
>>>>>> index is all you need (note: NOT a character index). It's easy even to
>>>>>> use TEXT and its integer indices as the data representation, as long as
>>>>>> you use the proper functions to parse the characters and increment the
>>>>>> indices by amounts that might differ from 1.
>>>>>>
>>>>>> And if your source code is represented in UTF-8, the representation that
>>>>>> requires little extra compiler effort to parse, your TEXT strings will
>>>>>> automagically appear in UTF-8.
>>>>>
>>>>> The original designers of the language and its libraries have given us
>>>>> two different abstractions for handling character strings (in addition
>>>>> to plain arrays.) 1) Text, and 2) Wr, Rd, and their cousins.
>>>>>
>>>>> Text is highly general and easy to use. Concatenations and substrings
>>>>> are easy. Semantics, to its clients, are value semantics, similar to INTEGER.
>>>>> Random access by *character* number is easy and, hopefully, implemented
>>>>> with efficiency at least better than O(n).
>>>>>
>>>>> Wr and friends restrict you to sequential access, at least mostly, but
>>>>> gain implementation convenience and efficiency as a result.
>>>>>
>>>>> I feel very strongly that we should *not* take away the full generality
>>>>> of Text, especially efficient random access, to handle variable-length
>>>>> character encodings in strings. For these, let's make more friends of
>>>>> Wr and Rd, which already assume sequential access. For example, a
>>>>> filter pipe that sequentially reads a Text/Array/stream, applies a UTF-8
>>>>> interpretation to its bytes, and delivers a stream of Unicode characters,
>>>>> in variables of type WIDECHAR.
>>>>>
>>>>> Text should preserve the abstraction that it's a string of characters,
>>>>> generalized as it already is in cm3, to have type WIDECHAR, so they can be any
>>>>> Unicode character. The internal representation should, usually, not be
>>>>> of concern.
>>>>>
>>>>> Note that nowhere in Text are character values transferred between
>>>>> a Text.T and any form of I/O stream. In the Text abstraction, all
>>>>> characters go in and out of a Text.T in variables of type CHAR,
>>>>> WIDECHAR, and arrays thereof. IO, etc. is only done in streams,
>>>>> e.g., TextWr. We can easily add new variants of these that encode/decode
>>>>> by various rules.
>>>>>
>>>>> Of course, it is still valid to put a string of bytes in a Text.T and
>>>>> apply, e.g., UTF-8 interpretation yourself. But that's lower-level
>>>>> programming, and shouldn't confuse the abstraction.
>>>>>
>>>>>>
>>>>>> I can see a use for various wide characters -- the things you extract
>>>>>> from a TEXT by parsing bits of it, but none for anything
>>>>>> really new or complicated for wide TEXT.
>>>>>>
>>>>>> The only confusing thing is that the existing operations for extracting
>>>>>> bytes from TEXT have names that suggest they are extracting characters.
>>>>>>
>>>>>
>>>>> I think it's more than a suggestion. I think the abstraction clearly
>>>>> considers them characters. And it should stay that way. If you want,
>>>>> at a higher level of code, to treat them as bytes, that's fine, but the
>>>>> abstraction continues to view them as characters (which only you, the
>>>>> client, know is not really so.)
>>>>>
>>>>>> -- Hendrik
>>>>>>
>>>>
>>
>>
>>
>
>


