[M3devel] UTF-8 TEXT

Tony Hosking hosking at cs.purdue.edu
Mon Jul 2 17:57:14 CEST 2012


On Jul 2, 2012, at 10:50 AM, Rodney Bates wrote:

> 
> 
> -Rodney Bates
> 
> --- antony.hosking at gmail.com wrote:
> 
>> From: Antony Hosking <antony.hosking at gmail.com>
>> To: "Rodney M. Bates" <rodney_bates at lcwb.coop>
>> Cc: "m3devel at elegosoft.com" <m3devel at elegosoft.com>
>> Subject: Re: [M3devel] UTF-8 TEXT
>> Date: Thu, 28 Jun 2012 10:37:36 -0400
>> 
>> Why not simply say that CHAR is an enumeration representing all of UTF-32?
>> The current definition merely says that CHAR is an enumeration containing *at least* 256 elements.
>> We would need to translate the current Latin-1 literals into UTF-32.
>> And we could simply have a new literal form for Unicode literals.
>> 
> This is almost what I would propose to do, with a couple of differences:
> 
> Leave CHAR alone and fix WIDECHAR to handle the entire Unicode space.
> I am sure there is lots of existing code that depends on the implementation
> properties: ORD(FIRST(CHAR))=0, ORD(LAST(CHAR))=255, and BYTESIZE(CHAR)=1.

Fair enough.  Would we leave the encoding of CHAR as ISO-Latin-1?  We’d still need translation from ISO-Latin-1 to UTF-8 wouldn’t we?
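
To illustrate the translation in question, here is a minimal sketch in Python (purely illustrative, not CM3 code; the function name latin1_to_utf8 is hypothetical). It relies on the fact that every ISO-Latin-1 byte value equals its Unicode code point, so bytes below 0x80 pass through unchanged and bytes from 0x80 to 0xFF each become a two-byte UTF-8 sequence.

```python
def latin1_to_utf8(data):
    """Translate ISO-Latin-1 bytes to UTF-8 bytes (hypothetical helper)."""
    out = bytearray()
    for b in data:
        if b < 0x80:
            # ASCII range: identical in Latin-1 and UTF-8.
            out.append(b)
        else:
            # Code points 0x80..0xFF need two UTF-8 bytes: 110xxxxx 10xxxxxx.
            out.append(0xC0 | (b >> 6))
            out.append(0x80 | (b & 0x3F))
    return bytes(out)
```

So a CHAR-based string stays one byte per character in memory, and only grows when it is written out as UTF-8.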

> Then I would define, in the language itself, that WIDECHAR is Unicode, not
> UTF-32.  Thus ORD(LAST(WIDECHAR))=16_10FFFF. Then I would make it an
> implementation characteristic that BYTESIZE(WIDECHAR)=4.

I note this text from the Wikipedia entry for UTF-32:

Though a fixed number of bytes per code point appear convenient, it is not as useful as it appears. It makes truncation easier but not significantly so compared to UTF-8 and UTF-16. It does not make it faster to find a particular offset in the string, as an "offset" can be measured in the fixed-size code units of any encoding. It does not make calculating the displayed width of a string easier except in limited cases, since even with a “fixed width” font there may be more than one code point per character position (combining marks) or more than one character position per code point (for example CJK ideographs). Combining marks mean editors cannot treat one code point as being the same as one unit for editing. Editors that limit themselves to left-to-right languages and precomposed characters can take advantage of fixed-sized code units, but such editors are unlikely to support non-BMP characters and thus can work equally well with 16-bit UTF-16 encoding.

Does this argue against WIDECHAR=UTF-32?  Would we be better off simply saying WIDECHAR=UTF-16 and leaving things as they are?  Yes, it would make the definition of WideCharAt a little odd, because the index would be defined in 16-bit units rather than UTF-16 glyph code-points.
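
The oddity of indexing in 16-bit units can be sketched as follows, in Python for illustration only (this is not the CM3 Text interface; to_utf16_units and code_point_at are hypothetical names). An index counts UTF-16 code units, so a character outside the BMP occupies two index positions, and a client that wants whole code points has to combine the surrogate pair itself.

```python
def to_utf16_units(s):
    """Return the list of 16-bit UTF-16 code units for a Python string."""
    raw = s.encode("utf-16-le")
    return [int.from_bytes(raw[i:i + 2], "little") for i in range(0, len(raw), 2)]

def code_point_at(units, i):
    """Decode the code point starting at unit index i.

    Returns (code_point, next_index). An unpaired surrogate is returned
    as-is, which is an arbitrary choice made for this sketch.
    """
    u = units[i]
    if 0xD800 <= u <= 0xDBFF and i + 1 < len(units):
        lo = units[i + 1]
        if 0xDC00 <= lo <= 0xDFFF:
            # Combine high and low surrogates into one non-BMP code point.
            return 0x10000 + ((u - 0xD800) << 10) + (lo - 0xDC00), i + 2
    return u, i + 1
```

For the three-character string "a", U+1D11E, "b", the unit list has four entries: unit indices 1 and 2 both belong to the middle character, and "b" sits at unit index 3 rather than character index 2.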

By the way, if we did change WIDECHAR to an enumeration containing 16_110000 elements then the stored (memory) size of WIDECHAR would be 4 bytes given the current CM3 implementation of enumerations, which chooses a (non-PACKED) stored size of 1/2/4/8 bytes depending on the number of elements.
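
As a rough check of that arithmetic, here is a sketch in Python (illustrative only; enum_stored_bytes is a hypothetical name, and the rule "smallest of 1/2/4/8 bytes that can hold every ordinal" is my assumption about how the size is chosen, not a quotation of the compiler source).

```python
def enum_stored_bytes(n_elements):
    """Pick a stored size for an enumeration with n_elements values.

    Assumed rule: the smallest of 1, 2, 4, or 8 bytes whose unsigned
    range covers the ordinals 0 .. n_elements - 1.
    """
    if n_elements < 1:
        raise ValueError("an enumeration needs at least one element")
    for size in (1, 2, 4, 8):
        if n_elements <= 1 << (8 * size):
            return size
    raise ValueError("too many elements for an 8-byte representation")
```

Under that assumption, 256 elements (CHAR) need 1 byte, 65536 elements (the current 16-bit WIDECHAR) need 2, and 16_110000 elements need 4.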

> 
> On Jun 27, 2012, at 10:12 PM, Rodney M. Bates wrote:
> 
>> 
>> 
>> On 06/27/2012 07:32 PM, Antony Hosking wrote:
>>> So what do we do about 6-byte UTF-8 code points?  They won't fit in WIDECHAR.  Surely we should allow accessing a UTF-8 character as a CARDINAL and be done with it?
>>> 
>> 
>> Absolutely.  Except I think a better way is to make WIDECHAR big enough to hold all of
>> Unicode.
>> 
>>> Sent from my iPad
>>> 
>>> On Jun 27, 2012, at 3:20 PM, "Rodney M. Bates"<rodney_bates at lcwb.coop>  wrote:
>>> 
>>>> 
>>>> 
>>>> On 06/26/2012 10:30 PM, Hendrik Boom wrote:
>>>>> On Tue, Jun 26, 2012 at 04:22:22PM -0400, Coleburn, Randy wrote:
>>>>>> I seem to recall that Rodney did some work a while back relating to TEXT.
>>>>>> Rodney, can you weigh in on some of this?
>>>>>> --Randy Coleburn
>>>>>> 
>>>>>> From: Dragiša Durić [mailto:dragisha at m3w.org]
>>>>>> Sent: Tuesday, June 26, 2012 12:46 PM
>>>>>> To: Jay
>>>>>> Cc: m3devel
>>>>>> Subject: EXT Re: [M3devel] AND (., 16_ff). Not serious - or so I hope!
>>>>>> 
>>>>>> You had idea in other message. Store length!
>>>>>> 
>>>>>> Another idea - store partial list of indices to character locations. So whatever one does, that list can be used/expanded. Whatever storage issues this makes, they are probably minor as compared to 32bit WIDECHAR for all idea.
>>>>> 
>>>>> Most of the time, you don't need explicit integer indexes to character
>>>>> locations.  What you do need is an operation that fetches a character
>>>>> given the string and its index (whatever data structure that index is),
>>>>> and  one that increments the index past that character.  As long as you
>>>>> can save an index and use it later on the same string, that's probably
>>>>> all you ever need.  And with a simple TEXT representation (such as the
>>>>> obvious array of bytes containing characters of various widths) a byte
>>>>> index is all you need (note: NOT a character index).  It's easy even to
>>>>> use TEXT and its integer indices as the data representation, as long as
>>>>> you use the proper functions to parse the characters and increment the
>>>>> indices by amounts that might differ from 1.
>>>>> 
>>>>> And if your source code is represented in UTF-8, the representation that
>>>>> requires little extra compiler effort to parse,  your TEXT strings will
>>>>> automagically appear in UTF-8.
>>>> 
>>>> The original designers of the language and its libraries have given us
>>>> two different abstractions for handling character strings (in addition
>>>> to plain arrays.)  1) Text, and 2) Wr, Rd, and their cousins.
>>>> 
>>>> Text is highly general and easy to use.  Concatenations and substrings
>>>> are easy.  Semantics, to its clients, are value semantics, similar to INTEGER.
>>>> Random access by *character* number is easy and, hopefully, implemented
>>>> with efficiency at least better than O(n).
>>>> 
>>>> Wr and friends restrict you to sequential access, at least mostly, but
>>>> gain implementation convenience and efficiency as a result.
>>>> 
>>>> I feel very strongly that we should *not* take away the full generality
>>>> of Text, especially efficient random access, to handle variable-length
>>>> character encodings in strings.  For these, let's make more friends of
>>>> Wr and Rd, which already assume sequential access.  For example, a
>>>> filter pipe that sequentially reads a Text/Array/stream, applies a UTF-8
>>>> interpretation to its bytes, and delivers a stream of Unicode characters,
>>>> in variables of type WIDECHAR.
>>>> 
>>>> Text should preserve the abstraction that it's a string of characters,
>>>> generalized as it already is in cm3, to have type WIDECHAR, so they can be any
>>>> Unicode character.  The internal representation should, usually, not be
>>>> of concern.
>>>> 
>>>> Note that nowhere in Text are character values transferred between
>>>> a Text.T and any form of I/O stream.  In the Text abstraction, all
>>>> characters go in and out of a Text.T in variables of type CHAR,
>>>> WIDECHAR, and arrays thereof.  IO, etc. is only done in streams,
>>>> e.g., TextWr.  We can easily add new variants of these that encode/decode
>>>> by various rules.
>>>> 
>>>> Of course, it is still valid to put a string of bytes in a Text.T and
>>>> apply, e.g., UTF-8 interpretation yourself.  But that's lower-level
>>>> programming, and shouldn't confuse the abstraction.
>>>> 
>>>>> 
>>>>> I can see a use for various wide characters -- the things you extract
>>>>> from a TEXT by parsing bits of it, but none for anything
>>>>> really new complicated for wide TEXT.
>>>>> 
>>>>> The only confusing thing is that the existing operations for extracting
>>>>> bytes from TEXT have names that suggest they are extracting characters.
>>>>> 
>>>> 
>>>> I think it's more than a suggestion.  I think the abstraction clearly
>>>> considers them characters.  And it should stay that way.  If you want,
>>>> at a higher level of code, to treat them as bytes, that's fine, but the
>>>> abstraction continues to view them as characters (which only you, the
>>>> client, know is not really so.)
>>>> 
>>>>> -- Hendrik
>>>>> 
>>> 
> 
> 
> 

