[M3devel] UTF-8 TEXT

Mon Jul 2 16:50:18 CEST 2012

-Rodney Bates

--- antony.hosking at gmail.com wrote:

From: Antony Hosking <antony.hosking at gmail.com>
To: "Rodney M. Bates" <rodney_bates at lcwb.coop>
Cc: "m3devel at elegosoft.com" <m3devel at elegosoft.com>
Subject: Re: [M3devel] UTF-8 TEXT
Date: Thu, 28 Jun 2012 10:37:36 -0400

Why not simply say that CHAR is an enumeration representing all of UTF-32?
The current definition merely says that CHAR is an enumeration containing *at least* 256 elements.
We would need to translate the current Latin-1 literals into UTF-32.
And we could simply have a new literal form for Unicode literals.

This is almost what I would propose to do, with a couple of differences:

Leave CHAR alone and fix WIDECHAR to handle the entire Unicode space.
I am sure there is lots of existing code that depends on the implementation
properties: ORD(FIRST(CHAR))=0, ORD(LAST(CHAR))=255, and BYTESIZE(CHAR)=1.

Then I would define, in the language itself, that WIDECHAR is Unicode, not
UTF-32.  Thus ORD(LAST(WIDECHAR))=16_10FFFF. Then I would make it an
implementation characteristic that BYTESIZE(WIDECHAR)=4.
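
As a rough check of the numbers above, here is a small sketch in Python rather than Modula-3. The names UNICODE_LAST and fits are made up for illustration and are not part of any cm3 interface. It shows that the last Unicode code point needs 21 bits, so a 2-byte WIDECHAR cannot hold it, while 4 bytes (the natural aligned size above the 3-byte minimum) can.

```python
# Illustrative only: these names are hypothetical, not cm3 definitions.
UNICODE_LAST = 0x10FFFF  # written 16_10FFFF in Modula-3

# Number of bits, then whole bytes, needed to represent the last code point.
bits_needed = UNICODE_LAST.bit_length()
bytes_needed = (bits_needed + 7) // 8


def fits(code_point, byte_size):
    """True if code_point is representable as an unsigned value of byte_size bytes."""
    return 0 <= code_point < 256 ** byte_size


print(bits_needed, bytes_needed)
print(fits(UNICODE_LAST, 2), fits(UNICODE_LAST, 4))
```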

On Jun 27, 2012, at 10:12 PM, Rodney M. Bates wrote:

> 
> 
> On 06/27/2012 07:32 PM, Antony Hosking wrote:
>> So what do we do about 6-byte UTF-8 code points?  They won't fit in WIDECHAR.  Surely we should allow accessing a UTF-8 character as a CARDINAL and be done with it?
>> 
> 
> Absolutely.  Except I think a better way is to make WIDECHAR big enough to hold all of
> Unicode.
> 
>> 
>> On Jun 27, 2012, at 3:20 PM, "Rodney M. Bates"<rodney_bates at lcwb.coop>  wrote:
>> 
>>> 
>>> 
>>> On 06/26/2012 10:30 PM, Hendrik Boom wrote:
>>>> On Tue, Jun 26, 2012 at 04:22:22PM -0400, Coleburn, Randy wrote:
>>>>> I seem to recall that Rodney did some work a while back relating to TEXT.
>>>>> Rodney, can you weigh in on some of this?
>>>>> --Randy Coleburn
>>>>> 
>>>>> From: Dragiša Durić [mailto:dragisha at m3w.org]
>>>>> Sent: Tuesday, June 26, 2012 12:46 PM
>>>>> To: Jay
>>>>> Cc: m3devel
>>>>> Subject: EXT Re: [M3devel] AND (., 16_ff). Not serious - or so I hope!
>>>>> 
>>>>> You had an idea in another message. Store the length!
>>>>> 
>>>>> Another idea - store a partial list of indices to character locations. Whatever one does, that list can be used or expanded. Whatever storage issues this creates, they are probably minor compared to the idea of using a 32-bit WIDECHAR for everything.
>>>> 
>>>> Most of the time, you don't need explicit integer indexes to character
>>>> locations.  What you do need is an operation that fetches a character
>>>> given the string and its index (whatever data structure that index is),
>>>> and  one that increments the index past that character.  As long as you
>>>> can save an index and use it later on the same string, that's probably
>>>> all you ever need.  And with a simple TEXT representation (such as the
>>>> obvious array of bytes containing characters of various widths) a byte
>>>> index is all you need (note: NOT a character index).  It's easy even to
>>>> use TEXT and its integer indices as the data representation, as long as
>>>> you use the proper functions to parse the characters and increment the
>>>> indices by amounts that might differ from 1.
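
The two operations described here (fetch the character at a byte index, then step the index past it) might look roughly like the sketch below. It is written in Python over a plain byte string, not against the Modula-3 Text interface, and utf8_len, fetch, and advance are hypothetical names. It assumes well-formed UTF-8 and an index that always sits on the first byte of a character.

```python
def utf8_len(lead):
    """Length in bytes of the UTF-8 sequence whose first byte is `lead`."""
    if lead < 0x80:
        return 1
    if lead >> 5 == 0b110:
        return 2
    if lead >> 4 == 0b1110:
        return 3
    if lead >> 3 == 0b11110:
        return 4
    raise ValueError("not a UTF-8 lead byte")


def fetch(data, i):
    """Return the code point of the character starting at byte index i."""
    n = utf8_len(data[i])
    return ord(data[i:i + n].decode("utf-8"))


def advance(data, i):
    """Return the byte index of the character following the one at i."""
    return i + utf8_len(data[i])


# 'a' (1 byte), e-acute (2 bytes), euro sign (3 bytes): 6 bytes in total.
data = "a\u00e9\u20ac".encode("utf-8")

points = []
i = 0
while i < len(data):
    points.append(fetch(data, i))
    i = advance(data, i)  # the step may differ from 1
```

Note that i is a byte index throughout. It can be saved and reused on the same string, but it is not a character count.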
>>>> 
>>>> And if your source code is represented in UTF-8, the representation that
>>>> requires little extra compiler effort to parse,  your TEXT strings will
>>>> automagically appear in UTF-8.
>>> 
>>> The original designers of the language and its libraries have given us
>>> two different abstractions for handling character strings (in addition
>>> to plain arrays.)  1) Text, and 2) Wr, Rd, and their cousins.
>>> 
>>> Text is highly general and easy to use.  Concatenations and substrings
>>> are easy.  Semantics, to its clients, are value semantics, similar to INTEGER.
>>> Random access by *character* number is easy and, hopefully, implemented
>>> with efficiency at least better than O(n).
>>> 
>>> Wr and friends restrict you to sequential access, at least mostly, but
>>> gain implementation convenience and efficiency as a result.
>>> 
>>> I feel very strongly that we should *not* take away the full generality
>>> of Text, especially efficient random access, to handle variable-length
>>> character encodings in strings.  For these, let's make more friends of
>>> Wr and Rd, which already assume sequential access.  For example, a
>>> filter pipe that sequentially reads a Text/Array/stream, applies a UTF-8
>>> interpretation to its bytes, and delivers a stream of Unicode characters,
>>> in variables of type WIDECHAR.
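
A filter of this kind could be sketched as follows. It is in Python, not Modula-3, and it is not an existing Rd or Wr variant; unicode_chars and its chunk_size parameter are invented for the example. It reads bytes strictly sequentially, decodes them as UTF-8 with the standard library's incremental decoder, and delivers one code point at a time, which is the role WIDECHAR variables would play.

```python
import codecs
import io


def unicode_chars(byte_stream, chunk_size=4):
    """Yield Unicode code points decoded from a sequential stream of UTF-8 bytes."""
    decoder = codecs.getincrementaldecoder("utf-8")()
    while True:
        chunk = byte_stream.read(chunk_size)
        if not chunk:
            # End of input: raises UnicodeDecodeError if a sequence was cut short.
            decoder.decode(b"", final=True)
            return
        # The decoder buffers any incomplete trailing sequence until the next chunk.
        for ch in decoder.decode(chunk):
            yield ord(ch)


# Mixed 1-, 2-, and 4-byte characters, read four bytes at a time.
source = io.BytesIO("na\u00efve \U0001F600".encode("utf-8"))
code_points = list(unicode_chars(source))
```

The small chunk size is deliberate: it forces multi-byte characters to straddle reads, so the sketch shows that only sequential access is needed.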
>>> 
>>> Text should preserve the abstraction that it's a string of characters,
>>> generalized as it already is in cm3, to have type WIDECHAR, so they can be any
>>> Unicode character.  The internal representation should, usually, not be
>>> of concern.
>>> 
>>> Note that nowhere in Text are character values transferred between
>>> a Text.T and any form of I/O stream.  In the Text abstraction, all
>>> characters go in and out of a Text.T in variables of type CHAR,
>>> WIDECHAR, and arrays thereof.  IO, etc. is only done in streams,
>>> e.g., TextWr.  We can easily add new variants of these that encode/decode
>>> by various rules.
>>> 
>>> Of course, it is still valid to put a string of bytes in a Text.T and
>>> apply, e.g., UTF-8 interpretation yourself.  But that's lower-level
>>> programming, and shouldn't confuse the abstraction.
>>> 
>>>> 
>>>> I can see a use for various wide characters -- the things you extract
>>>> from a TEXT by parsing bits of it, but none for anything
>>>> really new or complicated for wide TEXT.
>>>> 
>>>> The only confusing thing is that the existing operations for extracting
>>>> bytes from TEXT have names that suggest they are extracting characters.
>>>> 
>>> 
>>> I think it's more than a suggestion.  I think the abstraction clearly
>>> considers them characters.  And it should stay that way.  If you want,
>>> at a higher level of code, to treat them as bytes, that's fine, but the
>>> abstraction continues to view them as characters (which only you, the
>>> client, know is not really so.)
>>> 
>>>> -- Hendrik
>>>> 
>>