[M3devel] UTF-8 TEXT
Rodney Bates
rodney_bates at lcwb.coop
Mon Jul 2 16:50:18 CEST 2012
-Rodney Bates
--- antony.hosking at gmail.com wrote:
From: Antony Hosking <antony.hosking at gmail.com>
To: "Rodney M. Bates" <rodney_bates at lcwb.coop>
Cc: "m3devel at elegosoft.com" <m3devel at elegosoft.com>
Subject: Re: [M3devel] UTF-8 TEXT
Date: Thu, 28 Jun 2012 10:37:36 -0400
Why not simply say that CHAR is an enumeration representing all of UTF-32?
The current definition merely says that CHAR is an enumeration containing *at least* 256 elements.
We would need to translate the current Latin-1 literals into UTF-32.
And we could simply have a new literal form for Unicode literals.
This is almost what I would propose to do, with a couple of differences:
Leave CHAR alone and fix WIDECHAR to handle the entire Unicode space.
I am sure there is lots of existing code that depends on the implementation
properties: ORD(FIRST(CHAR))=0, ORD(LAST(CHAR))=255, and BYTESIZE(CHAR)=1.
Then I would define, in the language itself, that WIDECHAR is Unicode, not
UTF-32. Thus ORD(LAST(WIDECHAR))=16_10FFFF. Then I would make it an
implementation characteristic that BYTESIZE(WIDECHAR)=4.
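As a quick sanity check on those numbers, here is a minimal sketch in Python (used only because it is easy to run; nothing here is cm3 code). The constant names and the helper bytes_needed are hypothetical, and the helper assumes storage sizes are restricted to 1, 2, 4, or 8 bytes.

```python
# Hypothetical constants mirroring the ranges proposed above.
CHAR_LAST = 0xFF          # ORD(LAST(CHAR)) = 255
WIDECHAR_LAST = 0x10FFFF  # ORD(LAST(WIDECHAR)) = 16_10FFFF


def bytes_needed(max_ord: int) -> int:
    """Return the smallest size in {1, 2, 4, 8} bytes whose unsigned
    range can hold every ordinal from 0 to max_ord (assumes max_ord < 2**64)."""
    size = 1
    while max_ord >= 1 << (8 * size):
        size *= 2
    return size


if __name__ == "__main__":
    print(bytes_needed(CHAR_LAST))      # size for the existing CHAR range
    print(bytes_needed(0xFFFF))         # size for a 16-bit WIDECHAR range
    print(bytes_needed(WIDECHAR_LAST))  # size for the full Unicode range
```

The last Unicode code point needs 21 bits, so it does not fit in 2 bytes and a 4-byte WIDECHAR is the natural choice. Its UTF-8 encoding is 4 bytes long, since UTF-8 as currently standardized (RFC 3629) stops at 16_10FFFF.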
On Jun 27, 2012, at 10:12 PM, Rodney M. Bates wrote:
>
>
> On 06/27/2012 07:32 PM, Antony Hosking wrote:
>> So what do we do about 6-byte UTF-8 code points? They won't fit in WIDECHAR. Surely we should allow accessing a UTF-8 character as a CARDINAL and be done with it?
>>
>
> Absolutely. Except I think a better way is to make WIDECHAR big enough to hold all of
> Unicode.
>
>> Sent from my iPad
>>
>> On Jun 27, 2012, at 3:20 PM, "Rodney M. Bates"<rodney_bates at lcwb.coop> wrote:
>>
>>>
>>>
>>> On 06/26/2012 10:30 PM, Hendrik Boom wrote:
>>>> On Tue, Jun 26, 2012 at 04:22:22PM -0400, Coleburn, Randy wrote:
>>>>> I seem to recall that Rodney did some work a while back relating to TEXT.
>>>>> Rodney, can you weigh in on some of this?
>>>>> --Randy Coleburn
>>>>>
>>>>> From: Dragiša Durić [mailto:dragisha at m3w.org]
>>>>> Sent: Tuesday, June 26, 2012 12:46 PM
>>>>> To: Jay
>>>>> Cc: m3devel
>>>>> Subject: EXT Re: [M3devel] AND (., 16_ff). Not serious - or so I hope!
>>>>>
>>>>> You had an idea in another message. Store the length!
>>>>>
>>>>> Another idea: store a partial list of indices to character locations, so that whatever one does, the list can be used or expanded. Whatever storage issues this creates, they are probably minor compared to the idea of a 32-bit WIDECHAR for everything.
>>>>
>>>> Most of the time, you don't need explicit integer indexes to character
>>>> locations. What you do need is an operation that fetches a character
>>>> given the string and its index (whatever data structure that index is),
>>>> and one that increments the index past that character. As long as you
>>>> can save an index and use it later on the same string, that's probably
>>>> all you ever need. And with a simple TEXT representation (such as the
>>>> obvious array of bytes containing characters of various widths) a byte
>>>> index is all you need (note: NOT a character index). It's easy even to
>>>> use TEXT and its integer indices as the data representation, as long as
>>>> you use the proper functions to parse the characters and increment the
>>>> indices by amounts that might differ from 1.
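A minimal sketch of the two operations described above: fetch the character at a byte index, and step the index past it. It is written in Python for easy experimentation, not as cm3 library code. The names utf8_width, char_at, next_index, and iter_chars are hypothetical, and the input is assumed to be well-formed UTF-8 with every index on a character boundary.

```python
from typing import Iterator, Tuple


def utf8_width(lead: int) -> int:
    """Number of bytes in the UTF-8 sequence that starts with byte `lead`."""
    if lead < 0x80:
        return 1
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    raise ValueError(f"invalid UTF-8 lead byte: {lead:#x}")


def char_at(data: bytes, index: int) -> str:
    """Fetch the character whose encoding starts at byte `index`."""
    width = utf8_width(data[index])
    return data[index:index + width].decode("utf-8")


def next_index(data: bytes, index: int) -> int:
    """Advance a byte index past the character that starts at `index`."""
    return index + utf8_width(data[index])


def iter_chars(data: bytes) -> Iterator[Tuple[int, str]]:
    """Yield (byte index, character) pairs in sequential order."""
    index = 0
    while index < len(data):
        yield index, char_at(data, index)
        index = next_index(data, index)
```

A saved byte index stays valid for later use on the same string, but indices advance by 1 to 4 bytes per character, so the n-th character cannot be located without scanning from a known boundary.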
>>>>
>>>> And if your source code is represented in UTF-8, the representation that
>>>> requires little extra compiler effort to parse, your TEXT strings will
>>>> automagically appear in UTF-8.
>>>
>>> The original designers of the language and its libraries have given us
>>> two different abstractions for handling character strings (in addition
>>> to plain arrays): 1) Text, and 2) Wr, Rd, and their cousins.
>>>
>>> Text is highly general and easy to use. Concatenations and substrings
>>> are easy. Semantics, to its clients, are value semantics, similar to INTEGER.
>>> Random access by *character* number is easy and, hopefully, implemented
>>> with efficiency at least better than O(n).
>>>
>>> Wr and friends restrict you to sequential access, at least mostly, but
>>> gain implementation convenience and efficiency as a result.
>>>
>>> I feel very strongly that we should *not* take away the full generality
>>> of Text, especially efficient random access, to handle variable-length
>>> character encodings in strings. For these, let's make more friends of
>>> Wr and Rd, which already assume sequential access. For example, a
>>> filter pipe that sequentially reads a Text/Array/stream, applies a UTF-8
>>> interpretation to its bytes, and delivers a stream of Unicode characters,
>>> in variables of type WIDECHAR.
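The filter described above can be sketched as a sequential reader that consumes bytes and delivers one code point at a time. This is a rough illustration in Python, not a proposed Rd interface: the name unicode_reader and the chunk_size parameter are hypothetical, while codecs.getincrementaldecoder is the standard-library decoder that buffers a multi-byte sequence split across reads.

```python
import codecs
import io
from typing import BinaryIO, Iterator


def unicode_reader(source: BinaryIO, chunk_size: int = 4096) -> Iterator[int]:
    """Read bytes sequentially from `source`, interpret them as UTF-8, and
    yield each character's code point (the value a WIDECHAR would hold)."""
    decoder = codecs.getincrementaldecoder("utf-8")()
    while True:
        chunk = source.read(chunk_size)
        final = not chunk  # an empty read means end of stream
        # With final=True the decoder raises if a sequence was left incomplete.
        for ch in decoder.decode(chunk, final=final):
            yield ord(ch)
        if final:
            break


if __name__ == "__main__":
    stream = io.BytesIO("h\u00e9\U0001F600".encode("utf-8"))
    for code_point in unicode_reader(stream, chunk_size=1):
        print(f"16_{code_point:X}")
```

The consumer only ever sees whole characters, and the variable-length encoding stays inside the filter, which is where sequential access is already assumed.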
>>>
>>> Text should preserve the abstraction that it's a string of characters,
>>> generalized as it already is in cm3, to have type WIDECHAR, so they can be any
>>> Unicode character. The internal representation should, usually, not be
>>> of concern.
>>>
>>> Note that nowhere in Text are character values transferred between
>>> a Text.T and any form of I/O stream. In the Text abstraction, all
>>> characters go in and out of a Text.T in variables of type CHAR,
>>> WIDECHAR, and arrays thereof. IO, etc. is only done in streams,
>>> e.g., TextWr. We can easily add new variants of these that encode/decode
>>> by various rules.
>>>
>>> Of course, it is still valid to put a string of bytes in a Text.T and
>>> apply, e.g., UTF-8 interpretation yourself. But that's lower-level
>>> programming, and shouldn't confuse the abstraction.
>>>
>>>>
>>>> I can see a use for various wide characters -- the things you extract
>>>> from a TEXT by parsing bits of it, but none for anything
>>>> really new or complicated for wide TEXT.
>>>>
>>>> The only confusing thing is that the existing operations for extracting
>>>> bytes from TEXT have names that suggest they are extracting characters.
>>>>
>>>
>>> I think it's more than a suggestion. I think the abstraction clearly
>>> considers them characters. And it should stay that way. If you want,
>>> at a higher level of code, to treat them as bytes, that's fine, but the
>>> abstraction continues to view them as characters (which only you, the
>>> client, know is not really so.)
>>>
>>>> -- Hendrik
>>>>
>>