[M3devel] UTF-8 TEXT

Rodney M. Bates rodney_bates at lcwb.coop
Thu Jun 28 04:12:26 CEST 2012



On 06/27/2012 07:32 PM, Antony Hosking wrote:
> So what do we do about 6-byte UTF-8 code points?  They won't fit in WIDECHAR.  Surely we should allow accessing a UTF-8 character as a CARDINAL and be done with it?
>

Absolutely.  Except I think a better way is to make WIDECHAR big enough to hold all of
Unicode.
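
For scale, a rough sketch of the sizes involved. It assumes the current UTF-8 definition (RFC 3629), which stops at four bytes per character; the five- and six-byte forms belonged to the older RFC 2279 definition and are no longer valid. The helper name utf8_length is made up for illustration and is not part of any Modula-3 library.

```python
# Sketch only: utf8_length is an illustrative helper, not an existing API.
MAX_CODE_POINT = 0x10FFFF  # highest code point Unicode defines


def utf8_length(code_point: int) -> int:
    """Return how many bytes UTF-8 (RFC 3629) uses for one code point.

    Surrogates (U+D800..U+DFFF) are not filtered out here; a real encoder
    would reject them.
    """
    if code_point < 0 or code_point > MAX_CODE_POINT:
        raise ValueError("not a Unicode code point")
    if code_point < 0x80:
        return 1
    if code_point < 0x800:
        return 2
    if code_point < 0x10000:
        return 3
    return 4


# Bits a WIDECHAR would need to hold any single Unicode code point.
bits_needed = MAX_CODE_POINT.bit_length()
```

Under these assumptions the largest code point needs more than 16 bits, so a 16-bit WIDECHAR cannot hold it, while a 32-bit one has room to spare.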

> Sent from my iPad
>
> On Jun 27, 2012, at 3:20 PM, "Rodney M. Bates"<rodney_bates at lcwb.coop>  wrote:
>
>>
>>
>> On 06/26/2012 10:30 PM, Hendrik Boom wrote:
>>> On Tue, Jun 26, 2012 at 04:22:22PM -0400, Coleburn, Randy wrote:
>>>> I seem to recall that Rodney did some work a while back relating to TEXT.
>>>> Rodney, can you weigh in on some of this?
>>>> --Randy Coleburn
>>>>
>>>> From: Dragiša Durić [mailto:dragisha at m3w.org]
>>>> Sent: Tuesday, June 26, 2012 12:46 PM
>>>> To: Jay
>>>> Cc: m3devel
>>>> Subject: EXT Re: [M3devel] AND (., 16_ff). Not serious - or so I hope!
>>>>
>>>> You had an idea in another message: store the length!
>>>>
>>>> Another idea: store a partial list of indices to character locations, so that whatever one does, that list can be used and expanded. Whatever storage issues this creates, they are probably minor compared to the idea of a 32-bit WIDECHAR for everything.
>>>
>>> Most of the time, you don't need explicit integer indexes to character
>>> locations.  What you do need is an operation that fetches a character
>>> given the string and its index (whatever data structure that index is),
>>> and  one that increments the index past that character.  As long as you
>>> can save an index and use it later on the same string, that's probably
>>> all you ever need.  And with a simple TEXT representation (such as the
>>> obvious array of bytes containing characters of various widths) a byte
>>> index is all you need (note: NOT a character index).  It's easy even to
>>> use TEXT and its integer indices as the data representation, as long as
>>> you use the proper functions to parse the characters and increment the
>>> indices by amounts that might differ from 1.
>>>
>>> And if your source code is represented in UTF-8, the representation that
>>> requires little extra compiler effort to parse,  your TEXT strings will
>>> automagically appear in UTF-8.
>>
>> The original designers of the language and its libraries have given us
>> two different abstractions for handling character strings (in addition
>> to plain arrays.)  1) Text, and 2) Wr, Rd, and their cousins.
>>
>> Text is highly general and easy to use.  Concatenations and substrings
>> are easy.  Semantics, to its clients, are value semantics, similar to INTEGER.
>> Random access by *character* number is easy and, hopefully, implemented
>> with efficiency at least better than O(n).
>>
>> Wr and friends restrict you to sequential access, at least mostly, but
>> gain implementation convenience and efficiency as a result.
>>
>> I feel very strongly that we should *not* take away the full generality
>> of Text, especially efficient random access, to handle variable-length
>> character encodings in strings.  For these, let's make more friends of
>> Wr and Rd, which already assume sequential access.  For example, a
>> filter pipe that sequentially reads a Text/Array/stream, applies a UTF-8
>> interpretation to its bytes, and delivers a stream of Unicode characters,
>> in variables of type WIDECHAR.
>>
>> Text should preserve the abstraction that it's a string of characters,
>> generalized as it already is in cm3, to have type WIDECHAR, so they can be any
>> Unicode character.  The internal representation should, usually, not be
>> of concern.
>>
>> Note that nowhere in Text are character values transferred between
>> a Text.T and any form of I/O stream.  In the Text abstraction, all
>> characters go in and out of a Text.T in variables of type CHAR,
>> WIDECHAR, and arrays thereof.  IO, etc. is only done in streams,
>> e.g., TextWr.  We can easily add new variants of these that encode/decode
>> by various rules.
>>
>> Of course, it is still valid to put a string of bytes in a Text.T and
>> apply, e.g., UTF-8 interpretation yourself.  But that's lower-level
>> programming, and shouldn't confuse the abstraction.
>>
>>>
>>> I can see a use for various wide characters -- the things you extract
>>> from a TEXT by parsing bits of it, but none for anything
>>> really new or complicated for wide TEXT.
>>>
>>> The only confusing thing is that the existing operations for extracting
>>> bytes from TEXT have names that suggest they are extracting characters.
>>>
>>
>> I think it's more than a suggestion.  I think the abstraction clearly
>> considers them characters.  And it should stay that way.  If you want,
>> at a higher level of code, to treat them as bytes, that's fine, but the
>> abstraction continues to view them as characters (which only you, the
>> client, know is not really so.)
>>
>>> -- Hendrik
>>>
>
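
The two approaches quoted above can be sketched side by side: a fetch-and-advance step driven by a byte index, and a sequential filter that turns a byte string into a stream of code points. This is only an illustration in Python, not the Text, Rd, or Wr interfaces; the names decode_at and code_points are invented here, and the validity checks are deliberately minimal.

```python
from typing import Iterator, Tuple


def decode_at(data: bytes, i: int) -> Tuple[int, int]:
    """Return (code_point, next_byte_index) for the sequence starting at byte i.

    Minimal checks only: overlong three- and four-byte forms and
    surrogates are not rejected in this sketch.
    """
    first = data[i]
    if first < 0x80:
        return first, i + 1            # single-byte (ASCII) character
    if 0xC2 <= first <= 0xDF:
        need, cp = 1, first & 0x1F     # one continuation byte follows
    elif 0xE0 <= first <= 0xEF:
        need, cp = 2, first & 0x0F     # two continuation bytes follow
    elif 0xF0 <= first <= 0xF4:
        need, cp = 3, first & 0x07     # three continuation bytes follow
    else:
        raise ValueError("invalid lead byte at index %d" % i)
    if i + need >= len(data):
        raise ValueError("truncated sequence at index %d" % i)
    for k in range(i + 1, i + need + 1):
        b = data[k]
        if b & 0xC0 != 0x80:
            raise ValueError("invalid continuation byte at index %d" % k)
        cp = (cp << 6) | (b & 0x3F)
    return cp, i + need + 1


def code_points(data: bytes) -> Iterator[int]:
    """Sequential filter: read bytes in order, deliver code points."""
    i = 0
    while i < len(data):
        cp, i = decode_at(data, i)
        yield cp


# Usage: the byte index advances by 1, 2, 3, then 4 bytes.
sample = "a\u00e9\u20ac\U0001F600".encode("utf-8")
decoded = list(code_points(sample))
```

A saved byte index can be reused later on the same string, but the n-th character can only be found by scanning from the start, which is the cost of giving up fixed-width elements.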


