[M3devel] UTF-8 TEXT

Wed Jun 27 21:20:41 CEST 2012

On 06/26/2012 10:30 PM, Hendrik Boom wrote:
> On Tue, Jun 26, 2012 at 04:22:22PM -0400, Coleburn, Randy wrote:
>> I seem to recall that Rodney did some work a while back relating to TEXT.
>> Rodney, can you weigh in on some of this?
>> --Randy Coleburn
>>
>> From: Dragiša Durić [mailto:dragisha at m3w.org]
>> Sent: Tuesday, June 26, 2012 12:46 PM
>> To: Jay
>> Cc: m3devel
>> Subject: EXT Re: [M3devel] AND (., 16_ff). Not serious - or so I hope!
>>
>> You had idea in other message. Store length!
>>
>> Another idea - store partial list of indices to character locations. So whatever one does, that list can be used/expanded. Whatever storage issues this makes, they are probably minor as compared to 32bit WIDECHAR for all idea.
>
> Most of the time, you don't need explicit integer indexes to character
> locations.  What you do need is an operation that fetches a character
> given the string and its index (whatever data structure that index is),
> and  one that increments the index past that character.  As long as you
> can save an index and use it later on the same string, that's probably
> all you ever need.  And with a simple TEXT representation (such as the
> obvious array of bytes containing characters of various widths) a byte
> index is all you need (note: NOT a character index).  It's easy even to
> use TEXT and its integer indices as the data representation, as long as
> you use the proper functions to parse the characters and increment the
> indices by amounts that might differ from 1.
>
> And if your source code is represented in UTF-8, the representation that
> requires little extra compiler effort to parse,  your TEXT strings will
> automagically appear in UTF-8.

The original designers of the language and its libraries have given us
two different abstractions for handling character strings (in addition
to plain arrays): 1) Text, and 2) Wr, Rd, and their cousins.

Text is highly general and easy to use.  Concatenations and substrings
are easy.  Semantics, to its clients, are value semantics, similar to INTEGER.
Random access by *character* number is easy and, hopefully, implemented
with efficiency at least better than O(n).

Wr and friends restrict you to sequential access, at least mostly, but
gain implementation convenience and efficiency as a result.

I feel very strongly that we should *not* take away the full generality
of Text, especially efficient random access, to handle variable-length
character encodings in strings.  For these, let's make more friends of
Wr and Rd, which already assume sequential access.  For example, a
filter pipe that sequentially reads a Text/Array/stream, applies a UTF-8
interpretation to its bytes, and delivers a stream of Unicode characters,
in variables of type WIDECHAR.
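
As an illustration only, here is a minimal sketch of such a filter.  It is
written in Python rather than Modula-3, and the names utf8_chars and
chunk_size are hypothetical, not part of any existing Rd/Wr interface.  It
reads a byte stream sequentially and delivers one decoded character at a
time; Python's one-character strings stand in for WIDECHAR values.

```python
import codecs
import io


def utf8_chars(byte_stream, chunk_size=4096):
    """Yield decoded characters from a binary stream, reading sequentially.

    byte_stream is any object with a read(n) method returning bytes.
    """
    # The incremental decoder buffers a multi-byte sequence that happens
    # to be split across two chunks, so chunk boundaries do not matter.
    decoder = codecs.getincrementaldecoder("utf-8")()
    while True:
        chunk = byte_stream.read(chunk_size)
        if not chunk:
            break
        for ch in decoder.decode(chunk):
            yield ch
    # final=True makes a truncated trailing sequence raise
    # UnicodeDecodeError instead of being silently dropped.
    for ch in decoder.decode(b"", final=True):
        yield ch


if __name__ == "__main__":
    source = io.BytesIO("Dragiša Durić".encode("utf-8"))
    for ch in utf8_chars(source, chunk_size=3):
        print(ch, hex(ord(ch)))
```

The client of such a filter sees only characters and never needs to know
how many bytes each one occupied in the underlying stream.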

Text should preserve the abstraction that it is a string of characters,
generalized, as it already is in cm3, to type WIDECHAR, so that they can
be any Unicode character.  The internal representation should, usually,
not be of concern.

Note that nowhere in Text are character values transferred between
a Text.T and any form of I/O stream.  In the Text abstraction, all
characters go in and out of a Text.T in variables of type CHAR,
WIDECHAR, and arrays thereof.  IO, etc. is only done in streams,
e.g., TextWr.  We can easily add new variants of these that encode/decode
by various rules.

Of course, it is still valid to put a string of bytes in a Text.T and
apply, e.g., UTF-8 interpretation yourself.  But that's lower-level
programming, and shouldn't confuse the abstraction.
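
For concreteness, the lower-level approach amounts to something like the
following sketch, again in Python rather than Modula-3.  A bytes object
stands in for a Text.T used as a byte container, and decode_at and
code_points are hypothetical names.  The index is a byte index, and it
advances by the width of each character rather than by 1.  The sketch
checks the structure of each sequence but, as a simplification, does not
reject overlong encodings or surrogate code points.

```python
def decode_at(data: bytes, i: int):
    """Decode the UTF-8 character starting at byte index i.

    Returns (code_point, next_index), where next_index is the byte index
    of the following character.
    """
    b0 = data[i]
    if b0 < 0x80:                 # 0xxxxxxx: one byte
        return b0, i + 1
    elif b0 >> 5 == 0b110:        # 110xxxxx: one continuation byte
        n, cp = 1, b0 & 0x1F
    elif b0 >> 4 == 0b1110:       # 1110xxxx: two continuation bytes
        n, cp = 2, b0 & 0x0F
    elif b0 >> 3 == 0b11110:      # 11110xxx: three continuation bytes
        n, cp = 3, b0 & 0x07
    else:
        raise ValueError(f"invalid lead byte at index {i}")
    if i + n >= len(data):
        raise ValueError(f"truncated sequence at index {i}")
    for k in range(1, n + 1):
        b = data[i + k]
        if b >> 6 != 0b10:        # continuation bytes are 10xxxxxx
            raise ValueError(f"invalid continuation byte at index {i + k}")
        cp = (cp << 6) | (b & 0x3F)
    return cp, i + n + 1


def code_points(data: bytes):
    """Walk the whole byte string, yielding one code point per character."""
    i = 0
    while i < len(data):
        cp, i = decode_at(data, i)
        yield cp
```

A saved byte index stays valid for later use on the same string, but
finding the k-th character this way takes a scan from the start, which is
exactly the random-access cost that Text itself should not have to pay.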

>
> I can see a use for various wide characters -- the things you extract
> from a TEXT by parsing bits of it, but none for anything
> really new complicated for wide TEXT.
>
> The only confusing thing is that the existing operations for extracting
> bytes from TEXT have names that suggest they are extracting characters.
>

I think it's more than a suggestion.  I think the abstraction clearly
considers them characters.  And it should stay that way.  If you want,
at a higher level of code, to treat them as bytes, that's fine, but the
abstraction continues to view them as characters (which only you, the
client, know is not really so.)

> -- Hendrik
>