[M3devel] UTF-8 TEXT

Thu Jun 28 16:10:02 CEST 2012

On 06/28/2012 07:44 AM, Hendrik Boom wrote:
> On Wed, Jun 27, 2012 at 02:20:41PM -0500, Rodney M. Bates wrote:
>>
>> Text is highly general and easy to use.  Concatenations and substrings
>> are easy.  Semantics, to its clients, are value semantics, similar to INTEGER.
>> Random access by *character* number is easy and, hopefully, implemented
>> with efficiency at least better than O(n).
>
> Does it have to be a *character* number we use to index a string?  I
> don't know of any situations where that aspect is important enough
> to force everyone to waste storage on it.
>
> -- hendrik
>

It is absolutely essential that it be a character index, if you care about
Text being a meaningful abstraction.  A byte index is a very low-level
view, now that we have a variable-length encoding, and *especially*
now that there are multiple possible ways of representing strings.

When it was only ASCII (or ISO-latin1), it was a character
index, and the abstraction was there.  The fact that it was also a
byte index was a coincidental consequence of the choice of underlying
physical representation.  Now we have a much messier situation regarding
representations, but we should not destroy the abstraction and force
everyone to always get down into the bowels of the different representations.
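
To make the distinction concrete, here is a small sketch in Python rather
than Modula-3, only because it is easy to run.  It assumes UTF-8 as the
underlying representation, and the helper char_at is a made-up name for
illustration, not part of any Text interface.  It shows that byte index
and character index agree for pure ASCII and diverge as soon as a
multi-byte character appears.

```python
# Illustration only: Python's str stands in for the character-level
# abstraction, and bytes for one possible representation (UTF-8).

def char_at(encoded: bytes, index: int) -> str:
    """Return the character at character position `index` of UTF-8 data.

    Hypothetical helper: decodes the whole string, so it is O(n).
    """
    return encoded.decode("utf-8")[index]

ascii_text = "Text".encode("utf-8")
mixed_text = "na\u00efve".encode("utf-8")  # U+00EF takes two bytes in UTF-8

# Pure ASCII: byte index and character index coincide.
assert chr(ascii_text[2]) == char_at(ascii_text, 2) == "x"

# After a multi-byte character they no longer coincide.
assert char_at(mixed_text, 3) == "v"    # character index 3
assert mixed_text[3:4] == b"\xaf"       # byte index 3: second byte of U+00EF

# Five characters, six bytes.
assert len(mixed_text.decode("utf-8")) == 5
assert len(mixed_text) == 6
```

A client indexing by byte would get half of a character here, which is
exactly the kind of representation detail the abstraction is meant to hide.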

There will still be mechanisms for low-level coding if you have some
compelling reason, or just don't want to rewrite something existing.
But let's protect the option of dealing with characters with the same
abstraction we have had in the past.