[M3devel] UTF-8 TEXT

Thu Jun 28 19:02:30 CEST 2012

On Thu, Jun 28, 2012 at 09:10:02AM -0500, Rodney M. Bates wrote:
> 
> 
> On 06/28/2012 07:44 AM, Hendrik Boom wrote:
> >On Wed, Jun 27, 2012 at 02:20:41PM -0500, Rodney M. Bates wrote:
> >>
> >>Text is highly general and easy to use.  Concatenations and substrings
> >>are easy.  Semantics, to its clients, are value semantics, similar to INTEGER.
> >>Random access by *character* number is easy and, hopefully, implemented
> >>with efficiency at least better than O(n).
> >
> >Does it have to be a *character* number we use to index a string?  I
> >don't know of any situations where that aspect is important enough
> >to force everyone to waste storage on it.
> >
> >-- hendrik
> >
> 
> It is absolutely essential that it be a character, if you care about
> Text being a meaningful abstraction.  A byte index is a very low level
> view, now that we have a variable-length encoding, and *especially*
> now that there are multiple possible ways of representing strings.

I'm not arguing whether the index should point to a character.  I'm 
questioning whether it need be a count of characters.  This is surely
a matter of data representation rather than concept.

A character index could be implemented in a variety of ways.

It certainly could be implemented as a character count, presumably 
for legacy applications, with attendant costs.  It could be 
implemented as a byte count if the string were implemented as an 
array of bytes.

It could be implemented as a machine address, constrained to index 
into a particular string.

It could be implemented as a pointer into a linked list of string 
pieces, together with an offset indicating where in that piece it 
currently points. 
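
To illustrate the last variant, here is a minimal sketch in Python
(not Modula-3, and not taken from any existing TEXT implementation).
The names Piece, Cursor, char_at and advance are invented for the
example, and it assumes that pieces hold UTF-8 bytes and are only
ever split at character boundaries.

```python
from dataclasses import dataclass
from typing import Optional


def utf8_len(lead: int) -> int:
    """Byte length of the UTF-8 sequence that begins with this lead byte."""
    if lead < 0x80:
        return 1
    if lead >> 5 == 0b110:
        return 2
    if lead >> 4 == 0b1110:
        return 3
    if lead >> 3 == 0b11110:
        return 4
    raise ValueError("not at a character boundary")


@dataclass
class Piece:
    data: bytes                      # UTF-8 bytes of this piece (assumption)
    next: Optional["Piece"] = None   # following piece, or None at the end


@dataclass
class Cursor:
    piece: Piece   # which piece the index points into
    offset: int    # byte offset of a character start within piece.data


def char_at(cur: Cursor) -> str:
    """Extract the character the cursor points to."""
    n = utf8_len(cur.piece.data[cur.offset])
    return cur.piece.data[cur.offset:cur.offset + n].decode("utf-8")


def advance(cur: Cursor) -> Optional[Cursor]:
    """Return a cursor for the next character, or None at end of text."""
    piece = cur.piece
    off = cur.offset + utf8_len(piece.data[cur.offset])
    # Step over the end of this piece (and any empty pieces) if needed.
    while piece is not None and off >= len(piece.data):
        piece, off = piece.next, 0
    return None if piece is None else Cursor(piece, off)


# Usage: walk a two-piece text character by character.
second = Piece("\u20ac!".encode("utf-8"))
first = Piece("h\u00ef".encode("utf-8"), second)
cur = Cursor(first, 0)
chars = []
while cur is not None:
    chars.append(char_at(cur))
    cur = advance(cur)
```

The client only ever holds a Cursor and never counts anything, yet
every cursor still designates exactly one character.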

We could even implement byte and character counts in the more exotic 
TEXT data structures if we chose; we have freedom of representation 
of TEXT without compromising integer.

We can implement *both*
  character extractors using an INTEGER *byte* count
AND
  character extractors using an INTEGER *character* count.

And we can do this in just about any representation of TEXT we come
up with. 
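
As a rough sketch of the two extractors side by side, assuming a TEXT
stored as a flat array of UTF-8 bytes: the code below is Python, not
Modula-3, and the names char_at_byte and char_at_count are made up
for illustration rather than taken from any library.

```python
def utf8_len(lead: int) -> int:
    """Byte length of the UTF-8 sequence that begins with this lead byte."""
    if lead < 0x80:
        return 1
    if lead >> 5 == 0b110:
        return 2
    if lead >> 4 == 0b1110:
        return 3
    if lead >> 3 == 0b11110:
        return 4
    raise ValueError("not at a character boundary")


def char_at_byte(data: bytes, byte_index: int) -> str:
    """Extract the character that starts at byte_index.

    Constant time: only the bytes of that one character are examined.
    """
    n = utf8_len(data[byte_index])
    return data[byte_index:byte_index + n].decode("utf-8")


def next_byte_index(data: bytes, byte_index: int) -> int:
    """Byte index of the character following the one at byte_index."""
    return byte_index + utf8_len(data[byte_index])


def char_at_count(data: bytes, char_index: int) -> str:
    """Extract the char_index-th character (counting from 0).

    Linear time in this representation: skips char_index characters
    from the start of the text.
    """
    pos = 0
    for _ in range(char_index):
        pos = next_byte_index(data, pos)
    return char_at_byte(data, pos)


# "naive" with a diaeresis on the i, a space, and a euro sign:
# 7 characters, 10 bytes.
text = "na\u00efve \u20ac".encode("utf-8")
euro_by_count = char_at_count(text, 6)   # seventh character
euro_by_byte = char_at_byte(text, 7)     # character starting at byte 7
```

Both calls return the same character.  A byte index that falls in
the middle of a character is rejected in this sketch, so the index
still always designates a whole character.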

The specification for the abstraction doesn't even have to say that
it's a byte count.  It's sufficient to say one can use an index that
is chosen for implementation efficiency.

Though it's tempting to provide a byte count for an operation that
extracts bytes, not characters.  Now that would be a low-level 
operation that does break the abstraction.

-- hendrik

> 
> When it was only ASCII (or ISO-latin1), it was a character
> index, and the abstraction was there.  The fact that it was also a
> byte index is a coincidental consequence of the choice of underlying
> physical representation.  Now we have a much messier situation regarding
> representations, but we should not destroy the abstraction and force
> everyone to always get down into the bowels of the different representations.
> 
> There will still be mechanisms for low-level coding if you have some
> compelling reason, or just don't want to rewrite something existing.
> But let's protect the option of dealing with characters with the same
> abstraction we have had in the past.

Yes, it was obviously a mistake for Modula 3 not to distinguish between
two types for character and byte.  And it's not the only language to
have made that mistake.  There are two different abstractions here, with 
different meanings, but they share one name and one implementation.

Frankly, I don't care which of the two retains the name CHAR.  It's all 
the same to me whether

(a) characters are called WIDECHAR and bytes CHAR

or 

(b) characters are called CHAR and bytes BYTE.

because either way programs are going to have to be changed to adapt to 
the new world.  (a) is probably less disruptive to legacy programs that 
only ever need to deal with legacy ASCII files.  (b) is probably 
conceptually cleaner. 

What's important is that both mechanisms remain available for dealing 
with values of type TEXT.

The designers of Modula 3 have done an admirable job of providing a 
collection of abstractions that enable both conceptually clean and 
efficient implementations.  Let's not mess it up by providing only a 
conceptually clean, inefficient interface.

-- hendrik