[M3devel] This disgusting TEXT business

Wed Dec 24 19:00:01 CET 2008

On Tue, Dec 23, 2008 at 08:34:00PM +0100, Dragiša Durić wrote:
> On Wed, 2008-12-24 at 08:10 -0500, hendrik at topoi.pooq.com wrote:
> > On Wed, Dec 24, 2008 at 08:00:15AM -0500, hendrik at topoi.pooq.com wrote:
> > > On Tue, Dec 23, 2008 at 07:43:00PM +0100, Dragiša Durić wrote:
> > > > 
> > > > CONST
> > > >  MyNameInCyrillic = "Драгиша Дурић";
> > > >  MyNameInLatin = "Dragiša Durić";
> > > > 
> > > > You can see or not these glyphs, depending on your MUA and to some
> > > > degree on MTA's in transit. 
> > > 
> > > I see it.  All we need to do to make this work is say that Modula-3 
> > > programs are encoded in Unicode.  That and some work under the hood so 
> > > make it work.
> > 
> > Well, probably a *lot* of work.
> 
>   Not at all. I am already using it. Just typing into.
> 
>   What does not work and I am not using it "length in glyphs". Not
> important for me, but not hard to implement. Either directly in Modula-3
> or using external C lib.
> 
>   Search for UTF-8 pattern in UTF-8 text works - of course. Also -
> partitioning of TEXT's by *any real* criteria. By position of CHAR, or
> by position of any substring.

I see.  You are just using the existing TEXT to store the bytes of 
UTF-8.  That seems one of the simplest ways of accommodating UTF-8, in 
fact.  No conversions in or out, just work in a pure UTF-8 environment.

The "work" I talked about would be the effort necessary to implement 
all the existing methods that use character counts as character 
counts.  Treating them as using byte counts is indeed simple.
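
To make the difference between the two kinds of count concrete, here is 
a minimal sketch in C (not Modula-3, and not anything from the existing 
libraries).  The function names are made up for illustration, and the 
input is assumed to be valid, NUL-terminated UTF-8.  It counts code 
points, which is still not quite "length in glyphs" once combining 
characters are involved.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: number of bytes in a NUL-terminated UTF-8 string.
   This is what a byte-oriented length operation reports. */
size_t utf8_byte_length(const char *s)
{
    assert(s != NULL);
    return strlen(s);
}

/* Hypothetical helper: number of code points, assuming s is valid UTF-8.
   Every byte that is not a continuation byte (bit pattern 10xxxxxx)
   starts a new code point, so counting those bytes is enough. */
size_t utf8_codepoint_length(const char *s)
{
    size_t n = 0;

    assert(s != NULL);
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    }
    return n;
}
```

For the Latin spelling of the name quoted above, the byte length is 15 
and the code point count is 13, since each of the two accented letters 
takes two bytes.  Note that the code point count needs a scan of the 
whole string, while the byte length does not have to.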

What remains is to build operations that deal in these UTF-8 TEXTs as 
being strings of UTF-8 characters, with operations like fetching and 
storing characters instead of bytes.  Perhaps what is really needed is 
clarity in the specification, so we know which operations to use on 
bytes, and which on characters, and abolish the myth that character 
counts are an efficient way of accessing characters.

In retrospect, maybe there should have been a type BYTE, a type 
CHARACTER, and a type TEXT, all conceptually separate.  TEXT could be 
accessed by byte or character operations; and indexes into 
TEXT would be byte offsets.  This lacks some generality that might have 
been needed in the days of machines with word-addressable 29-bit words, 
but those days seem to be past.
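
As a rough illustration of "character operations on a TEXT indexed by 
byte offsets", here is a sketch in C.  The name utf8_next and its 
signature are invented for this example.  It is not a full validator: 
it does not reject overlong encodings or surrogates, and it simply 
returns U+FFFD when a sequence is malformed or truncated.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decode the code point that starts at byte offset
   *pos in the len-byte buffer s, and advance *pos past the bytes read.
   Returns 0xFFFD (the replacement character) on a malformed or
   truncated sequence. */
uint32_t utf8_next(const unsigned char *s, size_t len, size_t *pos)
{
    unsigned char b;
    uint32_t cp;
    int extra;

    assert(s != NULL && pos != NULL && *pos < len);
    b = s[(*pos)++];

    /* The lead byte tells us how many continuation bytes follow. */
    if (b < 0x80)                { cp = b;        extra = 0; }
    else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; extra = 1; }
    else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; extra = 2; }
    else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; extra = 3; }
    else return 0xFFFD;  /* stray continuation or invalid lead byte */

    /* Each continuation byte contributes its low six bits. */
    while (extra-- > 0) {
        if (*pos >= len || (s[*pos] & 0xC0) != 0x80)
            return 0xFFFD;
        cp = (cp << 6) | (uint32_t)(s[(*pos)++] & 0x3F);
    }
    return cp;
}
```

The caller keeps a byte offset and gets characters back, so stepping 
through a string is cheap, while "fetch the n-th character" still 
requires walking from the start.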

What blocks introducing this approach seems to be the prevalence 
of code that uses CHARACTER as a synonym for BYTE.

Do we, for some kind of compatibility, have to use CHARACTER when we 
mean BYTE, and something else when we want to say CHARACTER?

In the long run, will we be more inconvenienced by having to rewrite 
code that uses CHARACTER to represent characters, or code that uses 
CHARACTER to represent bytes?  It is already possible to use 0..255 to 
represent bytes, isn't it?  Or am I wrong in this?  Might this mean 
four-byte integers that happen to have a restricted range of values?

-- hendrik