[M3devel] INTEGER

Tue Apr 20 17:59:57 CEST 2010

hendrik at topoi.pooq.com wrote:
> On Sun, Apr 18, 2010 at 12:59:01PM -0700, Mika Nystrom wrote:
>> And the new m3tk, stubgen, and compiler(?) won't compile using PM3 or
>> SRC M3 since we're no longer using the language of the Green Book.
>> (Same comment goes for WIDECHAR, by the way.)
> 
> When I use Unicode characters, I find myself defining them to be 
> INTEGERs, because I'm always worried that WIDECHARs might be only 16 bits.

Actually, I favor making WIDECHAR big enough to hold all the defined
Unicode values, which I understand have been more than 2^16 for some
time now.  The only thing the language definition says about WIDECHAR
is that it has at least 65536 values and the first 65536 correspond to
the first 65536 Unicode characters.  It could be bigger.  Moreover, as
I recall from a while back, the compiler, old and new Text implementation,
other libraries, m3gdb, and maybe pickles are only a couple of declaration
changes away from supporting some bigger range for WIDECHAR.
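
To make the range issue concrete, here is a minimal sketch (in Python, not
Modula-3, and not taken from any cm3 source) of why a 16-bit WIDECHAR cannot
hold every defined Unicode value.  The function names are hypothetical and
exist only for this illustration.

```python
# Illustrative sketch only: Python, not Modula-3. All names are hypothetical.
MAX_16_BIT = 0xFFFF        # largest value a 16-bit character type can hold
MAX_UNICODE = 0x10FFFF     # largest code point Unicode defines


def fits_in_16_bits(code_point: int) -> bool:
    """True if the code point is representable in a single 16-bit value."""
    return 0 <= code_point <= MAX_16_BIT


def utf16_units(code_point: int) -> int:
    """Number of 16-bit units UTF-16 needs for one code point."""
    if not 0 <= code_point <= MAX_UNICODE:
        raise ValueError("outside the Unicode range")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate values are not characters")
    # Anything above 0xFFFF has to be split into a surrogate pair.
    return 1 if fits_in_16_bits(code_point) else 2


LATIN_A = 0x41        # 'A'
G_CLEF = 0x1D11E      # MUSICAL SYMBOL G CLEF, beyond the 16-bit range
```

With these definitions, `utf16_units(LATIN_A)` is 1 and `utf16_units(G_CLEF)`
is 2, so a type limited to 65536 values needs two elements for the second
character.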

(I think I am the one who put WIDECHAR into the language reference, and
that was only recently, despite its being in the cm3 language for years.  I
also see that this has not propagated from the .tex version into the .html
and .pdf files.)

> 
> Strings of WIDECHARS are probably unnecessary.  The last program I wrote 
> that used Unicode used INTEGERs for characters, and arrays of INTEGERS 
> for strings.  But I think it was a mistake to do it this way, and when I 
> have time I'll rewrite it.  UTF-8 seems to be the way to go; strings of 
> Unicode can easily be manipulated as ordinary TEXT.  In fact, I 
> think there are few if any places in my code where I would have had to 
> do anything special whatsoever if I had just used UTF-8 in TEXT.  The 
> program would simply become simpler.

This is only true if you handle strings in certain common but restricted
ways.  If you just move strings around, UTF-8 will often work with little
or no code change.  OTOH, if you need to access characters non-sequentially
or make any examination or production of individual characters not in the
ISO-Latin-1 subset, any variable-length encoding quickly becomes unworkable.

Text.Sub and Text.GetChar won't work in any sensible way if they view the
string as a string of 8-bit bytes, when it's actually not.  Reimplementing
these to understand UTF-8 would make what was O(1) become O(n).
Ditto Text.Length.
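
A rough sketch of the cost, again in Python rather than Modula-3: the two
functions below are hypothetical stand-ins for a UTF-8-aware Text.Length and
Text.GetChar, assuming well-formed UTF-8 input.  Neither can avoid walking the
bytes from the start.

```python
# Illustrative sketch only: Python, not Modula-3. Names are hypothetical and
# the input is assumed to be well-formed UTF-8.

def is_lead_byte(b: int) -> bool:
    """Continuation bytes look like 10xxxxxx; any other byte starts a character."""
    return (b & 0xC0) != 0x80


def utf8_length(data: bytes) -> int:
    """Count characters by inspecting every byte: O(n), not O(1)."""
    return sum(1 for b in data if is_lead_byte(b))


def utf8_get_char(data: bytes, index: int) -> str:
    """Return the character at position `index`, scanning from the start: O(n)."""
    if index < 0:
        raise IndexError("negative index")
    seen = -1  # index of the most recent character start encountered
    for start, b in enumerate(data):
        if is_lead_byte(b):
            seen += 1
            if seen == index:
                # Extend over the continuation bytes of this character.
                end = start + 1
                while end < len(data) and not is_lead_byte(data[end]):
                    end += 1
                return data[start:end].decode("utf-8")
    raise IndexError("index past end of string")


# "naive" with a diaeresis, a space, and a character beyond 16 bits.
sample = "na\u00efve \U0001D11E".encode("utf-8")
```

Here `len(sample)` is 11 bytes while `utf8_length(sample)` is 7 characters,
and a plain byte index such as `sample[2:3]` yields only half of the third
character.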

> 
> -- hendrik
>