[M3devel] INTEGER

Tue Apr 20 20:00:40 CEST 2010

On Tue, Apr 20, 2010 at 10:59:57AM -0500, Rodney M. Bates wrote:
>
>
> hendrik at topoi.pooq.com wrote:
>> On Sun, Apr 18, 2010 at 12:59:01PM -0700, Mika Nystrom wrote:
>>> And the new m3tk, stubgen, and compiler(?) won't compile using PM3 or
>>> SRC M3 since we're no longer using the language of the Green Book.
>>> (Same comment goes for WIDECHAR, by the way.)
>>
>> When I use Unicode characters, I find myself defining them to be
>> INTEGERs, because I'm always worried that WIDECHARs might be only 16
>> bits.
>
> Actually, I favor making WIDECHAR big enough to hold all the defined
> Unicode values, which I understand have been more than 2^16 for some
> time now.  The only thing the language definition says about WIDECHAR
> is that it has at least 65536 values and the first 65536 correspond to
> the first 65536 Unicode characters.  It could be bigger.  Moreover, as
> I recall from a while back, the compiler, old and new Text implementation,
> other libraries, m3gdb, and maybe pickles are only a couple of declaration
> changes away from supporting some bigger range for WIDECHAR.
>
> (I think I am the one who put WIDECHAR into the language reference, and
> that was only recently, despite its being in the cm3 language for years.  I
> also see that this has not propagated from the .tex version into the .html
> and .pdf files.)
>
>
>>
>> Strings of WIDECHARs are probably unnecessary.  The last program I wrote
>> that used Unicode used INTEGERs for characters, and arrays of INTEGERs
>> for strings.  But I think it was a mistake to do it this way, and when 
>> I have time I'll rewrite it.  UTF-8 seems to be the way to go; strings 
>> of Unicode can easily be manipulated as ordinary TEXT.  In fact, I  
>> think there are few if any places in my code where I would have had to  
>> do anything special whatsoever if I had just used UTF8 in TEXT.  The  
>> program would simply become simpler.
>
> This is only true if you handle strings in certain common but restricted
> ways.  If you just move strings around, UTF-8 will often work with little
> or no code change.  OTOH, if you need to access characters non-sequentially
> or make any examination or production of individual characters not in the
> Iso-latin-1 subset, any variable-length encoding quickly becomes unworkable.

When you are doing things like this, you are probably doing something 
like -- shall I say it?  -- *parsing*.  There's no need for parsing to 
stop at the character level.  You might as well parse all the way down 
to bytes.  And UTF-8 is designed so that that is particularly easy.
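
To make that concrete, here is a minimal sketch in Python rather than 
Modula-3 (it is not taken from any cm3 library, and the name split_words 
is hypothetical).  It relies on one property of UTF-8: every byte of a 
multi-byte sequence has its high bit set, so an ASCII delimiter such as 
a space can never turn up in the middle of an encoded character.  The 
scanner therefore never has to decode anything.

```python
def split_words(data: bytes) -> list[tuple[int, int]]:
    """Return (start, length) byte spans of space-separated words in UTF-8 data."""
    spans = []
    start = None                      # byte offset where the current word began
    for i, b in enumerate(data):
        if b == 0x20:                 # ASCII space; never part of a multi-byte sequence
            if start is not None:
                spans.append((start, i - start))
                start = None
        elif start is None:
            start = i
    if start is not None:             # word running up to the end of the input
        spans.append((start, len(data) - start))
    return spans


# Sample input: "naive" with i-diaeresis, "cafe" with e-acute, three CJK characters.
text = "na\u00efve caf\u00e9 \u65e5\u672c\u8a9e".encode("utf-8")
spans = split_words(text)
words = [text[s:s + n].decode("utf-8") for s, n in spans]
```

The spans are byte offsets and byte lengths, so each word comes out 
with a plain slice of the original bytes.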

>
> Text.Sub and Text.GetChar won't work in any sensible way if they view the
> string as a string of 8-bit bytes, when it's actually not.  Reimplementing
> these to understand UTF-8 would make what was O(1) become O(n).
> Ditto Text.Length.

Exactly.  That's why you leave them as byte operations.  It's pretty 
rare that you want to pull six characters out of a string.  It's pretty 
common that you want to pull a word out of a string, having parsed part 
of it and discovered that the word is n characters wide.  You could 
instead have parsed it and discovered that it is m bytes long.
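
A rough illustration of the difference, again as a Python sketch under 
my own assumptions (byte_offset_of_char is a hypothetical helper, not a 
Text interface routine).  Finding where the n-th character starts means 
walking the bytes and counting the ones that are not continuation bytes, 
which is the O(n) cost mentioned above.  If the parser already knows 
the length in bytes, the slice is direct.

```python
def byte_offset_of_char(data: bytes, n: int) -> int:
    """Byte offset of the n-th character (0-based) in UTF-8 data; an O(n) walk."""
    seen = 0
    for i, b in enumerate(data):
        if (b & 0xC0) != 0x80:        # not a continuation byte: a character starts here
            if seen == n:
                return i
            seen += 1
    if seen == n:                     # n equals the character count: offset is the end
        return len(data)
    raise IndexError("character index out of range")


data = "caf\u00e9 au lait".encode("utf-8")

# Character-based: "the word is 4 characters wide" needs a scan to find its end.
end = byte_offset_of_char(data, 4)
by_chars = data[0:end]

# Byte-based: the parser recorded that the word is 5 bytes long.
by_bytes = data[0:5]
```

Both slices hold the same bytes; only the first had to rescan the 
string to find them.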

The point is that it's almost never worthwhile to parse it once, 
turn it into an array of WIDECHARs, and then parse the result again.

Just like compilers -- they don't start by lexically scanning the entire 
source code and storing it somewhere, and then parsing it afterward.  
Not normally, anyway.

-- hendrik