[M3devel] This disgusting TEXT business

Tue Dec 23 19:43:00 CET 2008

  My mother tongue is written in two alphabets: Latin, covered by
ISO-8859-2, and Cyrillic, covered by ISO-8859-5, with a one-to-one glyph
correspondence (and three digraphs in the Latin variant). So I think I
have above-average experience with non-Latin1 alphabets, in various areas.

  If I had to write what we call a widetext literal in my code, I would
have to work from Unicode tables and pick it out character by character.
Tedious!

  What I would do instead is switch my keyboard to either the Latin or
the Cyrillic mapping and (imagine that!) just type, thus getting UTF-8
characters into my source. My example literals would be:

CONST
 MyNameInCyrillic = "Драгиша Дурић";
 MyNameInLatin = "Dragiša Durić";

You may or may not see these glyphs, depending on your MUA and, to some
degree, on the MTAs in transit.
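
As a rough illustration of what "just typing" puts into the source file,
here is a small sketch. It is written in Python rather than Modula-3, and
the helper name describe is made up for this example. It counts the
characters, the UTF-8 bytes, and the largest code point of the two
literals above.

```python
# The two names from the message, typed directly as UTF-8 source text.
names = {
    "MyNameInCyrillic": "Драгиша Дурић",
    "MyNameInLatin": "Dragiša Durić",
}


def describe(text):
    """Return (character count, UTF-8 byte count, largest code point)."""
    return len(text), len(text.encode("utf-8")), max(ord(c) for c in text)


for label, text in names.items():
    chars, utf8_bytes, top = describe(text)
    print(f"{label}: {chars} chars, {utf8_bytes} UTF-8 bytes, "
          f"max code point U+{top:04X}, fits in 16 bits: {top <= 0xFFFF}")
```

Both names stay well inside the 16-bit range, so for text like this the
size of WIDECHAR is not the limiting factor.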

For all the WIDE* talk, this is what I actually use, and I am an example
of someone from the non-Latin1 world. How many of you are non-Latin1
people using 16-bit "W literals"?

dd

On Tue, 2008-12-23 at 11:28 -0600, Rodney M. Bates wrote:
> I hear three problems with CM3 TEXT:
> 
> 1) WIDECHAR and the TEXT implementation won't handle Unicode values that
>     exceed 2^16-1.
> 
...
> I think we already have a reasonably designed abstraction for TEXT,
> a bit of it built in to the language and the rest in Text.i3.  Only
> problem 1) affects the abstraction.  Addressing only 1) for now:
> 
> It ought to cause minimal grief to just change WIDECHAR so it has a
> big enough value range for all the Unicode values, probably 32 bits
> in today's world.  Surely nobody has written any code that assumes
> BITSIZE(WIDECHAR)=16. ;-)  Even if so, this shouldn't be a terribly
> hard change to adapt old code to, since the static rules of the
> language would point directly to most places that would need to be
> fixed.
> 
> That leaves the wide TEXT literals.  It is easy to forget, and I have
> to keep reminding myself, but (assuming again that nobody has gotten
> their fingers improperly into the implementation pie) there is
> currently no such thing as a "WIDETEXT" type.  Both kinds of literals
> are of type TEXT.  They are just different lexical rules for specifying
> literal values of type TEXT.  A bit like '16' and '16_10' are different
> ways of writing the same value, with the same type INTEGER.  This
> differs from the CHAR and WIDECHAR literals, which really are of
> different types.
> 
> The one change needed to the W"..." literals would be to allow
> escape sequences inside for giving characters numerically.  Right
> now, the \x0123 form of escape requires exactly 4 hex digits.  If
> we added a new alternative escape letter, in addition to the 'x', that
> required, say, exactly 8 hex digits, then these literals could express
> characters in the needed extra space, without affecting existing code.
> I suppose, for consistency and completeness, we should also add a new
> octal escape sequence that was long enough for the full new range.
> 
> And, we would also need to allow the same new escape sequences in
> WIDECHAR literals.
> 
...
> 
> - Rodney Bates
> 
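
The escape proposal quoted above can be sketched roughly as follows. This
is Python, not Modula-3, and it is not how the CM3 scanner works. The
escape letter U and the name decode_escapes are assumptions made for this
example, since the quoted text does not name the new letter. The sketch
keeps the existing 4-digit \x form and adds an 8-digit form that can
reach values above 2^16-1.

```python
import re

# \x followed by exactly 4 hex digits (the existing form described above), or
# \U followed by exactly 8 hex digits (hypothetical: the letter is an assumption).
ESCAPE = re.compile(r"\\x([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})")

MAX_CODE_POINT = 0x10FFFF  # upper bound of the Unicode code space


def decode_escapes(body):
    """Replace numeric escapes in a literal body with the characters they name.

    Anything that does not match one of the two escape forms is left as-is.
    """
    def substitute(match):
        value = int(match.group(1) or match.group(2), 16)
        if value > MAX_CODE_POINT:
            raise ValueError(f"escape {match.group(0)} is outside Unicode")
        return chr(value)

    return ESCAPE.sub(substitute, body)


# The 4-digit form covers the 16-bit range; the 8-digit form goes beyond it.
cyrillic_de = decode_escapes(r"\x0414")
g_clef = decode_escapes(r"\U0001D11E")
print(hex(ord(cyrillic_de)), hex(ord(g_clef)))
```

Because the 4-digit form is untouched, existing literals keep their
meaning, which is the compatibility property the proposal relies on.
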
-- 
Dragiša Durić <dragisha at m3w.org>