[M3devel] Fw: UTF-16: Greek alphabet with CM3

Hendrik Boom hendrik at topoi.pooq.com
Sun Dec 1 01:16:21 CET 2013


On Sat, Nov 30, 2013 at 01:59:47PM -0600, Rodney M. Bates wrote:
> 
> 
> On 11/30/2013 11:29 AM, Hendrik Boom wrote:
> >On Sat, Nov 30, 2013 at 10:52:44AM -0600, Rodney M. Bates wrote:
> >>Another devilish detail to be aware of:  UTF-16 is _not_ the same as
> >>the current Modula-3 16-bit WIDECHAR, even when restricted to values
> >><= 16_FFFF.  Current Wr/Rd library code just writes/reads
> >>exactly 16 bits in two bytes, with whatever code point is in the
> >>WIDECHAR variable.
> >>
> >>In contrast, UTF-16 will encode code points greater than
> >>U+FFFF as a pair of 16-bit code units with surrogate values in them.
> >>Then to make this work right, the surrogate values are not
> >>allowed in unencoded variables.  So attempting to encode a surrogate
> >>in UTF-16 is an error, and decoding a surrogate that is not part of a
> >>proper first-surrogate/second-surrogate pair is "ill-formed" and usually
> >>decodes to U+FFFD.
> >>
> >>You could get by with treating these as interchangeable only by being
> >>careful to ensure there is never either a surrogate code or a code
> >>point > U+FFFF, in either input or output.
> >>
> >>Also, current Wr/Rd always write/read only in little-endian byte order,
> >>whereas there are both little- and big-endian variants of UTF-16.
> >>I have no idea which endianness of UTF-16 is used by various GUI
> >>libraries, but it would have to be little for this to work.
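
As a rough illustration of the difference described above, here is a minimal Python sketch (not Modula-3, and not the actual Wr/Rd code). `encode_utf16le` follows the UTF-16 rules: surrogate pairs above U+FFFF, and an error for lone surrogates. `write_raw_16` is a hypothetical stand-in for the behaviour attributed to the current library: exactly two bytes, little-endian, no checks.

```python
import struct

def encode_utf16le(code_point):
    """Encode one Unicode code point as UTF-16LE bytes."""
    if code_point < 0 or code_point > 0x10FFFF:
        raise ValueError("not a Unicode code point")
    if 0xD800 <= code_point <= 0xDFFF:
        # Surrogate values are reserved for the encoding itself.
        raise ValueError("surrogate code points cannot be encoded")
    if code_point <= 0xFFFF:
        # One 16-bit code unit, low byte first.
        return struct.pack("<H", code_point)
    # Above U+FFFF: split the 20-bit offset into two surrogates.
    offset = code_point - 0x10000
    first = 0xD800 | (offset >> 10)
    second = 0xDC00 | (offset & 0x3FF)
    return struct.pack("<HH", first, second)

def write_raw_16(value):
    """Hypothetical model of a raw 16-bit write: two bytes, little-endian,
    whatever value is in the variable, with no surrogate handling."""
    return struct.pack("<H", value & 0xFFFF)

# Greek small alpha (U+03B1): both routes give the same two bytes.
assert encode_utf16le(0x03B1) == write_raw_16(0x03B1) == b"\xb1\x03"
```

Under these assumptions the two agree for every value in 0..16_FFFF except the surrogate range 16_D800..16_DFFF, which the raw write emits unchanged and UTF-16 rejects.
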
> >
> >It looks as if one might as well use UTF-8 if one is going to consider UTF-16.
> 
> Hmm.  Actually, *if* one could live with the restrictions on values above,
> passing the same strings back and forth, with the GUI considering them UTF-16LE
> and the Modula-3 app code considering them cm3's 16-bit WIDECHAR, would have
> the advantage that the M3 app code could deal naturally in characters, rather
> than varying numbers of fragments of characters.  UTF-8 would require
> the latter.
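
To make the "characters versus fragments" point concrete, here is a small Python sketch (an illustration only; `unit_counts` is a made-up helper, not anything from cm3). It counts how many 16-bit units and how many UTF-8 bytes a string needs.

```python
def unit_counts(text):
    """Return (code points, UTF-16 code units, UTF-8 bytes) for text."""
    code_points = len(text)
    utf16_units = len(text.encode("utf-16-le")) // 2
    utf8_bytes = len(text.encode("utf-8"))
    return code_points, utf16_units, utf8_bytes

# Greek alpha, beta, gamma: one 16-bit unit each, but two UTF-8 bytes each.
print(unit_counts("\u03b1\u03b2\u03b3"))   # (3, 3, 6)

# A code point above U+FFFF needs two 16-bit units, so the
# one-unit-per-character view only holds under the restriction above.
print(unit_counts("\U0001F600"))           # (1, 2, 4)
```
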

And then we just wait for the potential user who can't, and we'll have 
this discussion all over again.

With the disadvantage that we'll end up having to put still more 
mechanisms for handling text everywhere.

-- hendrik


> 
> 
> >
> >I looked up XIM on Wikipedia (http://en.wikipedia.org/wiki/X_Input_Method),
> >and it referred to newer systems: SCIM, uim, and IIIMF.  IIIMF appears to have
> >been superseded by SCIM; I don't know the status of uim, except that
> >it has a uim bridge.
> >
> >It does look as if SCIM
> >(http://en.wikipedia.org/wiki/Smart_Common_Input_Method) is intended
> >as a simple way to interface to many other input methods, such as XIM.
> >It may be worth a look.
> >
> >--- hendrik
> >
> >
> 


