[M3devel] Fw: UTF-16: Greek alphabet with CM3

Hendrik Boom hendrik at topoi.pooq.com
Sat Nov 30 18:29:06 CET 2013


On Sat, Nov 30, 2013 at 10:52:44AM -0600, Rodney M. Bates wrote:
> Another devilish detail to be aware of:  UTF-16 is _not_ the same as
> the current Modula-3 16-bit WIDECHAR, even when restricted to values
> <= 16_FFFF.  Current Wr/Rd library code just writes/reads
> exactly 16 bits in two bytes, with whatever code point is in the
> WIDECHAR variable.
> 
> In contrast, UTF-16 will encode code points greater than
> U+FFFF as a pair of 16-bit code units with surrogate values in them.
> Then to make this work right, the surrogate values are not
> allowed in unencoded variables.  So attempting to encode a surrogate
> in UTF-16 is an error, and decoding a surrogate that is not part of a
> proper first-surrogate/second-surrogate pair is "ill formed" and usually
> decodes to U+FFFD.
> 
> You could get by with treating these as interchangeable only by being
> careful to ensure there is never a surrogate code or a code
> point > U+FFFF, in either input or output.
> 
> Also, current Wr/Rd always write/read only in little-endian byte order,
> whereas there are both little- and big-endian variants of UTF-16.
> I have no idea which endianness of UTF-16 is used by various GUI
> libraries, but it would have to be little for this to work.
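
The difference described above can be made concrete with a small sketch.
It is written in Python rather than Modula-3, purely for illustration, and
it does not model the actual Wr/Rd code. All function names here are
hypothetical. It shows the surrogate-pair arithmetic, the usual U+FFFD
replacement for an unpaired surrogate, and the two byte orders.

```python
import struct

REPLACEMENT = 0xFFFD  # U+FFFD, the usual stand-in for ill-formed input


def is_surrogate(value):
    """True if value lies in the surrogate range U+D800..U+DFFF."""
    return 0xD800 <= value <= 0xDFFF


def utf16_encode(code_points):
    """Turn code points into a list of 16-bit UTF-16 code units."""
    units = []
    for cp in code_points:
        if cp < 0 or cp > 0x10FFFF or is_surrogate(cp):
            # Surrogates are not valid unencoded values, so this is an error.
            raise ValueError("cannot encode %#x in UTF-16" % cp)
        if cp <= 0xFFFF:
            units.append(cp)  # one code unit, same value as the code point
        else:
            v = cp - 0x10000  # 20 bits, split 10/10
            units.append(0xD800 | (v >> 10))    # first (high) surrogate
            units.append(0xDC00 | (v & 0x3FF))  # second (low) surrogate
    return units


def utf16_decode(units):
    """Turn 16-bit code units back into code points.

    An unpaired surrogate decodes to U+FFFD (one common policy).
    """
    out = []
    i = 0
    while i < len(units):
        u = units[i]
        nxt = units[i + 1] if i + 1 < len(units) else None
        if 0xD800 <= u <= 0xDBFF and nxt is not None and 0xDC00 <= nxt <= 0xDFFF:
            out.append(0x10000 + ((u - 0xD800) << 10) + (nxt - 0xDC00))
            i += 2
        elif is_surrogate(u):
            out.append(REPLACEMENT)
            i += 1
        else:
            out.append(u)
            i += 1
    return out


def units_to_bytes(units, little_endian=True):
    """Serialize code units as two bytes each, in the chosen byte order."""
    fmt = "<H" if little_endian else ">H"
    return b"".join(struct.pack(fmt, u) for u in units)
```

For Greek letters such as U+03B1 the code unit equals the code point, so
writing the raw 16-bit value in little-endian order gives the same bytes
as UTF-16LE. The two only diverge for surrogate values and for code
points above U+FFFF.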

It looks as if one might as well use UTF-8 if one is going to consider UTF-16.

I looked up XIM on Wikipedia (http://en.wikipedia.org/wiki/X_Input_Method),
and it referred to newer systems: SCIM, uim, and IIIMF.  IIIMF appears to have
been superseded by SCIM.  I don't know the status of uim, except that
it has a uim bridge.

It does look as if SCIM 
(http://en.wikipedia.org/wiki/Smart_Common_Input_Method) is intended 
as a simple way to interface to many other input methods, such as XIM.  
It may be worth a look.

--- hendrik