[M3devel] Fw: UTF-16: Greek alphabet with CM3

Sun Dec 1 03:01:14 CET 2013

Windows is localized to, and runs well with, a great many languages, using 16-bit WCHAR.

 - Jay

> Date: Sat, 30 Nov 2013 19:16:21 -0500
> From: hendrik at topoi.pooq.com
> To: m3devel at elegosoft.com
> Subject: Re: [M3devel] Fw: UTF-16: Greek alphabet with CM3
> 
> On Sat, Nov 30, 2013 at 01:59:47PM -0600, Rodney M. Bates wrote:
> > 
> > 
> > On 11/30/2013 11:29 AM, Hendrik Boom wrote:
> > >On Sat, Nov 30, 2013 at 10:52:44AM -0600, Rodney M. Bates wrote:
> > >>Another devilish detail to be aware of:  UTF-16 is _not_ the same as
> > >>the current Modula-3 16-bit WIDECHAR, even when restricted to values
> > >><= 16_FFFF.  Current Wr/Rd library code just writes/reads
> > >>exactly 16 bits in two bytes, with whatever code point is in the
> > >>WIDECHAR variable.
> > >>
> > >>In contrast, UTF-16 will encode code points greater than
> > >>U+FFFF as a pair of 16-bit code units with surrogate values in them.
> > >>Then to make this work right, the surrogate values are not
> > >>allowed in unencoded variables.  So attempting to encode a surrogate
> > >>in UTF-16 is an error, and decoding a surrogate that is not part of a
> > >>proper first-surrogate/second-surrogate pair is "ill formed" and usually
> > >>decodes to U+FFFD.
> > >>
> > >>You could get by with treating these as interchangeable only by being
> > >>careful to ensure there is never either a surrogate code or a code
> > >>point > U+FFFF, in either input or output.
> > >>
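
To make the distinction above concrete, here is a minimal Python sketch. It is not from the thread: the function names are hypothetical, and it only models the encoding rules described above, not the actual cm3 Wr/Rd code.

```python
def utf16_code_units(cp):
    """Return the 16-bit code units UTF-16 uses for one code point."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    if 0xD800 <= cp <= 0xDFFF:
        # Surrogate values are reserved for pairs and may not be encoded.
        raise ValueError("surrogate code points cannot be encoded")
    if cp <= 0xFFFF:
        return [cp]                      # one unit, same value as the code point
    v = cp - 0x10000                     # 20 bits, split 10/10
    return [0xD800 | (v >> 10),          # first (high) surrogate
            0xDC00 | (v & 0x3FF)]        # second (low) surrogate

def interchangeable(cp):
    """True if a raw 16-bit value and its UTF-16 encoding coincide."""
    return cp <= 0xFFFF and not (0xD800 <= cp <= 0xDFFF)

print([hex(u) for u in utf16_code_units(0x03B1)])    # Greek small alpha
print([hex(u) for u in utf16_code_units(0x1F600)])   # needs a surrogate pair
```

Under these rules, a string is safe to treat either way only if interchangeable() holds for every code point in it.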
> > >>Also, current Wr/Rd always write/read only in little-endian byte order,
> > >>whereas there are both little- and big-endian variants of UTF-16.
> > >>I have no idea which endianness of UTF-16 is used by various GUI
> > >>libraries, but it would have to be little for this to work.
> > >
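
The byte-order point can be sketched the same way in Python. Again this is an illustration only: widechar_bytes_le is a hypothetical stand-in for "write exactly 16 bits, low byte first", as described above, and is not the cm3 library routine.

```python
import struct

def widechar_bytes_le(value):
    """Pack one 16-bit value as two bytes, little-endian (assumed Wr behaviour)."""
    return struct.pack("<H", value)

alpha = 0x03B1  # Greek small letter alpha

# For a non-surrogate value <= 0xFFFF the raw bytes match UTF-16LE exactly.
assert widechar_bytes_le(alpha) == "\u03b1".encode("utf-16-le")

# A consumer expecting the other byte order sees a different character.
misread = "\u03b1".encode("utf-16-be").decode("utf-16-le")
print(hex(ord(misread)))
```

So the two sides agree only if the GUI library reads and writes UTF-16LE.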
> > >It looks as if one might as well use UTF-8 if one is going to consider UTF-16.
> > 
> > Hmm.  Actually, *if* one could live with the restrictions on values above,
> > passing the same strings back and forth, with the GUI considering them UTF-16LE
> > and the Modula-3 app code considering them cm3's 16-bit WIDECHAR, would have
> > the advantage that the M3 app code could deal naturally in characters, rather
> > than varying numbers of fragments of characters.  UTF-8 would require
> > the latter.
> 
> And then we just wait for the potential user who can't, and we'll have 
> this discussion all over again.
> 
> With the disadvantage that we'll end up having to put still more 
> mechanisms for handling text everywhere.
> 
> -- hendrik
> 
> 
> > 
> > 
> > >
> > >I looked up XIM on Wikipedia (http://en.wikipedia.org/wiki/X_Input_Method)
> > >and it referred to newer systems, SCIM, uim, and IIMF.  IIMF appears to have
> > >been superseded by SCIM.  I don't know the status of uim, except that
> > >it has a uim bridge.
> > >
> > >It does look as if SCIM
> > >(http://en.wikipedia.org/wiki/Smart_Common_Input_Method) is intended
> > >as a simple way to interface to many other input methods, such as XIM.
> > >It may be worth a look.
> > >
> > >--- hendrik
> > >
> > >
> > 
