<html>
<head>
<style><!--
.hmmessage P
{
margin:0px;
padding:0px
}
body.hmmessage
{
font-size: 12pt;
font-family:Calibri
}
--></style></head>
<body class='hmmessage'><div dir='ltr'>Windows is localized to, and runs well with, very many languages, using 16-bit WCHAR.<BR> <BR> - Jay<br><br> <BR><div>> Date: Sat, 30 Nov 2013 19:16:21 -0500<br>> From: hendrik@topoi.pooq.com<br>> To: m3devel@elegosoft.com<br>> Subject: Re: [M3devel] Fw: UTF-16: Greek alphabet with CM3<br>> <br>> On Sat, Nov 30, 2013 at 01:59:47PM -0600, Rodney M. Bates wrote:<br>> > <br>> > <br>> > On 11/30/2013 11:29 AM, Hendrik Boom wrote:<br>> > >On Sat, Nov 30, 2013 at 10:52:44AM -0600, Rodney M. Bates wrote:<br>> > >>Another devilish detail to be aware of: UTF-16 is _not_ the same as<br>> > >>the current Modula-3 16-bit WIDECHAR, even when restricted to values<br>> > >><= 16_FFFF. Current Wr/Rd library code just writes/reads<br>> > >>exactly 16 bits in two bytes, with whatever code point is in the<br>> > >>WIDECHAR variable.<br>> > >><br>> > >>In contrast, UTF-16 will encode code points greater than<br>> > >>U+FFFF as a pair of 16-bit code units with surrogate values in them.<br>> > >>Then to make this work right, the surrogate values are not<br>> > >>allowed in unencoded variables. So attempting to encode a surrogate<br>> > >>in UTF-16 is an error, and decoding a surrogate that is not part of a<br>> > >>proper first-surrogate/second-surrogate pair is "ill formed" and usually<br>> > >>decodes to U+FFFD.<br>> > >><br>> > >>You could get by with treating these as interchangeable only by being<br>> > >>careful to ensure there is never either a surrogate code or a code<br>> > >>point > U+FFFF, in either input or output.<br>> > >><br>> > >>Also, current Wr/Rd always write/read only in little-endian byte order,<br>> > >>whereas there are both little- and big-endian variants of UTF-16.<br>> > >>I have no idea which endianness of UTF-16 is used by various GUI<br>> > >>libraries, but it would have to be little for this to work.<br>> > ><br>> > >It looks as if one might as well use UTF-8 if one is going to consider UTF-16.<br>> > <br>> > Hmm. 
Actually, *if* one could live with the restrictions on values above,<br>> > passing the same strings back and forth, with the GUI considering them UTF-16LE<br>> > and the Modula-3 app code considering them cm3's 16-bit WIDECHAR, would have<br>> > the advantage that the M3 app code could deal naturally in characters, rather<br>> > than varying numbers of fragments of characters. UTF-8 would require<br>> > the latter.<br>> <br>> And then we just wait for the potential user who can't, and we'll have <br>> this discussion all over again.<br>> <br>> With the disadvantage that we'll end up having to put still more <br>> mechanisms for handling text everywhere.<br>> <br>> -- hendrik<br>> <br>> <br>> > <br>> > <br>> > ><br>> > >I looked up XIM on Wikipedia (http://en.wikipedia.org/wiki/X_Input_Method),<br>> > >and it referred to newer systems, SCIM, uim, and IIMF. IIMF appears to have<br>> > >been superseded by SCIM, I don't know the status of uim, except that<br>> > >it has a uim bridge.<br>> > ><br>> > >It does look as if SCIM<br>> > >(http://en.wikipedia.org/wiki/Smart_Common_Input_Method) is intended<br>> > >as a simple way to interface to many other input methods, such as XIM.<br>> > >It may be worth a look.<br>> > ><br>> > >--- hendrik<br>> > ><br>> > ><br>> > <br></div> </div></body>
</html>