[M3devel] Fw: UTF-16: Greek alphabet with CM3
Rodney M. Bates
rodney_bates at lcwb.coop
Sat Nov 30 20:59:47 CET 2013
On 11/30/2013 11:29 AM, Hendrik Boom wrote:
> On Sat, Nov 30, 2013 at 10:52:44AM -0600, Rodney M. Bates wrote:
>> Another devilish detail to be aware of: UTF-16 is _not_ the same as
>> the current Modula-3 16-bit WIDECHAR, even when restricted to values
>> <= 16_FFFF. Current Wr/Rd library code just writes/reads
>> exactly 16 bits in two bytes, with whatever code point is in the
>> WIDECHAR variable.
>>
>> In contrast, UTF-16 will encode code points greater than
>> U+FFFF as a pair of 16-bit code units with surrogate values in them.
>> Then, to make this work right, the surrogate values are not
>> allowed in unencoded variables. So attempting to encode a surrogate
>> in UTF-16 is an error, and decoding a surrogate that is not part of a
>> proper first-surrogate/second-surrogate pair is "ill-formed" and usually
>> decodes to U+FFFD.
>>
>> You could get by with treating these as interchangeable only by being
>> careful to ensure there is never either a surrogate code or a code
>> point > U+FFFF, in either input or output.
>>
>> Also, current Wr/Rd always write/read only in little-endian byte order,
>> whereas there are both little- and big-endian variants of UTF-16.
>> I have no idea which endianness of UTF-16 is used by various GUI
>> libraries, but it would have to be little for this to work.
>
> It looks as if one might as well use UTF-8 if one is going to consider UTF-16.
Hmm. Actually, *if* one could live with the restrictions on values above,
passing the same strings back and forth, with the GUI considering them UTF-16LE
and the Modula-3 app code considering them cm3's 16-bit WIDECHAR, would have
the advantage that the M3 app code could deal naturally in characters, rather
than varying numbers of fragments of characters. UTF-8 would require
the latter.
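
A minimal sketch of the restriction described above, written in Python rather
than Modula-3 only so that it can be run directly. The helper names
(raw_widechar_bytes, is_safe_for_both) are hypothetical and are not part of
any cm3 library; the raw writer just imitates "exactly 16 bits in two bytes,
little-endian" as an assumption about what Wr does.

```python
import struct

def raw_widechar_bytes(values):
    """Imitate a raw 16-bit write: each value as two bytes, little-endian,
    with no encoding step at all (hypothetical stand-in for Wr)."""
    return b"".join(struct.pack("<H", v) for v in values)

def is_safe_for_both(values):
    """True when no value is a surrogate (16_D800..16_DFFF) or > 16_FFFF,
    i.e. when raw 16-bit WIDECHAR and UTF-16LE should coincide."""
    return all(v <= 0xFFFF and not (0xD800 <= v <= 0xDFFF) for v in values)

# Greek letters are ordinary BMP code points, so both views agree.
greek = [ord(c) for c in "αβγ"]
greek_raw = raw_widechar_bytes(greek)
greek_utf16 = "αβγ".encode("utf-16-le")

# A code point above U+FFFF needs a surrogate pair in UTF-16 (4 bytes),
# and does not fit in a single 16-bit WIDECHAR at all.
big = "\U0001D6C2"  # MATHEMATICAL BOLD SMALL ALPHA
big_utf16 = big.encode("utf-16-le")

# Encoding a lone surrogate value as UTF-16 is an error.
try:
    chr(0xD800).encode("utf-16-le")
    lone_surrogate_rejected = False
except UnicodeEncodeError:
    lone_surrogate_rejected = True

# Decoding an unpaired surrogate is ill-formed; with errors="replace"
# the decoder substitutes U+FFFD.
decoded = b"\x00\xdcA\x00".decode("utf-16-le", errors="replace")

# Big-endian UTF-16 puts the same code units in the other byte order.
greek_utf16_be = "αβγ".encode("utf-16-be")
```

Here greek_raw and greek_utf16 come out byte-for-byte identical, while
greek_utf16_be differs, which is the endianness caveat from the earlier
message.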
>
> I looked up XIM on Wikipedia (http://en.wikipedia.org/wiki/X_Input_Method),
> and it referred to newer systems: SCIM, uim, and IIMF. IIMF appears to have
> been superseded by SCIM. I don't know the status of uim, except that
> it has a uim bridge.
>
> It does look as if SCIM
> (http://en.wikipedia.org/wiki/Smart_Common_Input_Method) is intended
> as a simple way to interface to many other input methods, such as XIM.
> It may be worth a look.
>
> --- hendrik
>
>