[M3devel] Fw: UTF-16: Greek alphabet with CM3
Rodney M. Bates
rodney_bates at lcwb.coop
Sun Dec 1 19:01:21 CET 2013
On 11/30/2013 06:16 PM, Hendrik Boom wrote:
> On Sat, Nov 30, 2013 at 01:59:47PM -0600, Rodney M. Bates wrote:
>>
>>
>> On 11/30/2013 11:29 AM, Hendrik Boom wrote:
>>> On Sat, Nov 30, 2013 at 10:52:44AM -0600, Rodney M. Bates wrote:
>>>> Another devilish detail to be aware of: UTF-16 is _not_ the same as
>>>> the current Modula-3 16-bit WIDECHAR, even when restricted to values
>>>> <= 16_FFFF. Current Wr/Rd library code just writes/reads
>>>> exactly 16 bits in two bytes, with whatever code point is in the
>>>> WIDECHAR variable.
>>>>
>>>> In contrast, UTF-16 will encode code points greater than
>>>> U+FFFF as a pair of 16-bit code units with surrogate values in them.
>>>> Then to make this work right, the surrogate values are not
>>>> allowed in unencoded variables. So attempting to encode a surrogate
>>>> in UTF-16 is an error, and decoding a surrogate that is not part of a
>>>> proper first-surrogate/second-surrogate pair is "ill-formed" and usually
>>>> decodes to U+FFFD.
>>>>
>>>> You could get by with treating these as interchangeable only by being
>>>> careful to ensure there is never either a surrogate code or a code
>>>> point > U+FFFF, in either input or output.
>>>>
>>>> Also, current Wr/Rd always write/read only in little-endian byte order,
>>>> whereas there are both little- and big-endian variants of UTF-16.
>>>> I have no idea which endianness of UTF-16 is used by various GUI
>>>> libraries, but it would have to be little for this to work.
>>>
>>> It looks as if one might as well use UTF-8 if one is going to consider UTF-16.
>>
>> Hmm. Actually, *if* one could live with the restrictions on values above,
>> passing the same strings back and forth, with the GUI considering them UTF-16LE
>> and the Modula-3 app code considering them cm3's 16-bit WIDECHAR, would have
>> the advantage that the M3 app code could deal naturally in characters, rather
>> than varying numbers of fragments of characters. UTF-8 would require
>> the latter.
>
> And then we just wait for the potential user who can't, and we'll have
> this discussion all over again.
>
> With the disadvantage that we'll end up having to put still more
> mechanisms for handling text everywhere.
>
> -- hendrik
>
>
Yes, I agree completely, in general. I should have stated that, with that
idea, I was thinking specifically of Elmar's problem.
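
To make the restriction concrete, here is a minimal sketch in Python
(not Modula-3, and not the cm3 library code). The helper names
is_bmp_scalar, interchangeable, raw_16bit_le and utf16_units are
hypothetical, made up for illustration. raw_16bit_le only mimics the
behaviour described above for Wr: two bytes per character, low byte
first.

```python
import struct

def is_bmp_scalar(cp: int) -> bool:
    """True if cp fits in one 16-bit unit and is not a surrogate value."""
    return 0 <= cp <= 0xFFFF and not (0xD800 <= cp <= 0xDFFF)

def interchangeable(text: str) -> bool:
    """True if raw 16-bit storage and UTF-16 agree for every character."""
    return all(is_bmp_scalar(ord(ch)) for ch in text)

def raw_16bit_le(text: str) -> bytes:
    """Assumed stand-in for the described Wr behaviour: write each value
    as exactly two bytes, little-endian. Only valid for values <= 0xFFFF."""
    return b"".join(struct.pack("<H", ord(ch)) for ch in text)

def utf16_units(cp: int) -> tuple:
    """UTF-16 code units for one code point (surrogates are rejected)."""
    if 0xD800 <= cp <= 0xDFFF:
        raise ValueError("surrogate values cannot be encoded")
    if cp <= 0xFFFF:
        return (cp,)
    v = cp - 0x10000                      # 20 bits remain
    return (0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF))

greek = "αβγ"
# Greek letters are below U+FFFF and are not surrogates, so both views agree.
assert interchangeable(greek)
assert raw_16bit_le(greek) == greek.encode("utf-16-le")

# A code point above U+FFFF needs a surrogate pair in UTF-16.
assert not interchangeable("\U0001D6FC")
assert utf16_units(0x1D6FC) == (0xD835, 0xDEFC)

# UTF-8 uses a varying number of bytes per character.
assert [len(ch.encode("utf-8")) for ch in "aα"] == [1, 2]
```

For Greek text the two representations are byte-for-byte identical, which
is why the idea could work for this one case. The last line shows the
"fragments of characters" cost that UTF-8 would bring instead.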
>>
>>
>>>
>>> I looked up XIM on Wikipedia (http://en.wikipedia.org/wiki/X_Input_Method),
>>> and it referred to newer systems, SCIM, uim, and IIMF. IIMF appears to have
>>> been superseded by SCIM; I don't know the status of uim, except that
>>> it has a uim bridge.
>>>
>>> It does look as if SCIM
>>> (http://en.wikipedia.org/wiki/Smart_Common_Input_Method) is intended
>>> as a simple way to interface to many other input methods, such as XIM.
>>> It may be worth a look.
>>>
>>> --- hendrik
>>>
>>>
>>
>