[M3devel] UTF-16: Greek alphabet with CM3

Sun Dec 1 20:07:23 CET 2013

The problem with Unicode is compounded by the fact that most users have never really dealt with international fonts. People are usually not so good at solving problems they have never really felt.

My mother tongue, OTOH, is written in two scripts, and Unicode is the only way to cover all of it. Every Western European language, and most other European and near-European languages, is covered by a single ISO-8859-x 8-bit set. My language needs two of these: ISO-8859-2 for the Latin script and ISO-8859-5 for the Cyrillic one.

Also, I don’t think a fixed bit width for characters is crucial, as we can cover all uses with a good abstraction. Maybe like this one.

(* (C) 2013 Dragiša Durić, dragisha at m3w.org
*)
INTERFACE UTF8;

IMPORT RefSeq;

(* UTF8.T is a subtype of TEXT, and it is also a UTF8 encoded Unicode string. 
*)
TYPE
  Char = CARDINAL;

  T <: Public;
  Public = TEXT OBJECT 
  METHODS
    init(t: TEXT): T;

    isValid(): BOOLEAN;   (* so I can do all this without exceptions *)

    length(): CARDINAL;   (* in glyphs *)
    byteSize(): CARDINAL; (* in CHARs/bytes *)
    empty(): BOOLEAN;     (* shorter than ".length() = 0" *)

    (* hash(): Word.T;
       No need for this, here or anywhere. Text.Hash will be good enough when I get to it.
    *)

    sub(start: CARDINAL; length: CARDINAL := LAST(CARDINAL)): T;
    getText(start: CARDINAL := 0; length: CARDINAL := LAST(CARDINAL)): TEXT;
    getChar(pos: CARDINAL): Char;
    setChars(VAR a: ARRAY OF Char);

    pos(pat: T; start: CARDINAL := 0): INTEGER;
    (* Uses Boyer-Moore [1] for fast search; Observations 1 & 2 are currently implemented.

       [1] Boyer, Robert S.; Moore, J Strother (October 1977). "A Fast String Searching Algorithm".
           Comm. ACM (New York, NY, USA: Association for Computing Machinery) 20 (10): 762–772.
    *)
    findChar(ch: Char; start: CARDINAL := 0): INTEGER;
    findCharR(ch: Char; start: CARDINAL := LAST(CARDINAL)): INTEGER;
    (* findChar returns first position of ch to the right from start position, or start if ch is there.
       findCharR returns first position of ch to the left from start position, excluding start position.
    *)

    iterate(start: CARDINAL := 0; steps: CARDINAL := LAST(CARDINAL)): Iterator;
  END;
  (* All positional/count/length parameters for methods are in Unicode glyphs
  *)

  Iterator <: PublicIterator;
  PublicIterator = OBJECT
  METHODS
    next(VAR char: Char): BOOLEAN;
    (* TODO prev? *)
  END;

(* Construction
*)
PROCEDURE New(t: TEXT): T;

PROCEDURE Cat(u, t: T): T;

PROCEDURE FromChars(READONLY chars: ARRAY OF Char): T;

PROCEDURE FromCHARArray(VAR chars: ARRAY OF CHAR): T;

(* Validation. Checks for both a NIL value and an invalid UTF8 string.
*)
PROCEDURE IsValid(t: T): BOOLEAN;

(* Comparison/ordering
*)
PROCEDURE Equal(u, t: T): BOOLEAN;

PROCEDURE Compare(t1, t2: T): [-1..1];

(* Future UTF8Ops, here for now:
*)
PROCEDURE EscapeS(t, s: T; escapeWith: CHAR := '\134'): T;

PROCEDURE UnEscape(t: T; esc: CHAR := '\134'): T;

PROCEDURE SplitS(t, s: T; skipSucc: BOOLEAN := TRUE): RefSeq.T;
(* Treats escaped chars like normal ones. We need to define semantics for special treatment.
   Fri Apr 19 12:00 2013: For now, I only implement the skipSucc=TRUE case.
*)

PROCEDURE RemoveSpaces(t: T): T;

PROCEDURE Caps(t: T): T; (* This is probably titlecase in Unicode-speak. CHECK. Also, see what happens with Lows() in case we treat titlecase *)

PROCEDURE Lows(t: T): T;

END UTF8.
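
To make the difference between length() and byteSize() concrete, here is a minimal sketch. It is not the UTF8 implementation itself. It assumes the TEXT holds well-formed UTF-8 and that "glyphs" in the interface means Unicode code points; CodePoints and Greek are names made up for this example.

```modula3
MODULE Main;
(* Sketch only: counts code points in a TEXT assumed to hold well-formed
   UTF-8 bytes. CodePoints and Greek are hypothetical names. *)

IMPORT IO, Fmt, Text;

CONST
  (* The three letters alpha, beta, gamma as UTF-8 bytes (octal escapes):
     16_CE 16_B1, 16_CE 16_B2, 16_CE 16_B3 *)
  Greek = "\316\261\316\262\316\263";

PROCEDURE CodePoints(t: TEXT): CARDINAL =
  (* Every byte that is not a continuation byte (16_80 .. 16_BF)
     starts a new code point, so counting those bytes counts code points. *)
  VAR n: CARDINAL := 0;
  BEGIN
    FOR i := 0 TO Text.Length(t) - 1 DO
      WITH b = ORD(Text.GetChar(t, i)) DO
        IF b < 16_80 OR b > 16_BF THEN INC(n) END;
      END;
    END;
    RETURN n;
  END CodePoints;

BEGIN
  IO.Put("bytes:  " & Fmt.Int(Text.Length(Greek)) & "\n");
  IO.Put("glyphs: " & Fmt.Int(CodePoints(Greek)) & "\n");
END Main.
```

For the three Greek letters this reports 6 bytes and 3 glyphs, which is the gap the abstraction has to hide from the caller.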

On 01 Dec 2013, at 01:16, Hendrik Boom <hendrik at topoi.pooq.com> wrote:

> On Sat, Nov 30, 2013 at 01:59:47PM -0600, Rodney M. Bates wrote:
>> 
>> 
>> On 11/30/2013 11:29 AM, Hendrik Boom wrote:
>>> On Sat, Nov 30, 2013 at 10:52:44AM -0600, Rodney M. Bates wrote:
>>>> Another devilish detail to be aware of:  UTF-16 is _not_ the same as
>>>> the current Modula-3 16-bit WIDECHAR, even when restricted to values
>>>> <= 16_FFFF.  Current Wr/Rd library code just writes/reads
>>>> exactly 16 bits in two bytes, with whatever code point is in the
>>>> WIDECHAR variable.
>>>> 
>>>> In contrast, UTF-16 will encode code points greater than
>>>> U+FFFF as a pair of 16-bit code units with surrogate values in them.
>>>> Then to make this work right, the surrogate values are not
>>>> allowed in unencoded variables.  So attempting to encode a surrogate
>>>> in UTF-16 is an error, and decoding a surrogate that is not part of a
>>>> proper first-surrogate/second-surrogate pair is "ill formed" and usually
>>>> decodes to U+FFFD.
>>>> 
>>>> You could get by with treating these as interchangeable only by being
>>>> careful to ensure there is never either a surrogate code or a code
>>>> point > U+FFFF, in either input or output.
>>>> 
>>>> Also, current Wr/Rd always write/read only in little-endian byte order,
>>>> whereas there are both little- and big-endian variants of UTF-16.
>>>> I have no idea which endianness of UTF-16 is used by various GUI
>>>> libraries, but it would have to be little for this to work.
>>> 
>>> It looks as if one might as well use UTF-8 if one is going to consider UTF-16.
>> 
>> Hmm.  Actually, *if* one could live with the restrictions on values above,
>> passing the same strings back and forth, with the GUI considering them UTF-16LE
>> and the Modula-3 app code considering them cm3's 16-bit WIDECHAR, would have
>> the advantage that the M3 app code could deal naturally in characters, rather
>> than varying numbers of fragments of characters.  UTF-8 would require
>> the latter.
> 
> And then we just wait for the potential user who can't, and we'll have 
> this discussion all over again.
> 
> With the disadvantage that we'll end up having to put still more 
> mechanisms for handling text everywhere.
> 
> -- hendrik
> 
> 
>> 
>> 
>>> 
>>> I looked up XIM on Wikipedia (http://en.wikipedia.org/wiki/X_Input_Method)
>>> and it referred to newer systems, SCIM, uim, and IIMF.  IIMF appears to have
>>> been superseded by SCIM. I don't know the status of uim, except that
>>> it has a uim bridge.
>>> 
>>> It does look as if SCIM
>>> (http://en.wikipedia.org/wiki/Smart_Common_Input_Method) is intended
>>> as a simple way to interface to many other input methods, such as XIM.
>>> It may be worth a look.
>>> 
>>> --- hendrik
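
Rodney's point about surrogates, quoted above, can be sketched in a few lines. This only illustrates the standard UTF-16 encoding rule, not anything in the current Wr/Rd code; CodePoint, IsSurrogate, Encode and Show are hypothetical names.

```modula3
MODULE Main;
(* Sketch only: the UTF-16 encoding rule for a single code point.
   All names here are made up for illustration. *)

IMPORT IO, Fmt;

TYPE CodePoint = [0 .. 16_10FFFF];

PROCEDURE IsSurrogate(cp: CodePoint): BOOLEAN =
  BEGIN
    RETURN 16_D800 <= cp AND cp <= 16_DFFF;
  END IsSurrogate;

PROCEDURE Encode(cp: CodePoint; VAR hi, lo: CARDINAL): CARDINAL =
  (* Returns the number of 16-bit code units produced:
     0 if cp is itself a surrogate (an error), otherwise 1 or 2. *)
  VAR v: CARDINAL;
  BEGIN
    IF IsSurrogate(cp) THEN RETURN 0 END;
    IF cp <= 16_FFFF THEN hi := cp; RETURN 1 END;
    v := cp - 16_10000;               (* 20 significant bits remain *)
    hi := 16_D800 + v DIV 16_400;     (* first surrogate: top 10 bits *)
    lo := 16_DC00 + v MOD 16_400;     (* second surrogate: low 10 bits *)
    RETURN 2;
  END Encode;

PROCEDURE Show(cp: CodePoint) =
  VAR hi, lo: CARDINAL := 0;
  BEGIN
    IO.Put("16_" & Fmt.Int(cp, 16) & " -> ");
    CASE Encode(cp, hi, lo) OF
    | 1 => IO.Put("one unit: 16_" & Fmt.Int(hi, 16) & "\n");
    | 2 => IO.Put("pair: 16_" & Fmt.Int(hi, 16)
                  & " 16_" & Fmt.Int(lo, 16) & "\n");
    ELSE
      IO.Put("surrogate, not encodable\n");
    END;
  END Show;

BEGIN
  Show(16_03B1);   (* Greek small alpha: same value as a 16-bit WIDECHAR *)
  Show(16_10140);  (* above 16_FFFF: needs a surrogate pair *)
  Show(16_D800);   (* a surrogate itself: an error to encode *)
END Main.
```

A non-surrogate code point up to 16_FFFF comes out as one unit with the same value, which is the only range where a 16-bit WIDECHAR and UTF-16 agree. 16_10140 comes out as the pair 16_D800, 16_DD40, and that is exactly the case the plain 16-bit Wr/Rd path cannot represent.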
