[M3devel] A Unicode/WIDECHAR proposal

Rodney M. Bates rodney_bates at lcwb.coop
Tue Jul 10 17:57:04 CEST 2012


Here is a more-or-less comprehensive proposal to get modern support
of Unicode and its various encodings into Modula-3 and its libraries,
while preserving both backward compatibility and original abstractions.


Summary:

Fix WIDECHAR so it holds all of Unicode.  This restores the
abstractions we once had, by treating every character as a value of a
scalar type, for in-memory processing.  The members of a TEXT and
elements of ARRAY OF WIDECHAR get this property too.

Do encoding/decoding in streams Wr and Rd, which are inherently
sequential anyway.  Give every stream an encoding property.  Add
procedures to get/put characters with encoding/decoding.  These
changes are backward-compatible.

You can still do low-level stuff if you have good reason, or just want
to leave existing code alone.  E.g., putting the bytes of UTF-8 into
the characters of a TEXT and doing your own encoding/decoding.

CHAR:

Leave CHAR as it is: exactly 256 values, encoded in ISO-Latin-1,
ORD(FIRST(CHAR))=0, ORD(LAST(CHAR))=16_FF, BYTESIZE(CHAR)=1.  The
language allows CHAR to have more values, but changing this would no
doubt seriously undermine a good bit of existing code.

WIDECHAR:

Change WIDECHAR to have exactly the Unicode range.
ORD(FIRST(WIDECHAR))=0 and ORD(LAST(WIDECHAR))=16_10FFFF.  The full
ORD and VAL functions from/to WIDECHAR are defined by the code point
to character mapping of the Unicode standard.  BYTESIZE(WIDECHAR)=4.
Make actual internal representation be Unicode code points also.  This
happens to match UTF-32, most significantly for arrays of WIDECHAR.

Note that some of the code point values in this range are not Unicode
characters.  Programmers will need to account for this.
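
For example, the surrogate code points, 16_D800 through 16_DFFF, are
reserved by Unicode for the UTF-16 encoding and are never assigned to
characters.  A hypothetical helper for this check could look like:

   PROCEDURE IsSurrogate (w: WIDECHAR): BOOLEAN =
     (* TRUE if w is a surrogate code point, which Unicode reserves
        for UTF-16 and never assigns to a character. *)
     BEGIN
       RETURN 16_D800 <= ORD (w) AND ORD (w) <= 16_DFFF
     END IsSurrogate;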

CHAR <: WIDECHAR, which means they are mutually assignable, with
runtime check in the one direction.  This works because the Unicode
code points and the ISO-Latin-1 code points are identical in the
entire ISO-Latin-1 range, up to 16_FF.  Note that at 16_80 and above,
the UTF-8 encoding is more than one byte, none of them equal to the
encoded code point.  This is not a problem, because both CHAR and
WIDECHAR hold actual code points, not individual bytes of UTF-8.
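
A short sketch of the intended assignability, under the proposal's
semantics (the last assignment is there only to show the runtime
check):

   VAR c: CHAR; w: WIDECHAR;
   BEGIN
     w := 'A';                        (* always legal: CHAR <: WIDECHAR *)
     w := VAL (16_E9, WIDECHAR);      (* LATIN SMALL LETTER E WITH ACUTE *)
     c := w;                          (* passes the runtime check *)
     w := VAL (16_10FFFF, WIDECHAR);
     c := w;                          (* fails the runtime check *)
   END;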

TEXT:

TEXT continues to be defined, abstractly, as a sequence of WIDECHAR.  An
index into a TEXT is an integer count of characters.  The internal
representation (used only in memory, and maybe in pickles) is hidden
and could be just about anything.

Given the extreme memory inefficiency of the current cm3
implementation of TEXT, we no doubt will want to change it, but this
decision is independent and at a lower level.  The abstract interface
Text will hide this.

There is hardly a remaining need for Text.FromChar, because by
assignability, Text.FromWideChar can be used in its place, with the
same result.  But keep FromChar, for compatibility with existing code.

Text.FromChars just means the code points in the created text will
happen to be members of type CHAR.

Text.GetChar and Text.GetChars will raise an exception if a
to-be-gotten code point in the TEXT lies outside the type CHAR.  This
is a change from existing behavior, which just truncates the high bits
of a WIDECHAR value and returns only the low bits.  Even if we didn't
add the exception, we would want this to be an assignability runtime
error.
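
For example, using the proposal's \u escape (the exception itself is
left unnamed here, since this proposal does not fix a name for it):

   VAR t := Text.FromWideChar (W'\u263A'); (* WHITE SMILING FACE *)
   VAR c: CHAR;
   BEGIN
     c := Text.GetChar (t, 0);
     (* Raises: code point 16_263A lies outside CHAR.  The current
        implementation would instead silently return VAL (16_3A, CHAR),
        i.e. ':'. *)
   END;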

Literals:

Inside wide character and wide text literals, add two new escapes, \u,
which is followed by exactly 4 hex digits denoting a code point, and
\U, which is followed by exactly 6 hex digits.  The letters 'u' and
'U' are used in this way in the Unicode standard.  \u would be
redundant with the existing \x and \X escapes, which would be kept
merely for compatibility with existing code.  (Or is there so little
existing code using them that we could eliminate them for a more
consistent system?)
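
Under this proposal, literals could be written, e.g.:

   CONST
     Smiley = W'\u263A';           (* a BMP code point: 4 hex digits *)
     Clef   = W"motif: \U01D11E";  (* MUSICAL SYMBOL G CLEF: 6 hex digits *)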

Encodings:

Define an enumeration giving the possible encodings used in streams:

TYPE Encoding
    = {Inherit, ISO_Latin_1, UCS_2LE, UTF_8, UTF_16, UTF_16BE, UTF_16LE,
       UTF_32, UTF_32BE, UTF_32LE};

ISO_Latin_1 means one byte per character, unconditionally.  This is
the way current Modula-3 always encodes CHAR.  An attempt to Put a
code point greater than 16_FF in this encoding will raise an
exception. (This can happen only using newly added procedures.)

Similarly, UCS_2LE, as I understand the standard, means exactly two
bytes per character, LSB first.  This is what our current Wr and Rd
use for WIDECHAR.  Here again, an exception will be raised for a code
point greater than 16_FFFF.  This, also, can happen only using newly
added procedures.

Inherit means get the encoding to be used from somewhere else, for
example, from the file system, in case it is able to store this
property of a file.

Every Wr.T and every Rd.T has an Encoding property that can be
specified when creating the stream (via one of its subtypes).  The
ways of doing this can vary with the subtype.  This defaults to
Inherit, which means, if possible, take it from the file system, etc.
Otherwise, there are defaults for the various streams.

New operations that Put/Get Unicode characters have a parameter of
type Encoding, with a default value of Inherit, which means get the
encoding property from the stream.  Accepting this default would be
the usual way to use these procedures.
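
Typical use would then be a sketch like this, with the first call
taking the stream's own encoding and the second overriding it:

   Wr.PutUniText (wr, t);                              (* Enc defaults to Inherit *)
   Wr.PutUniWideChar (wr, ch, Enc := Encoding.UTF_8);  (* explicit override *)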

Specifying the encoding differently in the Put/Get procedure allows
mixed encodings in a single stream.  It seems dubious to encourage
this, but existing Wr and Rd already provide plenty of opportunities
to do similar stuff anyway, so this just extends existing semantics to
the new procedures.  It also allows some existing Put/Get procedures
to be defined as equivalents to new ones.

Wr:

New procedure

   PutUniWideChar (wr: T; ch: WIDECHAR; Enc: Encoding := Encoding.Inherit)

encodes the character using Enc and appends that to the stream.  There
is hardly a need for a CHAR counterpart.  Since CHAR is assignable to
WIDECHAR, PutUniWideChar suffices for an actual parameter of either
type.  Whether the caller provides a CHAR or a WIDECHAR (or whether we
were alternatively to have different procedures) does _not_ affect the
encoding, only the value range that can be passed in.
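
To make the encoding step concrete, here is a rough sketch of what
the UTF-8 case inside PutUniWideChar might do.  This is illustrative
only, not part of the proposal; Wr.PutChar serves to emit raw bytes:

   PROCEDURE PutUTF8 (wr: Wr.T; ch: WIDECHAR)
     RAISES {Wr.Failure, Thread.Alerted} =
     (* Append the UTF-8 encoding of ch to wr, one byte at a time. *)
     VAR n := ORD (ch);
     BEGIN
       IF n < 16_80 THEN                      (* 1 byte: 0xxxxxxx *)
         Wr.PutChar (wr, VAL (n, CHAR));
       ELSIF n < 16_800 THEN                  (* 2 bytes: 110xxxxx 10xxxxxx *)
         Wr.PutChar (wr, VAL (16_C0 + n DIV 16_40, CHAR));
         Wr.PutChar (wr, VAL (16_80 + n MOD 16_40, CHAR));
       ELSIF n < 16_10000 THEN                (* 3 bytes *)
         Wr.PutChar (wr, VAL (16_E0 + n DIV 16_1000, CHAR));
         Wr.PutChar (wr, VAL (16_80 + n DIV 16_40 MOD 16_40, CHAR));
         Wr.PutChar (wr, VAL (16_80 + n MOD 16_40, CHAR));
       ELSE                                   (* 4 bytes, up to 16_10FFFF *)
         Wr.PutChar (wr, VAL (16_F0 + n DIV 16_40000, CHAR));
         Wr.PutChar (wr, VAL (16_80 + n DIV 16_1000 MOD 16_40, CHAR));
         Wr.PutChar (wr, VAL (16_80 + n DIV 16_40 MOD 16_40, CHAR));
         Wr.PutChar (wr, VAL (16_80 + n MOD 16_40, CHAR));
       END;
     END PutUTF8;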

Similar new procedures PutUniString, PutUniWideString, and PutUniText
are counterparts to PutString, PutWideString, and PutText,
respectively.

Existing PutChar and PutString, which write CHARs as one byte each,
become equivalent to PutUniWideChar and PutUniString, with
Enc:=Encoding.ISO_Latin_1.  Similarly, existing PutWideChar and
PutWideString, which write WIDECHARs as two bytes each, become
equivalent to PutUniWideChar and PutUniWideString, with
Enc:=Encoding.UCS_2LE.
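
That is, under the proposal:

   Wr.PutChar (wr, c);
   (* becomes equivalent to *)
   Wr.PutUniWideChar (wr, c, Enc := Encoding.ISO_Latin_1);

   Wr.PutWideChar (wr, ch);
   (* becomes equivalent to *)
   Wr.PutUniWideChar (wr, ch, Enc := Encoding.UCS_2LE);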

The existing Wr interface is peculiar, IMO, in that even though there
is currently no distinction between a text and a wide text, we have
PutText and PutWideText.  These have identical signatures, both taking
a TEXT (which can contain characters in the full WIDECHAR range).  The
difference is that PutText rather violently truncates every character
in the text to 8 bits and writes that, implicitly in ISO-Latin-1
encoding.  This is not equivalent to PutUniText with
Enc:=Encoding.ISO_Latin_1, because the latter will raise an exception
for unencodable code points.
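
In sketch form, the difference is:

   Wr.PutText (wr, t);
   (* writes ORD(ch) MOD 16_100 for every ch in t: silent truncation *)

   Wr.PutUniText (wr, t, Enc := Encoding.ISO_Latin_1);
   (* raises an exception at the first ch with ORD(ch) > 16_FF *)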

Rd:

New procedure

   GetUniWideChar (rd: T; Enc: Encoding := Encoding.Inherit): WIDECHAR

decodes, using Enc, and consumes enough bytes from rd for one
Unicode code point, and returns it.  There is not a lot of need for a
CHAR-returning counterpart of GetUniWideChar.  A caller can just
assign the result from GetUniWideChar to a CHAR variable and deal with
the possible range error at the call site.
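
As with the Wr side, a rough sketch of the UTF-8 case inside
GetUniWideChar, illustrative only, with checks for malformed
sequences omitted:

   PROCEDURE GetUTF8 (rd: Rd.T): WIDECHAR
     RAISES {Rd.Failure, Rd.EndOfFile, Thread.Alerted} =
     (* Decode and consume one UTF-8-encoded code point from rd. *)
     VAR b := ORD (Rd.GetChar (rd)); n, extra: INTEGER;
     BEGIN
       IF    b < 16_80  THEN RETURN VAL (b, WIDECHAR)      (* 1-byte form *)
       ELSIF b >= 16_F0 THEN n := b MOD 16_08; extra := 3  (* 4-byte form *)
       ELSIF b >= 16_E0 THEN n := b MOD 16_10; extra := 2  (* 3-byte form *)
       ELSE                  n := b MOD 16_20; extra := 1  (* 2-byte form *)
       END;
       FOR i := 1 TO extra DO
         n := n * 16_40 + ORD (Rd.GetChar (rd)) MOD 16_40
       END;
       RETURN VAL (n, WIDECHAR)
     END GetUTF8;

Note that Rd.EndOfFile propagating out of the loop here is exactly
the partial-character-at-end-of-file case discussed below.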

GetUniSub, GetUniWideSub, GetUniSubLine, GetUniWideSubLine,
GetUniText, and GetUniTextLine are counterparts to GetSub, GetWideSub,
GetSubLine, GetWideSubLine, GetWideText, and GetWideLine.  They differ
in decoding according to the Enc parameter.

In the new GetUni* procedures, any case where a partial character is
terminated by end-of-file will raise an exception.  This differs from
the current GetWide* procedures, which all implicitly use UCS_2LE and
just insert a zero byte as the MSB in this case.

Existing GetChar, GetSub, GetSubLine, GetText, and GetLine all
implicitly use the ISO-Latin-1 encoding.  GetWideChar, GetWideSub,
GetWideSubLine, GetWideText, and GetWideLine all implicitly use
UCS_2LE.  They differ from new GetUni* procedures using UCS_2LE in
that the latter raise an exception on an incomplete character.

GetUniSub and GetUniSubLine return decoded characters in ARRAY OF CHAR
and raise an exception if a decoded code point is not in CHAR.  This
might seem a bit ridiculous, but they could be useful for quick,
partial adaptation of existing code to accept newer encodings and
detect, without otherwise handling, higher code points.
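
For example, to adapt an existing CHAR-based reader to UTF-8 input
(the exact signature of GetUniSub is assumed here, by analogy with
GetSub):

   VAR buf: ARRAY [0 .. 511] OF CHAR;
   VAR len := Rd.GetUniSub (rd, buf, Enc := Encoding.UTF_8);
   (* Decodes UTF-8, but still delivers CHARs; a decoded code point
      above 16_FF raises the exception instead of being mangled. *)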

Actually, GetWideText is documented as being identical to GetText, in
behavior, as well as signature.  I think this must be an editing
error.

I wonder if we need to review the rules for what constitutes a line
break.

A new UnGetUni would work like UnGetChar, but would reencode the
pushed-back character (retained internally as a WIDECHAR) according
to its Enc parameter.  The next Get* would then redecode according to
its Enc parameter or implicit encoding, which could be different and
consume a different number of bytes.  If this seems bizarre, note that
it continues established semantics.  Existing UnGetChar will push back
a character, implicitly in ISO-Latin-1, and it is possible to call
GetWideChar next, which will use the pushed-back byte plus the byte
following, decode in UCS-2LE, and return the result.  UnGetUni will be
more complicated to implement, but it can be done.

It seems odd that there is no UnGetWideChar.  UnGetUni with
Enc:=Encoding.UCS_2LE should accomplish this.
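
That is, something like (signature assumed, by analogy with
UnGetChar, which takes no character argument and pushes back the
most recently read one):

   Rd.UnGetUni (rd, Enc := Encoding.UCS_2LE);
   (* pushes back the last character, reencoded as two bytes, LSB first *)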

A UniCharsReady might be nice, but it would be O(n) for UTF-8 and
UTF-16.

Of course, these changes will require corresponding changes in several
other stream-related interfaces, particularly in providing ways to
specify (and interrogate?) an encoding property of a stream.

Compiler source file encoding:

Existing rules for interpretation (de facto, from the cm3
implementation) of wide character and wide string literals depend on
the encoding of the input file.  At present, the compiler always
assumes this is ISO-Latin-1.  If it actually is a UTF-8 file, as is
often the case today, this will result in incorrect conversion of
literals.

If, in our current implementation, the value of such a literal is then
written out by a Modula-3 program, unchanged, the program will write
ISO-Latin-1.  If some other program (e.g., an editor or terminal
emulator) interprets this output file as UTF-8, this second incorrect
reinterpretation undoes the first, reproducing the original string.
But if the
program manipulates the characters using the language-defined
abstraction, the result will in general be incorrect.

The same scenario applies when a single program reads, as
ISO-Latin-1, a file that was actually produced in UTF-8, and writes
its output in ISO-Latin-1, with that output file then being fed to
some other program that interprets it as UTF-8.



