<table cellspacing="0" cellpadding="0" border="0" ><tr><td valign="top" style="font: inherit;">Hi all:<br>WIDECHAR is a simulation of a word-sized character, which is not what the Rd/Wr implementation was designed for; given how reading and writing of literals work, you will not get any real speed improvement over a DEC-SRC source-to-source transliteration of a given literal.<br>That is to say, what you want is essentially the CM3 TEXT type with better functionality; it is better to provide polymorphic functions.<br>E.g., have FromChar accept both kinds of character, without losing the DEC-SRC representation characteristic, and return what you want polymorphically (for instance, your text-file editor assumes you don't have real wide strings, just one raw stream; then you can load the text file into memory efficiently with an encoder optimized for your architecture and read it there wherever you want; conversely, opening a file not already loaded, you have to convert it at execution time, etc.).<br>Thanks in advance<br><br>--- On <b>Tue, 10/7/12, Rodney M. Bates <i><rodney_bates@lcwb.coop></i></b> wrote:<br><blockquote style="border-left: 2px solid rgb(16, 16, 255); margin-left: 5px; padding-left: 5px;"><br>From: Rodney M. Bates <rodney_bates@lcwb.coop><br>Subject: [M3devel] A Unicode/WIDECHAR proposal<br>To: "m3devel" <m3devel@elegosoft.com><br>Date: Tuesday, July 10, 2012, 10:57<br><br><div class="plainMail">Here is a more-or-less comprehensive proposal to get modern support<br>of Unicode and its various encodings into Modula-3 and its libraries,<br>while preserving both backward compatibility and original abstractions.<br><br><br>Summary:<br><br>Fix WIDECHAR so it holds all of Unicode. This restores the<br>abstractions we once had, by treating every character as a value of a<br>scalar type, for in-memory processing. The members of a TEXT and<br>elements of ARRAY OF WIDECHAR get this property
too.<br><br>Do encoding/decoding in streams Wr and Rd, which are inherently<br>sequential anyway. Give every stream an encoding property. Add<br>procedures to get/put characters with encoding/decoding. These<br>changes are backward-compatible.<br><br>You can still do low-level stuff if you have good reason, or just want<br>to leave existing code alone. E.g., putting the bytes of UTF-8 into<br>the characters of a TEXT and doing your own encoding/decoding.<br><br>CHAR:<br><br>Leave CHAR as it is: exactly 256 values, encoded in ISO-Latin-1,<br>ORD(FIRST(CHAR))=0, ORD(LAST(CHAR))=16_FF, BYTESIZE(CHAR)=1. The<br>language allows CHAR to have more values, but changing this would no<br>doubt seriously undermine a good bit of existing code.<br><br>WIDECHAR:<br><br>Change WIDECHAR to have exactly the Unicode range.<br>ORD(FIRST(WIDECHAR))=0 and ORD(LAST(WIDECHAR))=16_10FFFF. The full<br>ORD and VAL functions from/to WIDECHAR
are defined by the code point<br>to character mapping of the Unicode standard. BYTESIZE(WIDECHAR)=4.<br>Make the actual internal representation be Unicode code points also. This<br>happens to match UTF-32, most significantly for arrays of WIDECHAR.<br><br>Note that some of the code point values in this range are not Unicode<br>characters. Programmers will need to account for this.<br><br>CHAR <: WIDECHAR, which means they are mutually assignable, with<br>runtime check in the one direction. This works because the Unicode<br>code points and the ISO-Latin-1 code points are identical in the<br>entire ISO-Latin-1 range, up to 16_FF. Note that at 16_80 and above,<br>the UTF-8 encoding is more than one byte, none of them equal to the<br>encoded code point. This is not a problem, because both CHAR and<br>WIDECHAR are actual code points, not one of the bytes of a UTF-8<br>encoding.<br><br>TEXT:<br><br>TEXT continues to be defined as abstractly a
sequence of WIDECHAR. An<br>index into a TEXT is an integer count of characters. The internal<br>representation (used only in memory, and maybe in pickles) is hidden<br>and could be just about anything.<br><br>Given the extreme memory inefficiency of the current cm3<br>implementation of TEXT, we no doubt will want to change it, but this<br>decision is independent and at a lower level. The abstract interface<br>Text will hide this.<br><br>There is hardly a remaining need for Text.FromChar, because by<br>assignability, Text.FromWideChar can be used in its place, with the<br>same result. But keep FromChar, for compatibility with existing code.<br><br>Text.FromChars just means the code points in the created text will<br>happen to be members of type CHAR.<br><br>Text.GetChar and Text.GetChars will raise an exception if a<br>to-be-gotten code point in the TEXT lies outside the type CHAR. This<br>is a change from existing
behavior, which just truncates the high bits<br>of a WIDECHAR value and returns only the low bits. Even if we didn't<br>add the exception, we would want this to be an assignability runtime<br>error.<br><br>Literals:<br><br>Inside wide character and wide text literals, add two new escapes, \u,<br>which is followed by exactly 4 hex digits denoting a code point, and<br>\U, which is followed by exactly 6 hex digits. The letters 'u' and<br>'U' are used in this way in the Unicode standard. \u would be<br>redundant with the existing \x and \X escapes, but those would merely<br>preserve compatibility for existing code. (Or is there so little<br>existing code using them that we could eliminate them for a more<br>consistent system?)<br><br>Encodings:<br><br>Define an enumeration giving the possible encodings used in streams:<br><br>TYPE Encoding<br> = {Inherit, ISO_Latin_1, UCS_2LE, UTF_8, UTF_16,
UTF_16LE,<br> UTF_32, UTF_32BE, UTF_32LE};<br><br>ISO_Latin_1 means one byte per character, unconditionally. This is<br>the way current Modula-3 always encodes CHAR. An attempt to Put a<br>code point greater than 16_FF in this encoding will raise an<br>exception. (This can happen only using newly added procedures.)<br><br>Similarly, UCS_2LE, as I understand the standard, means exactly two<br>bytes per character, LSB first. This is what our current Wr and Rd<br>use for WIDECHAR. Here again, an exception will be raised for a code<br>point greater than 16_FFFF. This, also, can happen only using newly<br>added procedures.<br><br>Inherit means get the encoding to be used from somewhere else, for<br>example, from the file system, in case it is able to store this<br>property of a file.<br><br>Every Wr.T and every Rd.T has an Encoding property that can be<br>specified when creating the stream (from one of its
subtypes). The<br>ways of doing this can vary with the subtype. This defaults to<br>Inherit, which means, if possible, take it from the file system, etc.<br>Otherwise, there are defaults for the various streams.<br><br>New operations that Put/Get Unicode characters have a parameter of<br>type Encoding, with a default value of Inherit, which means get the<br>encoding property from the stream. Accepting this default would be<br>the usual way to use these procedures.<br><br>Specifying the encoding differently in the Put/Get procedure allows<br>mixed encodings in a single stream. It seems dubious to encourage<br>this, but existing Wr and Rd already provide plenty of opportunities<br>to do similar stuff anyway, so this just extends existing semantics to<br>the new procedures. It also allows some existing Put/Get procedures<br>to be defined as equivalents to new ones.<br><br>Wr:<br><br>New procedure<br><br>
PutUniWideChar(Wr: T; ch: WIDECHAR; Enc:=Encoding.Inherit)<br><br>encodes the character using Enc and appends that to the stream. There<br>is hardly a need for a CHAR counterpart. Since CHAR is assignable to<br>WIDECHAR, PutUniWideChar suffices for an actual parameter of either<br>type. Whether the caller provides a CHAR or a WIDECHAR (or whether we<br>were alternatively to have different procedures) does _not_ affect the<br>encoding, only the value range that can be passed in.<br><br>Similar new procedures PutUniString, PutUniWideString, and PutUniText<br>are counterparts to PutString, PutWideString, and PutText,<br>respectively.<br><br>Existing PutChar and PutString, which write CHARs as one byte, each<br>become equivalent to PutUniWideChar and PutUniString, with<br>Enc:=Encoding.ISO_Latin_1. Similarly, existing PutWideChar and<br>PutWideString, which write WIDECHARs as two bytes each, become<br>equivalent to PutUniWideChar
and PutUniWideString, with<br>Enc:=Encoding.UCS_2LE.<br><br>The existing Wr interface is peculiar, IMO, in that even though there<br>is currently no distinction between a text and a wide text, we have<br>PutText and PutWideText. These have identical signatures, both taking<br>a TEXT (which can contain characters in the full WIDECHAR range). The<br>difference is that PutText rather violently truncates every character<br>in the text to 8 bits and writes that, implicitly in ISO-Latin-1<br>encoding. This is not equivalent to PutUniText with<br>Enc:=Encoding.ISO_Latin_1, because the latter will raise an exception<br>for unencodable code points.<br><br>Rd:<br><br>New procedure<br><br> GetUniWideChar (rd:T; Enc:=Encoding.Inherit) :WIDECHAR<br><br>decodes, using Enc, and consumes, enough bytes from rd for one Unicode<br>code point and returns it. There is not a lot of need for a<br>CHAR-returning counterpart of
GetUniWideChar. A caller can just<br>assign the result from GetUniWideChar to a CHAR variable and deal with<br>the possible range error at the call site.<br><br>GetUniSub, GetUniWideSub, GetUniSubLine, GetUniWideSubLine,<br>GetUniText, and GetUniTextLine are counterparts to GetSub, GetWideSub,<br>GetSubLine, GetWideSubLine, GetWideText, and GetWideLine. They differ<br>in decoding according to the Enc parameter.<br><br>In the new GetUni* procedures, any case where a partial character is<br>terminated by end-of-file will raise an exception. This differs from<br>the current GetWide* procedures, which all implicitly use UCS_2LE and<br>just insert a zero byte as the MSB in this case.<br><br>Existing GetChar, GetSub, GetSubLine, GetText, and GetLine all<br>implicitly use the ISO-Latin-1 encoding. GetWideChar, GetWideSub,<br>GetWideSubLine, GetWideText, and GetWideLine all implicitly use<br>UCS_2LE. They differ from new GetUni*
procedures using UCS_2LE in<br>that the latter raise an exception on an incomplete character.<br><br>GetUniSub and GetUniSubLine return decoded characters in ARRAY OF CHAR<br>and raise an exception if a decoded code point is not in CHAR. This<br>might seem a bit ridiculous, but they could be useful for quick,<br>partial adaptation of existing code to accept newer encodings and<br>detect, without otherwise handling, higher code points.<br><br>Actually, GetWideText is documented as being identical to GetText, in<br>behavior, as well as signature. I think this must be an editing<br>error.<br><br>I wonder if we need to review the rules for what constitutes a line<br>break.<br><br>A new UnGetUni would work like UnGetChar, but would reencode the<br>pushed-back character (retained internally as a WIDECHAR), according<br>to its Enc parameter. The next Get* would then redecode according to<br>its Enc parameter or implicit encoding, which could
be different and<br>consume a different number of bytes. If this seems bizarre, note that<br>it continues established semantics. Existing UnGetChar will push back<br>a character, implicitly in ISO-Latin-1, and it is possible to call<br>GetWideChar next, which will use the pushed-back byte plus the byte<br>following, decode in UCS-2LE, and return the result. UnGetUni will be<br>more complicated to implement, but it can be done.<br><br>It seems odd that there is no UnGetWideChar. UnGetUni with<br>Enc:=Encoding.UCS_2LE should accomplish this.<br><br>A UniCharsReady might be nice, but it would be O(n), for UTF-8 and<br>UTF-16.<br><br>Of course, these changes will require corresponding changes in several<br>other stream-related interfaces, particularly in providing ways to<br>specify (and interrogate?) an encoding property of a stream.<br><br>Compiler source file encoding:<br><br>Existing rules for interpretation (de facto, from the
cm3<br>implementation) of wide character and wide string literals depend on<br>the encoding of the input file. At present, the compiler always<br>assumes this is ISO-Latin-1. If it actually is a UTF-8 file, as is<br>often the case today, this will result in incorrect conversion of<br>literals.<br><br>If, in our current implementation, the value of such a literal is then<br>written out by a Modula-3 program, unchanged, the program will write<br>ISO-Latin-1. If some other program (e.g., an editor or terminal<br>emulator) interprets this output file as UTF-8, the reverse incorrect<br>reinterpretation will result in the original string. But if the<br>program manipulates the characters using the language-defined<br>abstraction, the result will in general be incorrect.<br><br>The same scenario applies when a single program reads, as ISO-Latin-1,<br>a file that was produced in UTF-8, and writes in ISO-Latin-1, with the<br>output file then
being fed to some other program that interprets it<br>as UTF-8.<br><br></div></blockquote></td></tr></table>
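The mis-round-trip described at the end of the proposal can be sketched in a few lines of Python, used here only as a stand-in for the Modula-3 Rd/Wr behavior; the byte-level facts are language-independent:

```python
# The scenario from the proposal: a UTF-8 file read under the assumption
# that it is ISO-Latin-1 (what cm3 currently does with source literals).
original = "café"                        # 'é' is one code point, U+00E9
utf8_bytes = original.encode("utf-8")    # but two bytes in UTF-8: C3 A9

misread = utf8_bytes.decode("latin-1")   # each byte becomes one "character"
assert misread == "caf\u00c3\u00a9"      # the 'é' has become 'Ã©'
assert len(misread) == 5                 # the character abstraction is broken

# Written back out unchanged as ISO-Latin-1, the bytes survive, so a
# downstream UTF-8 reader recovers the original string.
assert misread.encode("latin-1") == utf8_bytes
assert misread.encode("latin-1").decode("utf-8") == original

# But manipulating characters under the wrong interpretation corrupts the
# text: uppercasing leaves the bogus byte pair alone, so the result is no
# longer the uppercase of the original.
mangled = misread.upper().encode("latin-1").decode("utf-8")
assert mangled == "CAFé"
assert mangled != original.upper()       # which would be "CAFÉ"
```

This is why the round trip looks harmless as long as the program treats the text as an opaque byte sequence, yet breaks as soon as it uses the language-defined character abstraction.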