[M3devel] Simple change to WIDECHAR type

Sat Jun 30 18:52:54 CEST 2012

I don't fully buy this. A 16-bit WIDECHAR is very useful on Windows.

It can be used directly with a vast vast vast vast number of functions.

A 32-bit char would require conversion to and from 16-bit all the time.

As well, there are no codepages when using 16-bit characters.

8 bit characters are interpreted in a way on/by Windows that varies per OS and per user

and which isn't stored with the string.

I realize that Modula-3 code doesn't necessarily use the same interpretation.
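
To make the code page point concrete, here is a small Python sketch (Python is used only for illustration; the byte value and the two Windows code pages are my own example, not anything taken from the Modula-3 runtime). The same 8-bit value decodes to different characters depending on which code page is assumed, while the 16-bit code unit needs no such guess.

```python
# The same byte, read under two Windows code pages (example values only).
raw = b"\xe9"

western = raw.decode("cp1252")   # Western European code page
cyrillic = raw.decode("cp1251")  # Cyrillic code page

# Print the resulting code points rather than the glyphs themselves.
print(hex(ord(western)), hex(ord(cyrillic)))

# As a 16-bit code unit there is nothing to guess: U+00E9 is always U+00E9.
utf16_units = "\u00e9".encode("utf-16-le")
print(utf16_units.hex())

# 7-bit data reads the same under every one of these encodings.
ascii_text = b"WIDECHAR"
same_everywhere = (
    ascii_text.decode("cp1252")
    == ascii_text.decode("cp1251")
    == ascii_text.decode("utf-8")
)
print(same_everywhere)
```

The byte alone does not say which of the two readings was intended; that information has to come from somewhere outside the string.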

"No codepages" is the advantage of UTF-8 -- pick one "code page",

if "code page" means "how to encode/decode more than 8 bits,

8 bits at a time".

Hope all the data is 7-bit clean, so it doesn't matter.

Otherwise convert to and from a lot.

I do understand that current Unicode requires 21 bits (code points run up to U+10FFFF),

and that a 32-bit character type is justifiable.

As I understand, this was debated when Unicode was first designed

but rejected as too large.
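
For reference, the size arithmetic can be checked directly. This is a minimal Python sketch, not Modula-3; the helper name code_units and the sample code points are my own choices for illustration. It shows how many bits the Unicode code space needs, whether the 16_100000 bound quoted below covers it, and how many CHAR-sized or WIDECHAR-sized units one code point takes in UTF-8 and UTF-16.

```python
MAX_CODE_POINT = 0x10FFFF  # highest Unicode code point


def code_units(code_point):
    """Return (UTF-8 bytes, UTF-16 16-bit units) needed for one code point."""
    ch = chr(code_point)
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-le")) // 2


bits_needed = MAX_CODE_POINT.bit_length()
code_space = MAX_CODE_POINT + 1  # code points 0 .. 0x10FFFF
print(bits_needed, code_space)

# 16_100000 (2^20) elements stop short of the top plane; 16_110000 covers it.
print(0x100000 >= code_space, 0x110000 >= code_space)

# Sample code points: ASCII, Latin-1, BMP, and one beyond the BMP.
for cp in (0x41, 0xE9, 0x20AC, 0x1F600):
    print(hex(cp), code_units(cp))
```

Only the last sample needs two 16-bit units, which is the case where a 16-bit WIDECHAR stops being one unit per character.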

 - Jay

----------------------------------------
> From: dragisha at m3w.org
> Date: Sat, 30 Jun 2012 09:33:00 +0200
> To: antony.hosking at gmail.com
> CC: m3devel at elegosoft.com
> Subject: Re: [M3devel] Simple change to WIDECHAR type
>
> Current GetChar/SetChars and GetWideChar/SetWideChars are not character-level access methods, in terms of Unicode. They are "byte-level", fixed width data accesses. Reason: Both CHAR (cardinality 2^8) and WIDECHAR (cardinality 2^16) based strings must use one or more characters to represent whole Unicode (cardinality 2^20). If we must encode in any case, then we don't have any benefit of WIDECHAR (as it is implemented/understood now) at all!
>
> To represent Unicode with either CHAR or WIDECHAR based TEXTs - we must use either UTF-8 or UTF-16. Both are one-to-multibyte encodings, encoding one Unicode character to either 1-4 CHARs or 1-2 WIDECHARs.
>
> What exactly is the meaning (at Modula-3's usual levels of abstraction) of character-level access? Do we need whatever bit pattern physically happens to sit at some location in our data's representation? Or do we need the numerical value of the actual Unicode character, the one that is visually distinguishable in written form? One from that set of 2^20 elements?
>
> What is the meaning of a Text.Sub() built on byte-level access operations, where the first character of the resulting TEXT is in fact the tail of some Unicode character's encoding? And/or where the last character is an incomplete prefix of some encoded character?
>
> Since when is our priority fast and efficient operations that do something we don't need at all?
>
> We are getting nothing at all with WIDECHAR. No. Single. Thing. WIDECHAR does not bring us any closer to Unicode. WIDECHAR, together with CHAR (in the context of our current TEXT), gives us two almost-solutions to the Unicode problem, and the existence of a WIDECHAR scalar type brings us a bit closer to the Unicode almost-solution of the C world and nothing else.
>
> Currently, neither GetChar nor GetWideChar can get "a character at nth position". Reason: there is no scalar character type that can hold an arbitrary Unicode character.
>
> Solution:
> ======
>
> * Redefine WIDECHAR to hold at least 20 bit values, or create UNICHAR or GLYPH (and leave WIDECHAR as it is for vertical compatibility) so we can hold unencoded Unicode characters in scalar values in our Modula-3 programs, while preserving their properties.
> * Implement properties, relations and methods defined for Unicode. With ASCII, numeric order is everything. With Unicode - it is not. This is probably very big project but we can start somewhere, and let interested parties build on it. Dirk Muysers did work in this regard already.
> * Whoever thinks we don't need this and our "tradition" and "legacy" are important, please read this: http://unicode.org/standard/WhatIsUnicode.html .
>
> dd
>
> On Jun 29, 2012, at 5:52 PM, Dragiša Durić wrote:
>
> > That, or UTF-16 encoding on top of current WIDECHAR.
> >
> > On Jun 29, 2012, at 3:50 PM, Antony Hosking wrote:
> >
> >> That will change WIDECHAR from a value consuming 16-bits of memory into a value consuming 32-bits of memory. In other words, all TEXT containing WIDECHAR will double in size.
> >>
> >> On Jun 29, 2012, at 4:35 AM, Dragiša Durić wrote:
> >>
> >>> m3front/src/builtinTypes/WCharr.m3, line:
> >>>
> >>> T := EnumType.New (16_10000, elts);
> >>>
> >>> to
> >>>
> >>> T := EnumType.New (16_100000, elts);
> >>>
> >>> Will this break things? Any other assumptions anywhere?
> >>>
> >>
> >
>