[M3devel] Simple change to WIDECHAR type

Mon Jul 2 17:04:25 CEST 2012

-Rodney Bates

--- dragisha at m3w.org wrote:

From: Dragiša Durić <dragisha at m3w.org>
To: Antony Hosking <antony.hosking at gmail.com>
Cc: m3devel <m3devel at elegosoft.com>
Subject: Re: [M3devel] Simple change to WIDECHAR type
Date: Sat, 30 Jun 2012 09:33:00 +0200

The current GetChar/SetChars and GetWideChar/SetWideChars are not character-level access methods in Unicode terms. They are "byte-level", fixed-width data accesses. Reason: strings based on either CHAR (cardinality 2^8) or WIDECHAR (cardinality 2^16) must use one or more elements to represent a single character from the whole of Unicode (cardinality about 2^20). If we must encode in any case, then WIDECHAR (as it is implemented/understood now) gives us no benefit at all!

To represent Unicode with TEXTs based on either CHAR or WIDECHAR, we must use either UTF-8 or UTF-16. Both are variable-length encodings, mapping one Unicode character to 1-4 CHARs or 1-2 WIDECHARs, respectively.
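
To make the 1-4 / 1-2 figures concrete, here is a small sketch in Python (used only for illustration, not Modula-3). It counts how many 8-bit units (one CHAR each) and 16-bit units (one WIDECHAR each) a few sample characters need. The helper name code_units and the sample characters are assumptions picked for the example.

```python
# Count the fixed-width elements each encoding needs per Unicode character.
# One UTF-8 code unit is 8 bits (a CHAR); one UTF-16 code unit is 16 bits
# (a WIDECHAR as currently defined).

def code_units(ch):
    """Return (UTF-8 unit count, UTF-16 unit count) for one character."""
    utf8_units = len(ch.encode("utf-8"))
    utf16_units = len(ch.encode("utf-16-le")) // 2  # two bytes per 16-bit unit
    return utf8_units, utf16_units

# U+0041, U+0161, U+20AC, U+1D11E
samples = ["A", "\u0161", "\u20ac", "\U0001D11E"]

for ch in samples:
    u8, u16 = code_units(ch)
    print(f"U+{ord(ch):04X}: {u8} CHAR(s), {u16} WIDECHAR(s)")
```

Only the last sample lies outside the 16-bit range, and it is the only one that needs two WIDECHARs.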

What exactly is the meaning (at Modula-3's usual levels of abstraction) of character-level access? Do we need whatever bit pattern physically happens to sit at some location in our data's representation? Or do we need the numerical value of an actual Unicode character, one that is visually distinguishable in writing? One from that set of about 2^20 elements?

What is the meaning of a Text.Sub() based on byte-level access operations, where the first element of the resulting TEXT is in fact the tail of some Unicode character's encoding? And/or where the last element is an invalid/incomplete prefix of some encoded character?
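
A sketch of that failure, again in Python for illustration only. The function sub_units is a hypothetical stand-in for an element-level Sub: it slices code units and knows nothing about character boundaries. It is not the Modula-3 Text.Sub implementation, and the sample string is an assumption.

```python
# Slicing encoded text by code units can cut a character in half.

def sub_units(data, start, length, unit_size=1):
    """Hypothetical element-level Sub: slice `length` code units from `start`."""
    return data[start * unit_size:(start + length) * unit_size]

def is_valid(data, encoding):
    """True if `data` decodes cleanly, i.e. holds only whole characters."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False

text = "na\u00efve \U0001D11E"     # 7 characters
utf8 = text.encode("utf-8")        # 11 eight-bit units
utf16 = text.encode("utf-16-le")   # 8 sixteen-bit units

# UTF-8: the result ends with an incomplete prefix of an encoded character.
print(is_valid(sub_units(utf8, 0, 3), "utf-8"))
# UTF-8: the result starts with the tail of an encoded character.
print(is_valid(sub_units(utf8, 3, 3), "utf-8"))
# UTF-16: the result ends with one half of a surrogate pair.
print(is_valid(sub_units(utf16, 0, 7, unit_size=2), "utf-16-le"))
# A slice that happens to fall on character boundaries is fine.
print(is_valid(sub_units(utf8, 0, 4), "utf-8"))
```

The first three slices are not valid text in their encoding. Only the last one, which happens to land on character boundaries, is.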

Since when is our priority fast and efficient operations that do something we don't need at all?

We are getting nothing at all with WIDECHAR. No. Single. Thing. WIDECHAR does not bring us any closer to Unicode. WIDECHAR, together with CHAR (in the context of our current TEXT), gives us two almost-solutions to the Unicode problem, and the existence of the WIDECHAR scalar type brings us a bit closer to the C world's Unicode almost-solution and nothing else.

-------------------------------------------------------------------------------------------------------------------------------------------
I think the only reason why we got nothing is that WIDECHAR isn't wide enough.  Let's fix that.
--------------------------------------------------------------------------------------------------------------------------------------- 

Currently, neither GetChar nor GetWideChar can get "the character at the nth position". Reason: there is no character scalar type able to hold an arbitrary Unicode character.

Solution:
======

* Redefine WIDECHAR to hold at least 21-bit values (Unicode code points run up to 16_10FFFF), or create UNICHAR or GLYPH (and leave WIDECHAR as it is for backward compatibility), so that we can hold unencoded Unicode characters in scalar values in our Modula-3 programs while preserving their properties.
* Implement the properties, relations and methods defined for Unicode. With ASCII, numeric order is everything. With Unicode, it is not. This is probably a very big project, but we can start somewhere and let interested parties build on it. Dirk Muysers has already done work in this regard.
* Whoever thinks we don't need this, and that our "tradition" and "legacy" are what matters, please read this: http://unicode.org/standard/WhatIsUnicode.html
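
A rough check of how wide such a scalar type has to be, sketched in Python for illustration. The highest Unicode code point is 16_10FFFF. The cardinalities 16_10000 and 16_100000 are the ones quoted in this thread; 16_110000 is added here only for comparison, and the helper name fits is an assumption.

```python
# How many values must a character scalar type have to hold every code point?

MAX_CODE_POINT = 0x10FFFF  # highest Unicode code point

def fits(cardinality):
    """True if a type with values 0 .. cardinality-1 holds every code point."""
    return cardinality > MAX_CODE_POINT

# Number of bits needed for the largest code point.
print("bits needed:", MAX_CODE_POINT.bit_length())

for cardinality in (0x10000, 0x100000, 0x110000):
    print(f"16_{cardinality:X}: covers all code points = {fits(cardinality)}")
```

By this check, an enumeration of 16_100000 elements still stops short of the last plane, while 16_110000 covers every code point.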

dd

On Jun 29, 2012, at 5:52 PM, Dragiša Durić wrote:

> That, or UTF-16 encoding on top of current WIDECHAR.
> 
> On Jun 29, 2012, at 3:50 PM, Antony Hosking wrote:
> 
>> That will change WIDECHAR from a value consuming 16 bits of memory into a value consuming 32 bits of memory.  In other words, all TEXT containing WIDECHAR will double in size.
>> 
>> On Jun 29, 2012, at 4:35 AM, Dragiša Durić wrote:
>> 
>>> m3front/src/builtinTypes/WCharr.m3, line:
>>> 
>>>  T := EnumType.New (16_10000, elts);
>>> 
>>> to
>>> 
>>>  T := EnumType.New (16_100000, elts);
>>> 
>>> Will this break things? Any other assumptions anywhere?
>>> 
>> 
>