[M3devel] UTF-8 TEXT

Mon Jul 2 22:44:44 CEST 2012

Hi all:
I was thinking in back-end encoding of the CHARs in WIDECHAR using Rd/Wr-Rep but the mentioned modules are done around the idea of efficient machine implementation.
I just think that the only need for having a UTF-8 or whatever encoding for CHARs and WIDECHAR is in a machine with those types.
Numerous µ-coded "rare little" JVM machines are capable of handling that kind of Unicodes but anything else is just spurious to me, make that encoding for everybody in CM3.
There isn't any other machine with that byte encoding that I know about so the good news is that the machines are reduced to: 1) Industrial Size scenario JVM 2) Small sized vendor machines, a web browser client like a JS?

I hope with that we find some common ground for a solution for the issue.
Thanks in advance

--- El lun, 2/7/12, Hendrik Boom <hendrik at topoi.pooq.com> escribió:

De: Hendrik Boom <hendrik at topoi.pooq.com>
Asunto: Re: [M3devel] UTF-8 TEXT
Para: m3devel at elegosoft.com
Fecha: lunes, 2 de julio, 2012 11:54

On Mon, Jul 02, 2012 at 11:57:14AM -0400, Tony Hosking wrote:
> 
> On Jul 2, 2012, at 10:50 AM, Rodney Bates wrote:
> 
> > 
> > 
> > -Rodney Bates
> > 
> > --- antony.hosking at gmail.com wrote:
> > 
> >> From: Antony Hosking <antony.hosking at gmail.com>
> >> To: "Rodney M. Bates" <rodney_bates at lcwb.coop>
> >> Cc: "m3devel at elegosoft.com" <m3devel at elegosoft.com>
> >> Subject: Re: [M3devel] UTF-8 TEXT
> >> Date: Thu, 28 Jun 2012 10:37:36 -0400
> >> 
> >> Why not simply say that CHAR is an enumeration representing all of UTF-32?
> >> The current definition merely says that CHAR is an enumeration containing *at least* 256 elements.
> >> We would need to translate the current Latin-1 literals into UTF-32.
> >> And we could simply have a new literal form for Unicode literals.
> >> 
> > This is almost what I would propose to do, with a couple of differences:
> > 
> > Leave CHAR alone and fix WIDECHAR to handle the entire Unicode space.
> > I am sure there is lots of existing code that depends on the implementation
> > properties: ORD(FIRST(CHAR))=0, ORD(LAST(CHAR))=255, and BYTESIZE(CHAR)=1.
> 
> Fair enough.  Would we leave the encoding of CHAR as ISO-Latin-1?  We’d still need translation from ISO-Latin-1 to UTF-8 wouldn’t we?
> 
> > Then I would define, in the language itself, that WIDECHAR is Unicode, not
> > UTF-32.  Thus ORD(LAST(WIDECHAR))=16_10FFFF. Then I would make it an
> > implementation characteristic that BYTESIZE(WIDECHAR))=4.
> 
> I note this text from the Wikipedia entry for UTF-32:

I had just looked this paragraph up on Wikipedia to post it when I 
noticed you had already done so.

> 
> Though a fixed number of bytes per code point appear convenient, it is 
> not as useful as it appears. 

Wich is the gist of my objection to storing implementing TEXT as 
fixed-width 16, 20, or 32-bit storage units.  It wastes space without 
much gain.  (Exception might be made for a few languages that can be 
efficiently stored in 16 bits but not in UTF-8.)

> It makes truncation easier but not significantly so compared to UTF-8 
> and UTF-16.
> It does not make it faster to find a particular offset in the string, 
> as an "offset" can be measured in the fixed-size code units of any 
> encoding.

Exactly why I want character-extraction to be expressible in efficient 
"offsets" with implementation-independent specifications (though 
possibly implementatino-dependent values).  I don't mind if character 
counts are also made available, as long as it doesn't impose extra 
overhead on those that don't use them.  Operations with offsets that 
allow one to extract characters and skip over characters are sufficient 
for most purposes.  The use of efficient offsets is independent of the 
question of access to individual bytes.

> It does not make calculating the displayed width of a string easier 
> except in limited cases, since even with a “fixed width” font there 
> may be more than one code point per character position (combining 
> marks) or more than one character position per code point (for example 
> CJK ideographs). 
> Combining marks mean editors cannot treat one code point as being the 
> same as one unit for editing. Editors that limit themselves to 
> left-to-right languages and precomposed characters can take advantage 
> of fixed-sized code units, but such editors are unlikely to support 
> non-BMP characters and thus can work equally well with 16-bit UTF-16 
> encoding.

I'd like to point out that most string processing doesn't really deal in 
characters at all, but in terms of words, phrases, symbols, and other 
linguistic structures that have to be dealt with using parsing.  
Assembling bytes of UTF-8 into characters is just more parsing, and 
should be viewed as such.

For  many applications it isn't even necessary to decode UTF-8, because 
it can be copied without being aware of its character structure.
And it the language ascribes special meanings only to some of the first 
128 characters, these can be unambiguously recognised in UTF-8 without 
decoding UTF-8 at all.  This does argue for having byte access as well.

> 
> Does this argue against WIDECHAR=UTF-32?  Would we be better off simply saying WIDECHAR=UTF-16 and leaving things as they are?  Yes, it would make the definition of WideCharAt a little odd, because the index would be defined in 16-bit units rather than UTF-16 glyph code-points.
> 
> By the way, if we did change WIDECHAR to an enumeration containing 16_110000 elements then the stored (memory) size of WIDECHAR would be 4 bytes given the current CM3 implementation of enumerations, which chooses a (non-PACKED) stored size of 1/2/4/8 bytes depending on the number of elements.

16-bit WIDECHARs would seem to me to be the worst choice of all, except 
in the special case that you *know* that all the characters you'll ever 
have to deaal with fit in 16 bits and most of them won't fit in 8.

I'd use WIDECHAR when I'm dealing with individual 
characters/UnicodeCodepoints.  I'd use TEXT when dealing with strings.  
Or some custom data structure that can handle text containing strings 
and other data structure (suched as parse trees).  
Generally, there won't be a lot of WIDECHARS around in a running 
program, so I don't care much about the few extra bytes.

-- hendrik

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://m3lists.elegosoft.com/pipermail/m3devel/attachments/20120702/0e3c4392/attachment-0002.html>