<table cellspacing="0" cellpadding="0" border="0" ><tr><td valign="top" style="font: inherit;">Hi all:<br>I was thinking about doing the back-end encoding of CHARs in WIDECHAR using Rd/Wr-Rep, but those modules are built around the idea of efficient machine implementation.<br>I think the only real need for a UTF-8 (or any other) encoding of CHAR and WIDECHAR is on a machine that has those types.<br>Numerous µ-coded "rare little" JVM machines can handle that kind of Unicode, but imposing that encoding on everybody in CM3 for anything else seems spurious to me.<br>I don't know of any other machine with that byte encoding, so the good news is that the target machines reduce to: 1) an industrial-size JVM scenario; 2) small vendor machines, such as a web browser client running JS?<br><br>I hope this helps us find some common ground for a solution to the issue.<br>Thanks in advance<br><br>--- On <b>Mon, 2/7/12, Hendrik Boom
<i>&lt;hendrik@topoi.pooq.com&gt;</i></b> wrote:<br><blockquote style="border-left: 2px solid rgb(16, 16, 255); margin-left: 5px; padding-left: 5px;"><br>From: Hendrik Boom &lt;hendrik@topoi.pooq.com&gt;<br>Subject: Re: [M3devel] UTF-8 TEXT<br>To: m3devel@elegosoft.com<br>Date: Monday, July 2, 2012 11:54<br><br><div class="plainMail">On Mon, Jul 02, 2012 at 11:57:14AM -0400, Tony Hosking wrote:<br>> <br>> On Jul 2, 2012, at 10:50 AM, Rodney Bates wrote:<br>> <br>> > <br>> > <br>> > -Rodney Bates<br>> > <br>> > --- <a ymailto="mailto:antony.hosking@gmail.com" href="/mc/compose?to=antony.hosking@gmail.com">antony.hosking@gmail.com</a> wrote:<br>> > <br>> >> From: Antony Hosking <<a ymailto="mailto:antony.hosking@gmail.com" href="/mc/compose?to=antony.hosking@gmail.com">antony.hosking@gmail.com</a>><br>> >> To: "Rodney M. Bates" <<a ymailto="mailto:rodney_bates@lcwb.coop"
href="/mc/compose?to=rodney_bates@lcwb.coop">rodney_bates@lcwb.coop</a>><br>> >> Cc: "<a ymailto="mailto:m3devel@elegosoft.com" href="/mc/compose?to=m3devel@elegosoft.com">m3devel@elegosoft.com</a>" <<a ymailto="mailto:m3devel@elegosoft.com" href="/mc/compose?to=m3devel@elegosoft.com">m3devel@elegosoft.com</a>><br>> >> Subject: Re: [M3devel] UTF-8 TEXT<br>> >> Date: Thu, 28 Jun 2012 10:37:36 -0400<br>> >> <br>> >> Why not simply say that CHAR is an enumeration representing all of UTF-32?<br>> >> The current definition merely says that CHAR is an enumeration containing *at least* 256 elements.<br>> >> We would need to translate the current Latin-1 literals into UTF-32.<br>> >> And we could simply have a new literal form for Unicode literals.<br>> >> <br>> > This is almost what I would propose to do, with a couple of differences:<br>> > <br>> >
Leave CHAR alone and fix WIDECHAR to handle the entire Unicode space.<br>> > I am sure there is lots of existing code that depends on the implementation<br>> > properties: ORD(FIRST(CHAR))=0, ORD(LAST(CHAR))=255, and BYTESIZE(CHAR)=1.<br>> <br>> Fair enough. Would we leave the encoding of CHAR as ISO-Latin-1? We’d still need translation from ISO-Latin-1 to UTF-8 wouldn’t we?<br>> <br>> > Then I would define, in the language itself, that WIDECHAR is Unicode, not<br>> > UTF-32. Thus ORD(LAST(WIDECHAR))=16_10FFFF. Then I would make it an<br>> > implementation characteristic that BYTESIZE(WIDECHAR)=4.<br>> <br>> I note this text from the Wikipedia entry for UTF-32:<br><br>I had just looked this paragraph up on Wikipedia to post it when I <br>noticed you had already done so.<br><br>> <br>> Though a fixed number of bytes per code point appear convenient, it is <br>> not as useful as
it appears. <br><br>Which is the gist of my objection to implementing TEXT as <br>fixed-width 16, 20, or 32-bit storage units. It wastes space without <br>much gain. (Exception might be made for a few languages that can be <br>efficiently stored in 16 bits but not in UTF-8.)<br><br>> It makes truncation easier but not significantly so compared to UTF-8 <br>> and UTF-16.<br>> It does not make it faster to find a particular offset in the string, <br>> as an "offset" can be measured in the fixed-size code units of any <br>> encoding.<br><br>Exactly why I want character-extraction to be expressible in efficient <br>"offsets" with implementation-independent specifications (though <br>possibly implementation-dependent values). I don't mind if character <br>counts are also made available, as long as it doesn't impose extra <br>overhead on those that don't use them. Operations with offsets that <br>allow one to
extract characters and skip over characters are sufficient <br>for most purposes. The use of efficient offsets is independent of the <br>question of access to individual bytes.<br><br>> It does not make calculating the displayed width of a string easier <br>> except in limited cases, since even with a “fixed width” font there <br>> may be more than one code point per character position (combining <br>> marks) or more than one character position per code point (for example <br>> CJK ideographs). <br>> Combining marks mean editors cannot treat one code point as being the <br>> same as one unit for editing. Editors that limit themselves to <br>> left-to-right languages and precomposed characters can take advantage <br>> of fixed-sized code units, but such editors are unlikely to support <br>> non-BMP characters and thus can work equally well with 16-bit UTF-16 <br>> encoding.<br><br>I'd like to point out that most
string processing doesn't really deal in <br>characters at all, but in terms of words, phrases, symbols, and other <br>linguistic structures that have to be dealt with using parsing. <br>Assembling bytes of UTF-8 into characters is just more parsing, and <br>should be viewed as such.<br><br>For many applications it isn't even necessary to decode UTF-8, because <br>it can be copied without being aware of its character structure.<br>And if the language ascribes special meanings only to some of the first <br>128 characters, these can be unambiguously recognised in UTF-8 without <br>decoding UTF-8 at all. This does argue for having byte access as well.<br><br>> <br>> Does this argue against WIDECHAR=UTF-32? Would we be better off simply saying WIDECHAR=UTF-16 and leaving things as they are? Yes, it would make the definition of WideCharAt a little odd, because the index would be defined in 16-bit units rather than UTF-16
glyph code-points.<br>> <br>> By the way, if we did change WIDECHAR to an enumeration containing 16_110000 elements then the stored (memory) size of WIDECHAR would be 4 bytes given the current CM3 implementation of enumerations, which chooses a (non-PACKED) stored size of 1/2/4/8 bytes depending on the number of elements.<br><br>16-bit WIDECHARs would seem to me to be the worst choice of all, except <br>in the special case that you *know* that all the characters you'll ever <br>have to deal with fit in 16 bits and most of them won't fit in 8.<br><br>I'd use WIDECHAR when I'm dealing with individual <br>characters/UnicodeCodepoints. I'd use TEXT when dealing with strings. <br>Or some custom data structure that can handle text containing strings <br>and other data structures (such as parse trees). <br>Generally, there won't be a lot of WIDECHARS around in a running <br>program, so I don't care much about the few extra
bytes.<br><br>-- hendrik<br><br></div></blockquote></td></tr></table>
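The points made in the thread about byte offsets, and about recognising ASCII without decoding, can be sketched in a few lines. This is only an illustration in Python, not Modula-3 and not a proposed CM3 interface: the helper names seq_len, skip, code_point_at and split_on_ascii are hypothetical, and the input is assumed to be well-formed UTF-8.

```python
def seq_len(lead):
    """Number of bytes in the UTF-8 sequence whose first byte is `lead`."""
    if lead < 0x80:
        return 1            # 0xxxxxxx: one of the first 128 characters
    if lead >> 5 == 0b110:
        return 2            # 110xxxxx
    if lead >> 4 == 0b1110:
        return 3            # 1110xxxx
    if lead >> 3 == 0b11110:
        return 4            # 11110xxx
    raise ValueError("not a UTF-8 lead byte: 0x%02X" % lead)


def skip(buf, off, count):
    """Advance a byte offset over `count` characters without decoding them."""
    for _ in range(count):
        off += seq_len(buf[off])
    return off


def code_point_at(buf, off):
    """Extract the character starting at byte offset `off`.

    Returns (code point, byte offset of the following character).
    """
    n = seq_len(buf[off])
    return ord(buf[off:off + n].decode("utf-8")), off + n


def split_on_ascii(buf, delim):
    """Split on an ASCII delimiter byte without decoding anything.

    This is safe because every byte of a multi-byte UTF-8 sequence is
    0x80 or above, so it can never be mistaken for an ASCII byte.
    """
    if len(delim) != 1 or delim[0] >= 0x80:
        raise ValueError("delimiter must be a single ASCII byte")
    return buf.split(delim)
```

Here the offsets are plain byte positions: their values depend on the encoding, but the operations that produce and consume them (skip a character, extract a character) are specified independently of it, and no character count is ever computed unless the caller asks for one.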