<html><head></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space; "><br><div><div>On Jul 2, 2012, at 10:50 AM, Rodney Bates wrote:</div><br class="Apple-interchange-newline"><blockquote type="cite"><div><br><br>-Rodney Bates<br><br><blockquote type="cite"></blockquote></div></blockquote><div><blockquote type="cite">--- <a href="mailto:antony.hosking@gmail.com">antony.hosking@gmail.com</a> wrote:<br><br></blockquote></div><blockquote type="cite"><div><blockquote type="cite">From: Antony Hosking <<a href="mailto:antony.hosking@gmail.com">antony.hosking@gmail.com</a>><br>To: "Rodney M. Bates" <<a href="mailto:rodney_bates@lcwb.coop">rodney_bates@lcwb.coop</a>><br>Cc: "<a href="mailto:m3devel@elegosoft.com">m3devel@elegosoft.com</a>" <<a href="mailto:m3devel@elegosoft.com">m3devel@elegosoft.com</a>><br>Subject: Re: [M3devel] UTF-8 TEXT<br>Date: Thu, 28 Jun 2012 10:37:36 -0400<br><br>Why not simply say that CHAR is an enumeration representing all of UTF-32?<br>The current definition merely says that CHAR is an enumeration containing *at least* 256 elements.<br>We would need to translate the current Latin-1 literals into UTF-32.<br>And we could simply have a new literal form for Unicode literals.<br><br></blockquote>This is almost what I would propose to do, with a couple of differences:<br><br>Leave CHAR alone and fix WIDECHAR to handle the entire Unicode space.<br>I am sure there is lots of existing code that depends on the implementation<br>properties: ORD(FIRST(CHAR))=0, ORD(LAST(CHAR))=255, and BYTESIZE(CHAR)=1.<br></div></blockquote><div><br></div><div>Fair enough.  Would we leave the encoding of CHAR as ISO-Latin-1?  We’d still need translation from ISO-Latin-1 to UTF-8 wouldn’t we?</div><br><blockquote type="cite"><div>Then I would define, in the language itself, that WIDECHAR is Unicode, not<br>UTF-32.  Thus ORD(LAST(WIDECHAR))=16_10FFFF. 
Then I would make it an<br>implementation characteristic that BYTESIZE(WIDECHAR)=4.<br></div></blockquote><div><br></div><div>I note this text from the Wikipedia entry for UTF-32:</div><div><br></div></div><blockquote class="webkit-indent-blockquote" style="margin: 0 0 0 40px; border: none; padding: 0px;"><div><div><span class="Apple-style-span" style="font-size: 13px; line-height: 19px; font-family: sans-serif; ">Though a fixed number of bytes per code point appear convenient, it is not as useful as it appears. It makes truncation easier but not significantly so compared to <a href="http://en.wikipedia.org/wiki/UTF-8" title="UTF-8" style="text-decoration: none; color: rgb(6, 69, 173); background-image: none; background-attachment: initial; background-origin: initial; background-clip: initial; background-color: initial; background-position: initial initial; background-repeat: initial initial; ">UTF-8</a> and <a href="http://en.wikipedia.org/wiki/UTF-16" title="UTF-16" style="text-decoration: none; color: rgb(6, 69, 173); background-image: none; background-attachment: initial; background-origin: initial; background-clip: initial; background-color: initial; background-position: initial initial; background-repeat: initial initial; ">UTF-16</a>. It does not make it faster to find a particular offset in the string, as an "offset" can be measured in the fixed-size code units of any encoding. 
It does not make calculating the displayed width of a string easier except in limited cases, since even with a “fixed width” font there may be more than one code point per character position (<a href="http://en.wikipedia.org/wiki/Combining_character" title="Combining character" style="text-decoration: none; color: rgb(6, 69, 173); background-image: none; background-attachment: initial; background-origin: initial; background-clip: initial; background-color: initial; background-position: initial initial; background-repeat: initial initial; ">combining marks</a>) or more than one character position per code point (for example <a href="http://en.wikipedia.org/wiki/CJK" title="CJK" class="mw-redirect" style="text-decoration: none; color: rgb(6, 69, 173); background-image: none; background-attachment: initial; background-origin: initial; background-clip: initial; background-color: initial; background-position: initial initial; background-repeat: initial initial; ">CJK</a> ideographs). Combining marks mean editors cannot treat one code point as being the same as one unit for editing. 
Editors that limit themselves to left-to-right languages and <a href="http://en.wikipedia.org/wiki/Precomposed_character" title="Precomposed character" style="text-decoration: none; color: rgb(6, 69, 173); background-image: none; background-attachment: initial; background-origin: initial; background-clip: initial; background-color: initial; background-position: initial initial; background-repeat: initial initial; ">precomposed characters</a> can take advantage of fixed-sized code units, but such editors are unlikely to support non-BMP characters and thus can work equally well with 16-bit <a href="http://en.wikipedia.org/wiki/UTF-16" title="UTF-16" style="text-decoration: none; color: rgb(6, 69, 173); background-image: none; background-attachment: initial; background-origin: initial; background-clip: initial; background-color: initial; background-position: initial initial; background-repeat: initial initial; ">UTF-16</a> encoding.</span></div></div></blockquote><div><div><br></div><div>Does this argue against WIDECHAR=UTF-32?  Would we be better off simply saying WIDECHAR=UTF-16 and leaving things as they are?  Yes, it would make the definition of WideCharAt a little odd, because the index would be defined in 16-bit units rather than UTF-16 glyph code-points.</div><div><br></div><div>By the way, if we did change WIDECHAR to an enumeration containing 16_110000 elements then the stored (memory) size of WIDECHAR would be 4 bytes given the current CM3 implementation of enumerations, which chooses a (non-PACKED) stored size of 1/2/4/8 bytes depending on the number of elements.</div><br><blockquote type="cite"><div><br>On Jun 27, 2012, at 10:12 PM, Rodney M. Bates wrote:<br><br><blockquote type="cite"><br></blockquote><blockquote type="cite"><br></blockquote><blockquote type="cite">On 06/27/2012 07:32 PM, Antony Hosking wrote:<br></blockquote><blockquote type="cite"><blockquote type="cite">So what do we do about 6-byte UTF-8 code points?  They won't fit in WIDECHAR.  
Surely we should allow accessing a UTF-8 character as a CARDINAL and be done with it?<br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><br></blockquote></blockquote><blockquote type="cite"><br></blockquote><blockquote type="cite">Absolutely.  Except I think a better way is to make WIDECHAR big enough to hold all of<br></blockquote><blockquote type="cite">Unicode.<br></blockquote><blockquote type="cite"><br></blockquote><blockquote type="cite"><blockquote type="cite">Sent from my iPad<br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite">On Jun 27, 2012, at 3:20 PM, "Rodney M. Bates"<<a href="mailto:rodney_bates@lcwb.coop">rodney_bates@lcwb.coop</a>>  wrote:<br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><br></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><br></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">On 06/26/2012 10:30 PM, Hendrik Boom wrote:<br></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">On Tue, Jun 26, 2012 at 04:22:22PM -0400, Coleburn, Randy wrote:<br></blockquote></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">I seem to recall that Rodney did some work a while back relating to TEXT.<br></blockquote></blockquote></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">Rodney, can you weigh in on some of 
this?<br></blockquote></blockquote></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">--Randy Coleburn<br></blockquote></blockquote></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><br></blockquote></blockquote></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">From: Dragiša Durić [mailto:dragisha@m3w.org]<br></blockquote></blockquote></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">Sent: Tuesday, June 26, 2012 12:46 PM<br></blockquote></blockquote></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">To: Jay<br></blockquote></blockquote></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">Cc: m3devel<br></blockquote></blockquote></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">Subject: EXT Re: [M3devel] AND (., 16_ff). Not serious - or so I hope!<br></blockquote></blockquote></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><br></blockquote></blockquote></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">You had idea in other message. 
Store length!<br></blockquote></blockquote></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><br></blockquote></blockquote></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">Another idea - store partial list of indices to character locations. So whatever one does, that list can be used/expanded. Whatever storage issues this makes, they are probably minor as compared to 32bit WIDECHAR for all idea.<br></blockquote></blockquote></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><br></blockquote></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">Most of the time, you don't need explicit integer indexes to character<br></blockquote></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">locations.  What you do need is an operation that fetches a character<br></blockquote></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">given the string and its index (whatever data structure that index is),<br></blockquote></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">and  one that increments the index past that character.  
As long as you<br></blockquote></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">can save an index and use it later on the same string, that's probably<br></blockquote></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">all you ever need.  And with a simple TEXT representation (such as the<br></blockquote></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">obvious array of bytes containing characters of various widths) a byte<br></blockquote></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">index is all you need (note: NOT a character index).  It's easy even to<br></blockquote></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">use TEXT and its integer indices as the data representation, as long as<br></blockquote></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">you use the proper functions to parse the characters and increment the<br></blockquote></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">indices by amounts that might differ from 1.<br></blockquote></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><br></blockquote></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">And if your source code is represented in UTF-8, the representation that<br></blockquote></blockquote></blockquote></blockquote><blockquote 
type="cite"><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">requires little extra compiler effort to parse,  your TEXT strings will<br></blockquote></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">automagically appear in UTF-8.<br></blockquote></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><br></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">The original designers of the language and its libraries have given us<br></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">two different abstractions for handling character strings (in addition<br></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">to plain arrays.)  1) Text, and 2) Wr, Rd, and their cousins.<br></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><br></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">Text is highly general and easy to use.  Concatenations and substrings<br></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">are easy.  
Semantics, to its clients, are value semantics, similar to INTEGER.<br></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">Random access by *character* number is easy and, hopefully, implemented<br></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">with efficiency at least better than O(n).<br></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><br></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">Wr and friends restrict you to sequential access, at least mostly, but<br></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">gain implementation convenience and efficiency as a result.<br></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><br></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">I feel very strongly that we should *not* take away the full generality<br></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">of Text, especially efficient random access, to handle variable-length<br></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">character encodings in strings.  For these, let's make more friends of<br></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">Wr and Rd, which already assume sequential access.  
For example, a<br></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">filter pipe that sequentially reads a Text/Array/stream, applies a UTF-8<br></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">interpretation to its bytes, and delivers a stream of Unicode characters,<br></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">in variables of type WIDECHAR.<br></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><br></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">Text should preserve the abstraction that it's a string of characters,<br></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">generalized as it already is in cm3, to have type WIDECHAR, so they can be any<br></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">Unicode character.  The internal representation should, usually, not be<br></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">of concern.<br></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><br></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">Note that nowhere in Text are character values transferred between<br></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">a Text.T and any form of I/O stream.  
In the Text abstraction, all<br></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">characters go in and out of a Text.T in variables of type CHAR,<br></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">WIDECHAR, and arrays thereof.  IO, etc. is only done in streams,<br></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">e.g, TextWr.  We can easily add new variants of these that encode/decode<br></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">by various rules.<br></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><br></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">Of course, it is still valid to put a string of bytes in a Text.T and<br></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">apply, e.g., UTF-8 interpretation yourself.  
But that's lower-level<br></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">programming, and shouldn't confuse the abstraction.<br></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><br></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><br></blockquote></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">I can see a use for various wide characters -- the things you extract<br></blockquote></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">from a TEXT by parsing bits of it, but none for anything<br></blockquote></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">really new complicated for wide TEXT.<br></blockquote></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><br></blockquote></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">The only confusing thing is that the existing operations for extracting<br></blockquote></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">bytes from TEXT have names that suggest they are extracting characters.<br></blockquote></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><br></blockquote></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><br></blockquote></blockquote></blockquote><blockquote 
type="cite"><blockquote type="cite"><blockquote type="cite">I think it's more than a suggestion.  I think the abstraction clearly<br></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">considers them characters.  And it should stay that way.  If you want,<br></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">at a higher level of code, to treat them as bytes, that's fine, but the<br></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">abstraction continues to view them as characters (which only you, the<br></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">client, know is not really so.)<br></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><br></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><blockquote type="cite">-- Hendrik<br></blockquote></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><blockquote type="cite"><br></blockquote></blockquote></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><br></blockquote></blockquote><br><br><br></div></blockquote></div><br>
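<div>For reference, Hendrik's point quoted above (fetch a character given the string and a byte index, then advance the index past that character) can be sketched roughly as follows. This is only an illustration in Python, not Modula-3 and not the Text interface; the names decode_at and code_points are hypothetical, and the sketch assumes the modern four-byte limit of UTF-8 (code points up to 16_10FFFF). It does not reject overlong encodings or surrogate code points.</div>
<pre>
```python
def decode_at(data: bytes, i: int) -> tuple[int, int]:
    """Decode the UTF-8 sequence that starts at byte index i.

    Returns (code_point, next_index), where next_index is the byte index
    just past the decoded character.
    """
    lead = data[i]
    if lead >= 0xF8:        # 11111xxx: never valid in UTF-8
        raise ValueError(f"invalid lead byte at index {i}")
    elif lead >= 0xF0:      # 11110xxx: three continuation bytes follow
        need, cp = 3, lead & 0x07
    elif lead >= 0xE0:      # 1110xxxx: two continuation bytes follow
        need, cp = 2, lead & 0x0F
    elif lead >= 0xC0:      # 110xxxxx: one continuation byte follows
        need, cp = 1, lead & 0x1F
    elif lead >= 0x80:      # 10xxxxxx: a continuation byte cannot start a character
        raise ValueError(f"unexpected continuation byte at index {i}")
    else:                   # 0xxxxxxx: ASCII, a single byte
        return lead, i + 1
    if i + need >= len(data):
        raise ValueError(f"truncated sequence at index {i}")
    for b in data[i + 1 : i + 1 + need]:
        if (b & 0xC0) != 0x80:
            raise ValueError(f"bad continuation byte after index {i}")
        cp = cp * 64 + (b & 0x3F)   # append the six payload bits
    if cp > 0x10FFFF:
        raise ValueError(f"code point beyond U+10FFFF at index {i}")
    return cp, i + 1 + need


def code_points(data: bytes) -> list[int]:
    """Walk the byte string sequentially, one character at a time."""
    out = []
    i = 0
    while i != len(data):
        cp, i = decode_at(data, i)   # the index advances by 1 to 4 bytes
        out.append(cp)
    return out
```
</pre>
<div>The index handed back by decode_at is a byte index, not a character index, so a caller can save it and resume later on the same string, but cannot use it to jump to the n-th character without scanning.</div>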
<br></body></html>