My opinions..

UTF8 is a hack. It lets people do approximately nothing and claim they have done a lot. UTF8 is popular, but I'm still very skeptical of it.

The best representation for Unicode is fixed-size 16 bit characters, as Windows uses "everywhere". All the functions taking 8 bit characters are just thin wrappers over functions dealing with 16 bit characters.

I believe NTFS stores 16 bit strings natively on disk.

And so does FAT -- anything within a limited character set gets a regular old 8 bit 8.3 entry on disk. Anything "long", or with "other characters", or other forms like anything with two dots, is, I believe, stored as full 16 bit characters. In fact, I believe if you write an installable file system on Win9x, you deal entirely with 16 bit characters.

What Windows does with 8 bit strings is of course much worse than UTF8. 8 bit characters on Windows are often considered "encoded in the current code page", usually with no record of which code page. I think probably the best thing to do with 8 bit characters is actually to require them to be 7 bits. Or even 5 or 6 bits, where conversion between them is just zero extension or truncation.

UTF8 is progress because it is, like, one universal code page.

UTF8 is slightly muddied because Java defines it differently. Specifically I think there is an issue as to how to represent an embedded nul.
There is also UTF7. I guess that's only useful in transports that aren't 8 bit clean.

Is UTF a "transport" mechanism only, or an encoding to leave data in as much as possible? That is gray, of course. "Entering" and "exiting" the "transport" form is expensive when done in bulk, and avoided for perf.

The best representation for a "string" is a length and an array of 16 bit characters, probably via a separate pointer, not right after the length.
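Something like this, as a rough C sketch of the layout I mean (the names are invented, not from any existing library):

  #include <stddef.h>
  #include <stdint.h>

  /* A string is a length plus a separate pointer to an array of 16 bit
     characters -- the characters are not stored inline after the length. */
  typedef struct String16 {
      size_t    length;  /* number of 16 bit characters, not bytes */
      uint16_t *chars;   /* separately allocated character array */
  } String16;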
Arguably you need a 32 bit character, since Unicode is, I believe, actually a 21 bit encoding. However, here I'm willing to see things the other way. (i.e. maybe hypocritical)

8 bit characters often suffice and it is probably worth being somewhat "dynamic".

8 bit bytes are also common, but should not be confused with characters and strings.

Whatever functions any UTF8 code depends on can be trivially implemented oneself, in portable C or portable Modula-3.

How about..

I realized the Modula-3 code is "dynamic" and can deal with 8 or 16 bit characters. I didn't realize it concatenates strings by keeping around the pointers. This is actually a viable, sometimes faster, approach. These are called "ropes" in some contexts -- such as the SGI STL.
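Roughly, concatenation by keeping the pointers around looks like this (a made-up C sketch of the rope idea, not the actual CM3 or SGI code):

  #include <stddef.h>
  #include <stdlib.h>

  /* A rope is either a leaf holding characters or a concat node pointing at
     two sub-ropes. Lengths are cached so asking for the length stays O(1). */
  typedef struct Rope {
      struct Rope *left, *right;  /* non-NULL for a concat node */
      const char  *flat;          /* non-NULL for a leaf with actual characters */
      size_t       length;        /* total length of this subtree */
  } Rope;

  /* Concat is O(1): allocate a small node, copy no characters. */
  Rope *Rope_Concat(Rope *a, Rope *b)
  {
      Rope *r = malloc(sizeof *r);
      if (r == NULL) return NULL;
      r->left = a;
      r->right = b;
      r->flat = NULL;
      r->length = a->length + b->length;
      return r;
  }

The cost shows up later: indexing or iterating has to walk the tree of pointers.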
I have to read the code, but..two obvious suggestions I am sympathetic to:

1) Leave things mostly alone, but add more functions people can call -- like String.Flatten, String.Cleanup, String.Seal. These would walk all the pointers and copy the data into one flat string. "Seal" is an exaggeration, since you could subsequently change the string. The problem with this approach, of course, is that any code that is slowed down remains slowed down; you still have to do a little work to get back the perf. However, any code sped up by the current code remains sped up.
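A hypothetical flatten, in C, over the Rope sketch above (just an illustration of "walk the pointers, copy into one flat string" -- not the real TEXT code):

  #include <stdlib.h>
  #include <string.h>

  /* Copy a rope's characters, in order, into a contiguous buffer. */
  static void CopyInto(const Rope *r, char **out)
  {
      if (r->flat != NULL) {
          memcpy(*out, r->flat, r->length);
          *out += r->length;
      } else {
          CopyInto(r->left, out);
          CopyInto(r->right, out);
      }
  }

  /* "Flatten": produce one flat, NUL-terminated copy of the whole rope. */
  char *Rope_Flatten(const Rope *r)
  {
      char *buf = malloc(r->length + 1);
      char *p = buf;
      if (buf == NULL) return NULL;
      CopyInto(r, &p);
      buf[r->length] = '\0';
      return buf;
  }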
<BR> 2) Always "flatten" upon concat.<BR> Remain dynamic in the representation, based on what characters are seen.<BR> Once a character above 127 is seen, the string is made wide.<BR> Any operation on a wide string that both returns a subset of it, AND has to<BR> anyway visit every character, can look for the opportunity to shorten it,<BR> IF a separate copy is made.<BR> Operations such as taking a prefix or suffix -- which can be made by just<BR> bumping the pointer or length, do NOT need to visit the characters and would not<BR> opportunistically narrow.<BR>
Can taking a suffix like that work, with the garbage collector?<BR>
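A rough C sketch of option 2, with invented types and no error handling, just to show the "flatten on concat, widen only when a character above 127 is seen" shape:

  #include <stdbool.h>
  #include <stdint.h>
  #include <stdlib.h>

  /* Stored narrow (8 bit) until some character needs to be wide. */
  typedef struct DynString {
      size_t    length;
      bool      wide;
      uint8_t  *narrow;  /* valid when !wide */
      uint16_t *wide16;  /* valid when wide */
  } DynString;

  static uint16_t CharAt(const DynString *s, size_t i)
  {
      return s->wide ? s->wide16[i] : s->narrow[i];  /* zero extension */
  }

  /* Concat always flattens; the result is wide only if some character is > 127. */
  DynString *Dyn_Concat(const DynString *a, const DynString *b)
  {
      size_t n = a->length + b->length, i;
      bool need_wide = false;
      for (i = 0; i < n && !need_wide; i++) {
          uint16_t c = i < a->length ? CharAt(a, i) : CharAt(b, i - a->length);
          need_wide = c > 127;
      }
      DynString *r = calloc(1, sizeof *r);
      r->length = n;
      r->wide = need_wide;
      if (need_wide)
          r->wide16 = malloc(n * sizeof *r->wide16);
      else
          r->narrow = malloc(n);
      for (i = 0; i < n; i++) {
          uint16_t c = i < a->length ? CharAt(a, i) : CharAt(b, i - a->length);
          if (need_wide) r->wide16[i] = c; else r->narrow[i] = (uint8_t)c;
      }
      return r;
  }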
Furthermore, the expression:

  foo & "a" & bar & "b"

should be transformed, by the compiler, into one function call. If it isn't currently, then that's a "to do". This alone, with nothing else, might help a lot?

The base-most string operation is not, or shall not be, "concat two strings", but rather "concat n strings".
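In C terms, the compiler would lower foo & "a" & bar & "b" into a single call to something like this (an invented helper, shown with plain 8 bit strings for brevity):

  #include <stdarg.h>
  #include <stdlib.h>
  #include <string.h>

  /* Concat n strings in one call: one pass to size, one allocation, one pass to copy. */
  char *ConcatN(size_t n, ...)
  {
      va_list ap;
      size_t total = 0, i;
      char *result, *p;

      va_start(ap, n);
      for (i = 0; i < n; i++)
          total += strlen(va_arg(ap, const char *));
      va_end(ap);

      result = malloc(total + 1);
      if (result == NULL) return NULL;

      p = result;
      va_start(ap, n);
      for (i = 0; i < n; i++) {
          const char *s = va_arg(ap, const char *);
          size_t len = strlen(s);
          memcpy(p, s, len);
          p += len;
      }
      va_end(ap);
      *p = '\0';
      return result;
  }

  /* foo & "a" & bar & "b"  ==>  ConcatN(4, foo, "a", bar, "b") */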
Hm. I think there's a third good option, basically again looking to Windows. Windows has no one true string type: while internally Unicode is universal, all string functions are doubled up. For any given function foo -- say, for example, CreateFile -- there are actually two functions, CreateFileA and CreateFileW.

To make a string literal in C or C++, you either say "foo" for an 8 bit string, or L"foo" for 16 bits. String length is wcslen or strlen -- these names come from standard C, and there are a bunch of them.
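The "thin wrapper" pattern is roughly this (my own sketch of how an A function can forward to its W twin, not the actual Windows source):

  #include <windows.h>

  /* Hypothetical ...A wrapper: convert the 8 bit name using the current
     code page (CP_ACP), then call the 16 bit function that does the work. */
  HANDLE MyCreateFileA(const char *name, DWORD access, DWORD share,
                       LPSECURITY_ATTRIBUTES sa, DWORD disposition,
                       DWORD flags, HANDLE template_file)
  {
      WCHAR wide[MAX_PATH];
      if (MultiByteToWideChar(CP_ACP, 0, name, -1, wide, MAX_PATH) == 0)
          return INVALID_HANDLE_VALUE;
      return CreateFileW(wide, access, share, sa, disposition,
                         flags, template_file);
  }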
pro: no existing code changes meaning
con: no existing code changes meaning
Anyone who needs Unicode needs to change their code, somehow.

There is a static indirection mechanism:

  #ifdef UNICODE
  #define TCHAR WCHAR
  #define TEXT(a) L ## a
  #define CreateFile CreateFileW
  /* etc. */
  #else
  #define TCHAR char
  #define TEXT(a) a
  #define CreateFile CreateFileA
  /* etc. */
  #endif

You can write code to be portable either way. This was a dubious idea, because most code only compiles one way or the other.

But it does let you migrate slowly -- keep 8 bit strings but write your code with the future in mind, and then later throw the switch. Personally I just hardcode everything WCHAR, CreateFileW, no illusions.
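For example, code written against the indirection compiles as either 8 bit or 16 bit depending on whether UNICODE is defined (standard Windows usage, shown only for illustration):

  #include <windows.h>

  void OpenLog(void)
  {
      /* TEXT() picks "..." or L"..." at compile time; CreateFile expands to
         CreateFileA or CreateFileW. */
      HANDLE h = CreateFile(TEXT("log.txt"), GENERIC_WRITE, 0, NULL,
                            CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL);
      if (h != INVALID_HANDLE_VALUE)
          CloseHandle(h);
  }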
Anyway, I bring this up to try to come up with a proposal along the lines of:

- Take everything back to the way it was.
- Offer some new way to support Unicode.

but I'm not sure what the second part is. Is TEXT abstract enough that it can do double duty?

What sort of interop is needed between code using narrow text and code using wide text? Must it be "automatic"?
While I say UTF8 is a hack, I like something almost the same as what Dragiša suggests:
Keep everything as 7 or 8 bit characters. Provide conversion routines.
Make no assumptions internally that the characters are UTF8.
Any internal walking of a string as an array of char would just use individual bytes.
Or a "decoding callback"?
Reject any text literal that isn't 7 or 8 bit clean?
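The "decoding callback" idea might look something like this in C (names invented; the point is that the string code walks raw bytes and leaves interpretation to the caller):

  #include <stddef.h>
  #include <stdint.h>

  /* A decoder reads bytes at s[0..len) and returns how many it consumed,
     writing the decoded character to *out; it returns 0 on malformed input. */
  typedef size_t (*Decoder)(const uint8_t *s, size_t len, uint32_t *out);

  /* Walk a byte string, invoking visit() once per decoded character. */
  void ForEachChar(const uint8_t *s, size_t len, Decoder decode,
                   void (*visit)(uint32_t ch, void *ctx), void *ctx)
  {
      while (len > 0) {
          uint32_t ch;
          size_t used = decode(s, len, &ch);
          if (used == 0) { ch = *s; used = 1; }  /* fall back to the raw byte */
          visit(ch, ctx);
          s += used;
          len -= used;
      }
  }

  /* Trivial decoder: every byte is its own character. A UTF8 decoder could
     be dropped in instead, without the string code ever assuming UTF8. */
  static size_t DecodeByte(const uint8_t *s, size_t len, uint32_t *out)
  {
      (void)len;
      *out = *s;
      return 1;
  }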
I don't know, I guess I'm very on the fence. Unicode is important and all, interop seems maybe important, and it is hard to get anyone to change existing code. What CM3 did at least was seamless in terms of requiring no code change -- well, er, apart from the perf change -- and they barred direct access to the characters, which might still be good. Maybe you can ask for a direct pointer, of a specific type, and if you ask for the wrong type, you get NULL?

??
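That last idea, as a small C sketch (types and names invented for illustration):

  #include <stddef.h>
  #include <stdint.h>

  /* A text is stored either narrow or wide; asking for a direct pointer of
     the wrong type yields NULL, so callers can't misread the representation. */
  typedef struct Text {
      int             is_wide;
      size_t          length;
      const char     *chars8;   /* valid when !is_wide */
      const uint16_t *chars16;  /* valid when is_wide */
  } Text;

  const char *Text_Get8(const Text *t)
  {
      return t->is_wide ? NULL : t->chars8;
  }

  const uint16_t *Text_Get16(const Text *t)
  {
      return t->is_wide ? t->chars16 : NULL;
  }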
Now, this email contains sketches but no real code to back any of it up. :)
- Jay

> From: dragisha@m3w.org
> To: m3devel@elegosoft.com
> Date: Sat, 20 Dec 2008 10:10:45 +0100
> Subject: Re: [M3devel] This disgusting TEXT business
>
> Whole bussiness of mixed TEXT's - concat of TEXT and WIDETEXT is either
> slow, or produces results that make subsequent operations slow - is
> where problem is with this implementation.
>
> IMO, best solution would be to replace internal representation with
> UTF-8. For whom may be concerned with it - make some external widechar
> conversion routines available.
>
> That way - concat would be as fast as it can be made and other
> operations would be realistic - it is how almost everybody does their
> Unicode strings, after all. Almost everybody - excluding mobile industry
> and Microsoft :-), AFAIK.
>
> Compiler changes would make transparent conversion of literals.
> Everything else is already transparent with regard to internal
> representation.
>
> I've sent some UTF-8 routines ages ago to Olaf. IIRC, he was reluctant
> to accept them as he did not believe C base routines were widespread
> enough. GNU world has no such reluctance. Everything is UTF8.
>
> If nobody can fill this job in next few weeks, I will probably have some
> time this winter. No promise, of course, but list will know :-).
>
> dd
>
> On Sat, 2008-12-20 at 19:26 +1100, Tony Hosking wrote:
> > Hmm, are we just victims of poor implementation? Does anyone have the
> > time to improve things? It would be possible to rip out CM3 TEXT and
> > replace with PM3, but we'd lose WIDECHAR and WideText.T with that
> > too. Not sure who that impacts.
> >
> > On 20 Dec 2008, at 18:19, Mika Nystrom wrote:
> >
> > > Hello Modula-3ers,
> > >
> > > I have gone on the record before as not being very impressed by the
> > > Critical Mass implementation of TEXT.
> > >
>
> --
> Dragiša Durić <dragisha@m3w.org>