[M3devel] This disgusting TEXT business
Jay
jay.krell at cornell.edu
Sat Dec 20 17:19:49 CET 2008
My opinions..
UTF8 is a hack. It lets people do approximately nothing and claim they have done a lot. UTF8 is popular, but I'm still very skeptical of it.
The best representation for Unicode is fixed size 16 bit characters, as Windows uses "everywhere". All the functions taking 8 bit characters are just thin wrappers over functions dealing with 16 bit characters.
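As a sketch of what I mean by a thin wrapper -- hypothetical function names, and the real Win32 "A" wrappers do more (error handling, longer paths, and so on):

#include <windows.h>

/* Hypothetical A/W pair; the CreateFileA-style wrappers look roughly like this. */
BOOL MyFuncW(const WCHAR *name);

BOOL MyFuncA(const char *name)
{
    WCHAR wide[MAX_PATH];
    /* Convert from the current ANSI code page to 16 bit characters,
       then delegate to the wide version. */
    if (MultiByteToWideChar(CP_ACP, 0, name, -1, wide, MAX_PATH) == 0)
        return FALSE;
    return MyFuncW(wide);
}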
I believe NTFS stores 16 bit strings natively on disk.
And so does FAT -- anything within a limited character set gets a regular old 8 bit 8.3 entry on disk. Anything "long", or with "other characters", or other forms like anything with two dots, is I believe stored as full 16 bit characters. In fact, I believe if you write an installable file system on Win9x, you deal with all 16 bit characters.
What Windows does with 8 bit strings is of course much worse than UTF8. 8 bit characters on Windows are often considered "encoded in the current code page", usually with no record of which code page. I think probably the best thing to do with 8 bit characters is actually to require them to be 7 bits, or even 5 or 6 bits, and conversion between them is just zero extension or 8 bit truncation. UTF8 is progress because it is, like, one universal code page. UTF8 is slightly muddied because Java defines it differently; specifically, I think there is an issue as to how an embedded nul is represented.
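What I mean by conversion being just zero extension or truncation, as a little sketch (made-up names, plain C):

#include <stddef.h>
#include <stdint.h>

/* Widen: each 7 bit character zero-extends to a 16 bit code unit. */
void widen(const char *src, uint16_t *dst, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++)
        dst[i] = (uint16_t)(unsigned char)src[i];
}

/* Narrow: truncate back to 8 bits; only lossless if every value fits. */
void narrow(const uint16_t *src, char *dst, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++)
        dst[i] = (char)(src[i] & 0xFF);
}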
There is also UTF7. I guess that's only useful in transports that aren't 8 bit clean.
Is UTF a "transport" mechanism only, for an encoding to leave data in as much as possible? That is gray of course. "Entering" and "exiting" from "transport" is expensive when done in bulk, and avoided for perf. The best representation for a "string" is a length and an array of 16 bit characters, probably via a separate pointer, not rigth after the length.
Arguably you need a 32 bit character, since Unicode is actually, I believe, about a 21 bit code space (code points go up to U+10FFFF). However here I'm willing to see things the other way. (i.e.: maybe hypocritical)
8 bit characters often suffice and it is probably worth being somewhat "dynamic".
8 bit bytes are also common, but should not be confused with characters and strings. Whatever functions any UTF8 code depends on can be trivially implemented oneself, in portable C or portable Modula-3.
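For example, encoding one code point as UTF8 is a few lines of portable C (a sketch; real code should also reject surrogates):

#include <stddef.h>
#include <stdint.h>

/* Encode one code point (up to U+10FFFF) as UTF8.
   Returns the number of bytes written (1..4), or 0 if out of range. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    } else if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    } else if (cp < 0x10000) {
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    } else if (cp <= 0x10FFFF) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}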
How about.. I realized the Modula-3 code is "dynamic" and can deal with 8 or 16 bit characters. I didn't realize it concatenated strings by keeping around the pointers. This is actually a viable, sometimes faster approach. Such strings are called "ropes" in some contexts -- such as the SGI STL.
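The shape of that is roughly a "rope" node -- hypothetical C, not the actual CM3 TEXT layout:

#include <stddef.h>
#include <stdlib.h>

/* A rope is either a flat run of characters or a concat of two ropes.
   Concat is O(1): allocate one node, copy no characters. */
struct Rope {
    enum { FLAT, CONCAT } kind;
    size_t length;
    union {
        const char *chars;                         /* kind == FLAT */
        struct { struct Rope *left, *right; } cat; /* kind == CONCAT */
    } u;
};

struct Rope *rope_concat(struct Rope *a, struct Rope *b)
{
    struct Rope *r = malloc(sizeof *r);
    if (r == NULL) return NULL;
    r->kind = CONCAT;
    r->length = a->length + b->length;
    r->u.cat.left = a;
    r->u.cat.right = b;
    return r;
}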
I have to read the code, but..two obvious suggestions I am sympathetic to:
1) Leave things mostly alone, but add more functions people can call, like String.Flatten, String.Cleanup, String.Seal, which would walk all the pointers and copy the data into one flat string (see the sketch below). "Seal" is an exaggeration, since you could subsequently change the string. The problem with this approach of course is that any code that is slowed down remains slowed down; you still have to do a little work to get back the perf. However, any code sped up by the current code remains sped up.
2) Always "flatten" upon concat. Remain dynamic in the representation, based on what characters are seen. Once a character above 127 is seen, the string is made wide. Any operation on a wide string that both returns a subset of it, AND has to anyway visit every character, can look for the opportunity to shorten it, IF a separate copy is made. Operations such as taking a prefix or suffix -- which can be made by just bumping the pointer or length, do NOT need to visit the characters and would not opportunistically narrow.
Can taking a suffix like that work, with the garbage collector?
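For suggestion 1, Flatten would walk the pointers once and copy into one contiguous buffer -- a sketch, reusing the hypothetical Rope above:

#include <stdlib.h>
#include <string.h>

/* Copy a rope's characters into dst (which must hold r->length bytes)
   and return the position just past what was written. */
static char *flatten_into(const struct Rope *r, char *dst)
{
    if (r->kind == FLAT) {
        memcpy(dst, r->u.chars, r->length);
        return dst + r->length;
    }
    dst = flatten_into(r->u.cat.left, dst);
    return flatten_into(r->u.cat.right, dst);
}

/* Flatten/"Seal": produce one flat copy; the original rope is unchanged. */
char *rope_flatten(const struct Rope *r)
{
    char *buf = malloc(r->length + 1);
    if (buf == NULL) return NULL;
    *flatten_into(r, buf) = '\0';
    return buf;
}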
Furthermore, the expression: foo & "a" & bar & "b"
should be transformed, by the compiler, into one function call.
If not currently, then "to do". This alone, with nothing else, might help a lot?
The base-most string operation is not, or shall not be, "concat two strings", but rather "concat n strings".
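Something like this, hypothetically: size everything in one pass, allocate once, copy once, instead of n-1 pairwise concats:

#include <stdlib.h>
#include <string.h>

char *concat_n(size_t n, const char *parts[])
{
    size_t total = 0, i;
    char *result, *p;

    for (i = 0; i < n; i++)
        total += strlen(parts[i]);

    result = malloc(total + 1);
    if (result == NULL) return NULL;

    p = result;
    for (i = 0; i < n; i++) {
        size_t len = strlen(parts[i]);
        memcpy(p, parts[i], len);
        p += len;
    }
    *p = '\0';
    return result;
}

So foo & "a" & bar & "b" would become, in spirit, one call like concat_n(4, parts) with parts = { foo, "a", bar, "b" }.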
Hm. I think there's a third good option. Basically again looking to Windows. Windows has no one true string type..while internally Unicode is universal, all string functions are doubled up. For any given function foo, say for example CreateFile, there are actually two functions, CreateFileA and CreateFileW. To make a string literal in C or C++, you either say "foo" for an 8 bit string, or L"foo" for 16 bits. String length -- wcslen and strlen -- these names come from standard C and there are a bunch of them.
pro: no existing code changes meaning
con: no existing code changes meaning -- anyone who needs Unicode needs to change their code, somehow.
There is a static indirection mechanism:

#ifdef UNICODE
  #define TCHAR WCHAR
  #define TEXT(a) L##a
  #define CreateFile CreateFileW
  /* etc. */
#else
  #define TCHAR char
  #define TEXT(a) a
  #define CreateFile CreateFileA
  /* etc. */
#endif
You can write code to be portable either way. This was a dubious idea, because most code only compiles one way or the other. But it does let you migrate slowly -- keep 8 bit strings but write your code with the future in mind, and then later throw the switch. Personally I just hardcode everything WCHAR, CreateFileW, no illusions.
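For illustration, both flavors side by side (a sketch; error handling kept minimal):

#include <windows.h>
#include <tchar.h>

void open_both_ways(void)
{
    /* TCHAR-portable: narrow or wide depending on whether UNICODE is defined. */
    HANDLE h1 = CreateFile(TEXT("log.txt"), GENERIC_READ, 0, NULL,
                           OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);

    /* Hardcoded wide -- the "no illusions" route. */
    HANDLE h2 = CreateFileW(L"log.txt", GENERIC_READ, 0, NULL,
                            OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);

    if (h1 != INVALID_HANDLE_VALUE) CloseHandle(h1);
    if (h2 != INVALID_HANDLE_VALUE) CloseHandle(h2);
}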
Anyway, I bring this up to try to come up with a proposal along the lines of:
- Take everything back to the way it was.
- Offer some new way to support Unicode.
but I'm not sure what the second part is. Is TEXT abstract enough that it can do double duty?
What sort of interop is needed between code using narrow text and code using wide text? Must it be "automatic"?
While I say UTF8 is a hack, I like something almost the same as Dragiša says.
Keep everything as 7 or 8 bit characters. Provide conversion routines.
Make no assumptions internally that the characters are UTF8.
Any internal walking of a string as an array of char would just use individual bytes.
Or a "decoding callback". ?
Reject any text literal that isn't 7 or 8 bit clean.
?
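The "decoding callback" idea, as a hypothetical sketch: the string module walks raw bytes and lets the caller decide how to interpret them as characters:

#include <stddef.h>
#include <stdint.h>

/* Returns one decoded character and sets *consumed to the bytes used;
   a UTF8 decoder, a code-page decoder, or a plain byte reader all fit. */
typedef uint32_t (*DecodeFn)(const unsigned char *bytes, size_t avail,
                             size_t *consumed);

void walk_string(const unsigned char *s, size_t len,
                 DecodeFn decode, void (*visit)(uint32_t ch))
{
    size_t i = 0;
    while (i < len) {
        size_t used = 0;
        uint32_t ch = decode(s + i, len - i, &used);
        if (used == 0) used = 1;  /* skip a bad byte rather than loop forever */
        visit(ch);
        i += used;
    }
}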
I don't know, I guess I'm very on the fence.
Unicode is important and all, interop seems maybe important, and it is hard to get anyone to change existing code. What CM3 did at least was seamless in terms of requiring no code change -- well, er, except the perf change -- and they barred direct access to the characters, which might still be good.
Maybe you can ask for a direct pointer, of a specific type, and if you ask for the wrong type, you get NULL?
??
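Something along these lines, hypothetically -- ask for a pointer of a specific width and get NULL if the text is not stored that way:

#include <stddef.h>
#include <stdint.h>

/* Hypothetical text that is either narrow or wide internally. */
struct Text {
    int             is_wide;
    const char     *narrow;  /* valid when !is_wide */
    const uint16_t *wide;    /* valid when is_wide  */
};

const char *Text_NarrowPtr(const struct Text *t)
{
    return t->is_wide ? NULL : t->narrow;
}

const uint16_t *Text_WidePtr(const struct Text *t)
{
    return t->is_wide ? t->wide : NULL;
}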
Now, this email contains a few sketches but no real code to back it up. :)
- Jay

> From: dragisha at m3w.org
> To: m3devel at elegosoft.com
> Date: Sat, 20 Dec 2008 10:10:45 +0100
> Subject: Re: [M3devel] This disgusting TEXT business
>
> Whole bussiness of mixed TEXT's - concat of TEXT and WIDETEXT is either
> slow, or produces results that make subsequent operations slow - is
> where problem is with this implementation.
>
> IMO, best solution would be to replace internal representation with
> UTF-8. For whom may be concerned with it - make some external widechar
> conversion routines available.
>
> That way - concat would be as fast as it can be made and other
> operations would be realistic - it is how almost everybody does their
> Unicode strings, after all. Almost everybody - excluding mobile industry
> and Microsoft :-), AFAIK.
>
> Compiler changes would make transparent conversion of literals.
> Everything else is already transparent with regard to internal
> representation.
>
> I've sent some UTF-8 routines ages ago to Olaf. IIRC, he was reluctant
> to accept them as he did not believe C base routines were widespread
> enough. GNU world has no such reluctance. Everything is UTF8.
>
> If nobody can fill this job in next few weeks, I will probably have some
> time this winter. No promise, of course, but list will know :-).
>
> dd
>
> On Sat, 2008-12-20 at 19:26 +1100, Tony Hosking wrote:
> > Hmm, are we just victims of poor implementation? Does anyone have the
> > time to improve things? It would be possible to rip out CM3 TEXT and
> > replace with PM3, but we'd lose WIDECHAR and WideText.T with that
> > too. Not sure who that impacts.
> >
> > On 20 Dec 2008, at 18:19, Mika Nystrom wrote:
> >
> > > Hello Modula-3ers,
> > >
> > > I have gone on the record before as not being very impressed by the
> > > Critical Mass implementation of TEXT.
> > >
> -- 
> Dragiša Durić <dragisha at m3w.org>