[M3devel] This disgusting TEXT business

hendrik at topoi.pooq.com
Sun Dec 21 21:40:03 CET 2008


On Sat, Dec 20, 2008 at 10:10:45AM +0100, Dragiša Durić wrote:
> The whole business of mixed TEXTs - concatenation of a TEXT and a
> WIDETEXT is either slow, or produces results that make subsequent
> operations slow - is where the problem is with this implementation.
> 
> IMO, the best solution would be to replace the internal representation
> with UTF-8. For whoever may be concerned with it - make some external
> widechar conversion routines available.

UTF-8 is indeed designed so that a lot of operations can be performed 
directly on the bytes of UTF-8.  For example, if you sort UTF-8 strings 
based on the 8-bit unsigned values of their bytes, you get the same 
order as if you sorted them based on the integer values of their Unicode 
characters.  As for delimiting strings on characters in them, such as 
single-byte spaces, you can just process the string as a byte stream.  
If you want to find a character whose encoding is, say, three UTF-8 
bytes long, you can do an ordinary byte-string search for a three-byte 
string -- UTF-8 has self-delimiting characters.
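
To make that concrete, here is a minimal sketch in C -- the strings are 
made-up illustrations, nothing from any real library.  The search 
routine never needs to know it is looking at UTF-8, because no 
multi-byte character can begin in the middle of another one:

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        /* "é" is the two-byte UTF-8 sequence 0xC3 0xA9.  Lead bytes and
           continuation bytes have distinct high-order bit patterns, so
           a plain byte search cannot match mid-character. */
        const char *haystack = "r\xC3\xA9sum\xC3\xA9";   /* "résumé" */
        const char *needle   = "\xC3\xA9";               /* "é" */

        const char *hit = strstr(haystack, needle);
        if (hit != NULL)
            printf("found at byte offset %d\n", (int)(hit - haystack));
        return 0;
    }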

> 
> That way - concatenation would be as fast as it can be made and other
> operations would be realistic - that is how almost everybody does their
> Unicode strings, after all. Almost everybody - excluding the mobile
> industry and Microsoft :-), AFAIK.
> 
> Compiler changes would make transparent conversion of literals.

I worked on a C compiler long ago, and read the standard in great detail 
to determine that quoted strings contained characters, not bytes.  As a 
result, the compiler we worked on had a bug when it came to handling 
Korean characters in a Korean environment -- one of the Korean two-byte 
characters happened to contain a null byte, and the compiler treated it 
internally as end-of-string.  Of course, we fixed that, using 
locale-dependent character-parsing routines.

Now UTF-8 doesn't even have that problem -- the only encoded character 
containing a null byte is the null character itself, and its encoding is 
a single null byte.

(Java messes this up by using a different encoding for the null 
character -- the overlong two-byte sequence 0xC0 0x80 -- so that it can 
treat its compactly-encoded strings as being null-terminated.  But Java 
doesn't claim to use UTF-8 either, though it does almost use it.  The 
encoding of the null character it uses is such that almost any UTF-8 
DEcoder will decode it as a null character, unless it is specifically 
designed to reject overlong sequences as an error condition.  Just look 
at the high-order bits of the bytes and do your shifts and you'll be 
OK.)
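
A minimal decoder sketch in C shows why -- it is deliberately lenient, 
assumes otherwise well-formed input, and just shifts on the high-order 
bits, so Java's two-byte null comes out as U+0000:

    #include <stdio.h>

    /* Decode one UTF-8 sequence starting at s into *cp and return the
       number of bytes consumed.  No overlong-sequence check, so the
       overlong 0xC0 0x80 decodes to U+0000 instead of being rejected. */
    static int decode_utf8(const unsigned char *s, unsigned long *cp)
    {
        if (s[0] < 0x80) {                       /* 0xxxxxxx */
            *cp = s[0];
            return 1;
        }
        if ((s[0] & 0xE0) == 0xC0) {             /* 110xxxxx 10xxxxxx */
            *cp = ((s[0] & 0x1FUL) << 6) | (s[1] & 0x3F);
            return 2;
        }
        if ((s[0] & 0xF0) == 0xE0) {             /* 1110xxxx, 2 trailers */
            *cp = ((s[0] & 0x0FUL) << 12) | ((s[1] & 0x3FUL) << 6)
                | (s[2] & 0x3F);
            return 3;
        }
        *cp = ((s[0] & 0x07UL) << 18) | ((s[1] & 0x3FUL) << 12)
            | ((s[2] & 0x3FUL) << 6) | (s[3] & 0x3F);  /* 4 bytes */
        return 4;
    }

    int main(void)
    {
        const unsigned char java_nul[] = { 0xC0, 0x80 };
        unsigned long cp;
        int n = decode_utf8(java_nul, &cp);
        printf("decoded U+%04lX from %d bytes\n", cp, n);  /* U+0000, 2 */
        return 0;
    }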

I routinely use UTF-8 as the normal character code in all the software I 
write, and I almost never have an occasion when library-provided 
encoding and decoding functions are of much use.  If I care much about 
the syntax, I'm using a parser anyway, and a parser can just as easily 
parse bytes with the high bit set as bytes without.
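
For instance, a trivial byte-level tokenizer in C -- an illustrative 
sketch, not anything from the M3 libraries -- can split on ASCII 
whitespace without ever decoding: every byte of a multi-byte character 
has its high bit set, so no such byte can collide with the delimiters.

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        char buf[] = "caf\xC3\xA9 au lait";    /* "café au lait" */

        /* strtok works on bytes; 0xC3 and 0xA9 are not in " \t", so
           the multi-byte character passes through untouched. */
        for (char *tok = strtok(buf, " \t"); tok != NULL;
             tok = strtok(NULL, " \t"))
            printf("token: %s\n", tok);
        return 0;
    }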

-- hendrik

> Everything else is already transparent with regard to internal
> representation.
> 
> I sent some UTF-8 routines to Olaf ages ago. IIRC, he was reluctant
> to accept them as he did not believe C base routines were widespread
> enough. The GNU world has no such reluctance. Everything is UTF-8.
> 
> If nobody can fill this job in the next few weeks, I will probably have
> some time this winter. No promise, of course, but the list will know :-).

What's the job in question?  Code that handles UTF-8?  New data 
structures just for UTF-8 strings?  Or code that just processes 
ordinary strings and, when indicated, does special processing on bytes 
with the high bit set?

-- hendrik


