[M3devel] This disgusting TEXT business
hendrik at topoi.pooq.com
Sun Dec 21 21:40:03 CET 2008
On Sat, Dec 20, 2008 at 10:10:45AM +0100, Dragiša Durić wrote:
> The whole business of mixed TEXTs - concatenation of TEXT and WIDETEXT
> is either slow, or produces results that make subsequent operations
> slow - is where the problem lies with this implementation.
>
> IMO, the best solution would be to replace the internal representation
> with UTF-8. For whoever may be concerned with it - make some external
> widechar conversion routines available.
UTF-8 is indeed designed so that a lot of operations can be performed
directly on the bytes of UTF-8. For example, if you sort UTF-8 strings
based on the 8-bit unsigned values of their bytes, you get the same
order as if you sorted them based on the integer values of their Unicode
characters. As for splitting strings on single-byte delimiters such as
spaces, you can just process the string as a byte stream. If you
want to find a character that's, say, three UTF-8 bytes long, you can do
an ordinary byte-string search for a three-byte string -- UTF-8
characters are self-delimiting, so such a search can never match in the
middle of some other character.
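For instance (a minimal sketch in plain C -- the text and the character
searched for are just my own illustration), finding a three-byte
character really is nothing more than strstr on the raw bytes:

    /* Finding a multi-byte UTF-8 character with an ordinary byte-string
     * search.  Lead bytes and continuation bytes occupy disjoint ranges,
     * so strstr cannot match in the middle of another character, and a
     * UTF-8 string has no embedded null bytes, so the usual
     * null-terminated C string functions work unchanged. */
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        const char *text = "price: 42\xE2\x82\xAC today";
        const char *euro = "\xE2\x82\xAC";     /* U+20AC, three bytes */

        const char *hit = strstr(text, euro);  /* plain byte search */
        if (hit != NULL)
            printf("found at byte offset %ld\n", (long)(hit - text));
        return 0;
    }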
>
> That way - concat would be as fast as it can be made, and other
> operations would have realistic cost - it is how almost everybody does
> their Unicode strings, after all. Almost everybody - excluding the
> mobile industry and Microsoft :-), AFAIK.
>
> Compiler changes would make the conversion of literals transparent.
I worked on a C compiler long ago, and read the standard in great detail
to determine that quoted strings contained characters, not bytes. As a
result, the compiler we worked on had a bug when it came to handling
Korean characters in a Korean environment -- one of the Korean
two-byte characters happened to contain a null byte, and the compiler
treated it internally as end-of-string. Of course, we fixed that,
using locale-dependent character-parsing routines.
Now UTF-8 doesn't even have that problem -- the only character whose
encoding contains a null byte is the null character itself, and it is
encoded as a single null byte.
(Java messes this up by using a different encoding for the null
character -- the overlong two-byte sequence 0xC0 0x80 -- so that they
can treat their compactly-encoded strings as being null-terminated.
But they don't claim to use UTF-8 either, though they do almost use it.
The encoding of the null character they use is such that almost any
UTF-8 DEcoder will decode it as a NULL character, unless it is
specifically designed to reject it as an error condition. Just look at
the high-order bits of the bytes and do your shifts and you'll be OK.)
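To make that concrete, here is a minimal sketch (my own illustration,
not Java's actual code) of a lenient decoder that only looks at the
high-order bits and shifts. Fed Java's modified encoding of NUL, it
happily yields U+0000:

    #include <stdio.h>

    /* Decode one code point starting at p, store it in *cp, and return
     * the number of bytes consumed (0 on a lead byte this sketch does
     * not handle).  No overlong-form check, on purpose. */
    static int decode_lenient(const unsigned char *p, unsigned long *cp)
    {
        if (p[0] < 0x80) { *cp = p[0]; return 1; }           /* 0xxxxxxx */
        if ((p[0] & 0xE0) == 0xC0) {                         /* 110xxxxx */
            *cp = ((unsigned long)(p[0] & 0x1F) << 6) | (p[1] & 0x3F);
            return 2;
        }
        if ((p[0] & 0xF0) == 0xE0) {                         /* 1110xxxx */
            *cp = ((unsigned long)(p[0] & 0x0F) << 12)
                | ((unsigned long)(p[1] & 0x3F) << 6) | (p[2] & 0x3F);
            return 3;
        }
        return 0;  /* 4-byte and malformed cases omitted in this sketch */
    }

    int main(void)
    {
        const unsigned char java_nul[] = { 0xC0, 0x80 };  /* Java's NUL */
        unsigned long cp;
        if (decode_lenient(java_nul, &cp) == 2)
            printf("decoded U+%04lX\n", cp);              /* U+0000 */
        return 0;
    }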
I routinely use UTF-8 as the normal character code in all the software I
write, and I almost never have an occasion when library-provided
encoding and decoding functions are of much use. If I care much about
the syntax, I'm using a parser anyway, and a parser can just as easily
parse bytes with the high bit set as bytes without.
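For example (again just a sketch of the idea, not code from any real
parser), a byte-level scanner can treat any byte with the high bit set
as an ordinary word character, with no decoding at all:

    #include <stdio.h>

    /* A word byte is an ASCII letter or any piece of a UTF-8
     * multi-byte character (high bit set). */
    static int is_word_byte(unsigned char c)
    {
        return (c >= 0x80)
            || (c >= 'a' && c <= 'z')
            || (c >= 'A' && c <= 'Z');
    }

    int main(void)
    {
        const char *input = "caf\xC3\xA9 au lait";  /* "cafe" + U+00E9 */
        const unsigned char *p = (const unsigned char *)input;

        while (*p) {
            if (is_word_byte(*p)) {
                const unsigned char *start = p;
                while (is_word_byte(*p)) p++;
                printf("word: %.*s\n", (int)(p - start),
                       (const char *)start);
            } else {
                p++;                    /* skip spaces and punctuation */
            }
        }
        return 0;
    }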
-- hendrik
> Everything else is already transparent with regard to internal
> representation.
>
> I sent some UTF-8 routines to Olaf ages ago. IIRC, he was reluctant to
> accept them, as he did not believe C base routines were widespread
> enough. The GNU world has no such reluctance. Everything is UTF-8.
>
> If nobody can fill this job in the next few weeks, I will probably
> have some time this winter. No promise, of course, but the list will
> know :-).
What's the job in question? Code that handles UTF-8? New data
structures just for UTF-8 strings? Or code that just processes
ordinary strings and, where indicated, does special processing on bytes
with the high bit set?
-- hendrik