[M3devel] This disgusting TEXT business

Roland Illig roland.illig at gmx.de
Sat Dec 20 19:38:01 CET 2008


Dragiša Durić wrote:
> IMO, the best solution would be to replace the internal representation
> with UTF-8. For anyone concerned about it - make some external widechar
> conversion routines available.
> 
> That way - concat would be as fast as it can be made and other
> operations would be realistic - it is how almost everybody does their
> Unicode strings, after all. Almost everybody - excluding the mobile
> industry and Microsoft :-), AFAIK.

I object. UTF-8 is a good external representation format, but in memory,
it is far more efficient to let all characters of a string have the same
size. Java, for example, has defined a character to be "what fits into
16 bits". That's good enough in many situations. One situation where it
fails is when you want to represent Domino Tiles or Mahjong Tiles as
characters, because those blocks lie above U+FFFF.
<http://www.unicode.org/Public/UNIDATA/Blocks.txt>
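
To illustrate the 16-bit limit, here is a minimal Python sketch (Python is
used only for illustration; utf16_units is a hypothetical helper name, not
part of any Modula-3 or Java library). It counts how many 16-bit code units
a single character needs when encoded as UTF-16:

```python
# Illustrative sketch only; utf16_units is a hypothetical helper.
def utf16_units(ch: str) -> int:
    """Return the number of 16-bit code units needed to encode ch as UTF-16."""
    # "utf-16-le" writes no byte order mark, so every code unit is 2 bytes.
    return len(ch.encode("utf-16-le")) // 2

LATIN_A = "A"                      # U+0041, inside the 16-bit range
MAHJONG_RED_DRAGON = "\U0001F004"  # a code point from the Mahjong Tiles block
DOMINO_TILE = "\U0001F030"         # first code point of the Domino Tiles block

for ch in (LATIN_A, MAHJONG_RED_DRAGON, DOMINO_TILE):
    print(f"U+{ord(ch):04X} needs {utf16_units(ch)} code unit(s)")
```

The tile characters each need two code units, so a "character" type that is
exactly 16 bits wide cannot hold one of them on its own.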

To give you some arguments against UTF-8 as in-memory representation:

* How fast can you calculate the length of a string? (By length I mean
the number of characters, not the memory size needed for storing them.)

* Where does substring(5) start? Where does substring(10000) start, and
how long does it take to find out?
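
Both questions can be made concrete with a small Python sketch that works on
raw UTF-8 bytes. The names utf8_length and utf8_offset are hypothetical, and
the code assumes the input is well-formed UTF-8. Each function has to walk
the bytes from the start, so the cost grows with the position asked for,
whereas a fixed-width representation answers both with simple arithmetic:

```python
# Illustrative sketch only; assumes buf holds well-formed UTF-8.
def utf8_length(buf: bytes) -> int:
    """Count characters (code points) by skipping continuation bytes."""
    # Continuation bytes have the bit pattern 10xxxxxx.
    return sum(1 for b in buf if (b & 0xC0) != 0x80)

def utf8_offset(buf: bytes, index: int) -> int:
    """Return the byte offset at which character number `index` starts."""
    seen = 0
    for pos, b in enumerate(buf):
        if (b & 0xC0) != 0x80:      # first byte of a character
            if seen == index:
                return pos
            seen += 1
    if seen == index:               # position just past the last character
        return len(buf)
    raise IndexError(index)

buf = "Dragiša Durić".encode("utf-8")
print(len(buf), utf8_length(buf))   # byte count, character count
print(utf8_offset(buf, 6))          # where a substring starting at 6 begins
```

For this sample the byte count and the character count differ, and the
substring starting at character 6 begins at byte 7 because "š" occupies two
bytes.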

An argument for the UTF-8 representation is that it saves memory: every
ASCII character takes one byte instead of two. Is this argument still as
important as it was 40 years ago?
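
To put numbers on the memory argument, the following Python sketch (with a
hypothetical helper, storage_bytes) compares the storage needed for a short
ASCII string in UTF-8 with two fixed-width layouts. It counts only the
encoded characters and ignores any per-string overhead:

```python
# Illustrative sketch only; storage_bytes is a hypothetical helper.
def storage_bytes(text: str) -> dict:
    """Bytes needed to store text in each encoding, without a byte order mark."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
        "utf-32": len(text.encode("utf-32-le")),
    }

sizes = storage_bytes("Modula-3")
for name, size in sizes.items():
    print(f"{name}: {size} bytes")
```

For pure ASCII text the 16-bit layout takes twice the space of UTF-8 and the
32-bit layout four times as much.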

Roland


