[M3devel] This disgusting TEXT business

Sun Dec 21 11:33:18 CET 2008

The most important argument for UTF-8, IMO, is its completeness.
A 2-byte WIDECHAR is not complete. Period. Be it a Mahjong tile or some
obscure Chinese glyph, 16 bits is not enough. Would we do some 21-bit
storage magic instead?
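
To make the completeness point concrete, here is a small illustration in
Python (chosen only for brevity; nothing here is Modula-3 library code). It
checks that U+1F000, the first code point of the Mahjong Tiles block, does
not fit in 16 bits, and counts how many code units each encoding needs.

```python
# U+1F000 MAHJONG TILE EAST WIND lies outside the 16-bit range.
tile = "\U0001F000"

code_point = ord(tile)
fits_in_16_bits = code_point <= 0xFFFF

# UTF-16 needs two 16-bit units (a surrogate pair); UTF-8 needs four bytes.
utf16_units = len(tile.encode("utf-16-be")) // 2
utf8_bytes = len(tile.encode("utf-8"))

# The highest Unicode code point, U+10FFFF, needs 21 bits.
max_bits = (0x10FFFF).bit_length()

print(hex(code_point), fits_in_16_bits, utf16_units, utf8_bytes, max_bits)
```

So a fixed 16-bit character is either incomplete or stops being fixed-width
once surrogate pairs are allowed.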

Accepting 2 bytes would be the same mistake all over again. Once, Latin-1
was the whole world. Is it 2 bytes now?

As for the "length" argument: the length is cacheable data. Since our TEXT
is an object type, we can do such things.
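
A minimal sketch of what caching the length could look like, assuming an
immutable text object that stores UTF-8 bytes. The class name Utf8Text and
its method are hypothetical and written in Python for illustration; this is
not the actual TEXT implementation.

```python
class Utf8Text:
    """Hypothetical immutable text object stored as UTF-8 bytes."""

    def __init__(self, data: bytes):
        self._data = data
        self._length = None  # character count, filled in on first request

    def length(self) -> int:
        if self._length is None:
            # Every byte that is not a continuation byte (0b10xxxxxx)
            # starts a new character, so count those once and remember it.
            self._length = sum(1 for b in self._data if (b & 0xC0) != 0x80)
        return self._length


name = Utf8Text("Dragiša Durić".encode("utf-8"))
print(name.length())  # first call scans the bytes
print(name.length())  # later calls return the cached value
```

The scan happens at most once per object, so repeated length queries cost
the same as with a fixed-width representation.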

As for "substr", it is a sound counterargument. Internal optimizations
are possible, of course. For lengthy strings, we can implement them, and
efficiently.
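
One possible optimization of this kind, sketched in Python under stated
assumptions: keep a sparse table with the byte offset of every 64th
character, so that finding character n means one table lookup plus a scan of
at most 63 characters. The names STRIDE, build_index and char_offset are
made up for this example; it is one option among several, not a description
of any existing implementation.

```python
STRIDE = 64  # hypothetical checkpoint spacing, in characters


def build_index(data, stride=STRIDE):
    """Return the byte offsets of characters 0, stride, 2*stride, ..."""
    index = []
    chars = 0
    for offset, b in enumerate(data):
        if (b & 0xC0) != 0x80:  # not a continuation byte: a character starts
            if chars % stride == 0:
                index.append(offset)
            chars += 1
    return index


def char_offset(data, index, n, stride=STRIDE):
    """Return the byte offset at which character n (0-based) starts."""
    slot = n // stride
    if n < 0 or slot >= len(index):
        raise IndexError("character index out of range")
    offset = index[slot]
    # Walk forward over the remaining characters after the checkpoint.
    for _ in range(n - slot * stride):
        offset += 1
        while offset < len(data) and (data[offset] & 0xC0) == 0x80:
            offset += 1
        if offset >= len(data):
            raise IndexError("character index out of range")
    return offset


data = ("š" * 10000 + "tail").encode("utf-8")
index = build_index(data)
start = char_offset(data, index, 10000)
print(start, data[start:].decode("utf-8"))
```

The table costs one integer per 64 characters and only pays off for long
strings; short ones can simply be scanned from the start.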

Once, the internal representation of TEXT could be passed to C without
conversion. With UTF-8, that is possible again. The world around us is
not Java, it is C.

And again: it is not storage efficiency that counts, it is
completeness.

On Sat, 2008-12-20 at 19:38 +0100, Roland Illig wrote:
> Dragiša Durić wrote:
> > IMO, the best solution would be to replace the internal representation
> > with UTF-8. For those who need it, make some external widechar
> > conversion routines available.
> > 
> > That way - concat would be as fast as it can be made and other
> > operations would be realistic - it is how almost everybody does their
> > Unicode strings, after all. Almost everybody - excluding mobile industry
> > and Microsoft :-), AFAIK.
> 
> I object. UTF-8 is a good external representation format, but in memory,
> it is far more efficient to let all characters of a string have the same
> size. Java, for example, has defined a character to be "what fits into
> 16 bits". That's good enough in many situations. One of the situations
> it fails is when you want to represent Domino Tiles or Mahjong Tiles as
> characters. <http://www.unicode.org/Public/UNIDATA/Blocks.txt>
> 
> To give you some arguments against UTF-8 as in-memory representation:
> 
> * How fast can you calculate the length of a string? (By length I mean
> the number of characters, not the memory size needed for storing them.)
> 
> * Where does substring(5) start? Where does substring(10000) start, and
> how long does it take to find out?
> 
> An argument for the UTF-8 representation is that it saves memory -- one
> byte for every ASCII character. Is this argument still as important as
> it was 40 years ago?
> 
> Roland
-- 
Dragiša Durić <dragisha at m3w.org>