[M3devel] This disgusting TEXT business

Wed Dec 24 00:22:35 CET 2008

I hear three problems with CM3 TEXT:

1) WIDECHAR and the TEXT implementation won't handle Unicode values that
    exceed 216-1.

2) The CM3 TEXT implementation has serious inefficiencies in at least some
    realistic cases.

3) We want some kind of compatibility with other software.

As for 2), the implementation could be greatly improved without changing
the existing TEXT abstraction.  It could even be improved a lot without
changing the existing data structures, only the algorithms.

Back in March, I conjectured about improving get_chars for a concatenation
by recursing on only one side and iterating on the other, but I never did
anything about it.  Text.Cat could flatten concatenations that are short
into one of the atomic representations.  It could perhaps do a two-level
rebalance of concatenation trees.  All these would help at least somewhat
with the performance consequences of building a string one character at
a time by concatenation.

And of course, ignoring the needs of those who have violated the TEXT
abstraction, we could come up with entirely new internal representations.
We could even just add some new ones and support mixtures of both the old
and new ones.  This would really just be an expansion of what we already
have, which has several representations, with all the operations dynamically
checking and adapting to them.

On the other hand, I do know from hard experience, that the implementation
size and complexity go up surprisingly as the number of alternative
representations goes up.  It's really a sort of cartesion product of
operations, operands for each operation, and representations for each
operand.

Something that has not been mentioned is the _space_ overhead of the
existing concatenation scheme.  Proponents of the pure or nearly-pure
OO language design philosophy (e.g. Smalltalk, Java) seem pretty
oblivious to how much you lose with lots of separately-allocated
heap objects connected by pointers.

Aside from the extra pointers, there is allocator/GC overhead
per object and fragmentation loss.  Then there's loss of
reference locality, leading to bigger working sets, which spills
back into time overhead.  And with heap-allocated open arrays,
there are additional 1+NoOfDimensions words as well.  So I am
strongly in favor of cutting down on the number of separate objects,
wherever reasonable.

Rodney Bates