[M3announce] New TEXT algorithms checked in

Thu May 23 23:34:02 CEST 2013

I have checked in changes to the Text implementation in the head.
These make no change to the cm3 data structure or its invariants, only
different choices of alternate representations for a given abstract
character string.  The major changes are in Cat, which, while O(1),
created representations inefficient to access.  This was particularly
bad (O(n)) after building a TEXT value by a linear series of concatenations,
either left-to-right or right-to-left.  These were also very extravagant in
heap space usage.

To get the changes built and running on your machine, just build and ship
m3core, using cm3 or scripts/do-cm3-min.sh.

As you might expect with any purely functional style abstraction, there
are gains and losses in space and time performance, depending greatly on
the usage pattern.  Overall, there are mostly small to large net gains
on balanced mixtures of creating and accessing strings, both in time
and space.  When strings are built linearly, net speed gains around 4-to-1
are common.

The new algorithms are extensively tested on LINUXLIBC6 and AMD64_LINUX.
Beyond native word size, there is little reason to expect platform-dependent
bugs.  Of course, anything can happen.  If you know or suspect there are bugs,
you can disable the new algorithms by importing and setting TextClass.Old:=TRUE.
This will use all the old algorithms instead.  This will suffer small,
constant-factor time losses relative to the unmodified implementation because
of runtime testing of TextClass.Old and also a good bit of gathering of raw
performance statistics.

You can turn TextClass.Old on and off at will, whenever no thread is executing
a Text operation.  The results of old and new algorithms are fully interchangeable
as operands of any Text operation.

You can use other global variables in TextClass to tune the algorithms.
Setting TextClass.Flatten:=FALSE disables partial flattening of concatenation
trees.  Otherwise, TextClass.MaxFlat8, and TextClass.MaxFlatWide set the
maximum lengths of internal open arrays of CHAR and WIDECHAR, respectively.
These latter are limits on how much flattening is done.

You can approximately simulate the behavior of older pm3 Text implementation
by setting these to LAST(INTEGER).  This will always fully flatten every
concatenated string.  Differences from pm3 are that the cm3 data structures
require that a separate heap object be in front of the open array, and that
this will handle WIDECHAR elements in a TEXT.

There is also an extensive test program in m3-libs/m3core/tests/newtext/src.
build it and run with -h to see its options.  It does large numbers of
random string operations, running the old and new algorithms side-by-side
and comparing the abstract values of their results  It also reports a lot of
statistics on time and space usage.  Retained heap storage numbers are now
running high, apparently due to the inability to force garbage collection
to complete.