[M3devel] This disgusting TEXT business

Wed Dec 24 01:47:36 CET 2008

The proposed interface is like Rd/Wr in that it has a procedural  
interface on an opaque type that is revealed to be implemented as an  
object for each type of Rd/Wr. But the idea is to avoid the "one true"  
representation (which what the buffer is) because efficiency means  
different things to different people. If you do a lot of Unicode text  
and an UTF-32 representation is probably fastest. If you want to save  
space or you deal mostly with ASCII then UTF-8 is probably best.

Going through each character is needed if you don't know the  
representation of the string you are dealing with. If you do know the  
format of the string you can do other operations. Since most  
applications will use one representation you will mostly be doing  
operations between strings of the same representations, which will be  
very fast, for instance you might call a local OS API to do the work.

Having application defined representations means that you can adapt  
the string representation and operations to what you need to give you  
the performance you're after, or to choose one from an existing  
library that works best.

On 24/12/2008, at 9:31 AM, Mika Nystrom wrote:

> Might I suggest an interface somewhat like the Rd/Wr interfaces?
> One could provide a buffer into which characters are read en masse
> instead of going through the method calls for each character..
>
> Of course for equality this isn't necessarily more efficient since
> you often can do an early-out: although I believe that Text.Equal
> (when performance matters) usually does return TRUE....
>
> I think it is important that basic TEXT operations be very, very
> fast.  Many people care about that.  Many of those that don't
> wouldn't be using a compiled language anyhow.
>
>    Mika
>
> Darko writes:
>> I agree with your list of problems, but you don't need a cartesian
>> product of operations for string operations between multiple
>> representations. All you need is iteration through the characters  
>> of a
>> string and concatenation with another string for each representation.
>> The object interface I proposed is little more than a wrapper for the
>> existing Text interface plus the iteration method. The getData and
>> setData methods in that object provide an interface to external APIs.
>>
>>
>> On 24/12/2008, at 8:22 AM, Rodney M. Bates wrote:
>>
>>> I hear three problems with CM3 TEXT:
>>>
>>> 1) WIDECHAR and the TEXT implementation won't handle Unicode values
>>> that
>>>  exceed 216-1.
>>>
>>> 2) The CM3 TEXT implementation has serious inefficiencies in at
>>> least some
>>>  realistic cases.
>>>
>>> 3) We want some kind of compatibility with other software.
>>>
>>> As for 2), the implementation could be greatly improved without
>>> changing
>>> the existing TEXT abstraction.  It could even be improved a lot
>>> without
>>> changing the existing data structures, only the algorithms.
>>>
>>> Back in March, I conjectured about improving get_chars for a
>>> concatenation
>>> by recursing on only one side and iterating on the other, but I
>>> never did
>>> anything about it.  Text.Cat could flatten concatenations that are
>>> short
>>> into one of the atomic representations.  It could perhaps do a two-
>>> level
>>> rebalance of concatenation trees.  All these would help at least
>>> somewhat
>>> with the performance consequences of building a string one character
>>> at
>>> a time by concatenation.
>>>
>>> And of course, ignoring the needs of those who have violated the  
>>> TEXT
>>> abstraction, we could come up with entirely new internal
>>> representations.
>>> We could even just add some new ones and support mixtures of both
>>> the old
>>> and new ones.  This would really just be an expansion of what we
>>> already
>>> have, which has several representations, with all the operations
>>> dynamically
>>> checking and adapting to them.
>>>
>>> On the other hand, I do know from hard experience, that the
>>> implementation
>>> size and complexity go up surprisingly as the number of alternative
>>> representations goes up.  It's really a sort of cartesion product of
>>> operations, operands for each operation, and representations for  
>>> each
>>> operand.
>>>
>>> Something that has not been mentioned is the _space_ overhead of the
>>> existing concatenation scheme.  Proponents of the pure or nearly- 
>>> pure
>>> OO language design philosophy (e.g. Smalltalk, Java) seem pretty
>>> oblivious to how much you lose with lots of separately-allocated
>>> heap objects connected by pointers.
>>>
>>> Aside from the extra pointers, there is allocator/GC overhead
>>> per object and fragmentation loss.  Then there's loss of
>>> reference locality, leading to bigger working sets, which spills
>>> back into time overhead.  And with heap-allocated open arrays,
>>> there are additional 1+NoOfDimensions words as well.  So I am
>>> strongly in favor of cutting down on the number of separate objects,
>>> wherever reasonable.
>>>
>>> Rodney Bates