[M3devel] This disgusting TEXT business
Mika Nystrom
mika at async.caltech.edu
Wed Dec 24 01:31:50 CET 2008
Might I suggest an interface somewhat like the Rd/Wr interfaces?
One could provide a buffer into which characters are read en masse
instead of going through the method calls for each character..
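
Roughly the shape I have in mind, just as a sketch (the interface and
procedure names here are made up, not anything that exists in CM3
today):

   INTERFACE TextExtract;

   (* Copy up to NUMBER(buf) characters of "t", starting at index
      "start", into "buf"; return how many characters were actually
      copied.  One call replaces up to NUMBER(buf) calls of
      Text.GetChar. *)
   PROCEDURE GetSub (t: TEXT; start: CARDINAL;
                     VAR (*OUT*) buf: ARRAY OF CHAR): CARDINAL;

   END TextExtract.

A caller that wants to scan a long TEXT could then pull it out in,
say, 8K chunks instead of one character at a time.
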
Of course, for equality this isn't necessarily more efficient, since
you can often take an early out, although I believe that Text.Equal
(when performance matters) usually does return TRUE....
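
For instance, an equality test can bail out on the lengths before it
looks at any characters.  Something like this (a sketch only, not the
actual Text.Equal code; assumes IMPORT Text):

   PROCEDURE Equal (t, u: TEXT): BOOLEAN =
     VAR len := Text.Length (t);
     BEGIN
       IF len # Text.Length (u) THEN RETURN FALSE END;  (* early out *)
       FOR i := 0 TO len - 1 DO
         IF Text.GetChar (t, i) # Text.GetChar (u, i) THEN
           RETURN FALSE                                 (* early out *)
         END
       END;
       RETURN TRUE
     END Equal;

When the answer is TRUE, of course, every character gets compared, and
that is exactly where bulk access would pay off.
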
I think it is important that basic TEXT operations be very, very
fast. Many people care about that, and many of those who don't
wouldn't be using a compiled language anyhow.
Mika
Darko writes:
>I agree with your list of problems, but you don't need a Cartesian
>product of operations to support string operations between multiple
>representations. All you need is iteration through the characters of a
>string and concatenation with another string, for each representation.
>The object interface I proposed is little more than a wrapper for the
>existing Text interface plus the iteration method. The getData and
>setData methods in that object provide an interface to external APIs.
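
(For concreteness, such an object might look roughly like this; apart
from getData and setData, the method names below are invented for
illustration rather than taken from the actual proposal:

   TYPE
     T = OBJECT METHODS
       length (): CARDINAL;
       iterate (p: PROCEDURE (c: WIDECHAR));        (* visit each character *)
       cat (other: T): T;                           (* concatenation *)
       getData (VAR buf: ARRAY OF CHAR): CARDINAL;  (* for external APIs *)
       setData (READONLY buf: ARRAY OF CHAR);
     END;

Each representation would be a subtype overriding these methods, so
two different representations interoperate through iterate and cat
alone.)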
>
>
>On 24/12/2008, at 8:22 AM, Rodney M. Bates wrote:
>
>> I hear three problems with CM3 TEXT:
>>
>> 1) WIDECHAR and the TEXT implementation won't handle Unicode
>> values that exceed 2^16 - 1.
>>
>> 2) The CM3 TEXT implementation has serious inefficiencies in
>> at least some realistic cases.
>>
>> 3) We want some kind of compatibility with other software.
>>
>> As for 2), the implementation could be greatly improved without
>> changing the existing TEXT abstraction. It could even be improved
>> a lot without changing the existing data structures, only the
>> algorithms.
>>
>> Back in March, I conjectured about improving get_chars for a
>> concatenation by recursing on only one side and iterating on the
>> other, but I never did anything about it. Text.Cat could flatten
>> short concatenations into one of the atomic representations. It
>> could perhaps do a two-level rebalance of concatenation trees.
>> All of these would help, at least somewhat, with the performance
>> consequences of building a string one character at a time by
>> concatenation.
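
(To make the get_chars idea concrete, here is a sketch against a toy
model of the representation; the types and field names are invented
here and are not the real ones from the CM3 runtime:

   TYPE
     Str  = OBJECT END;
     Flat = Str OBJECT chars: REF ARRAY OF CHAR END;
     Cat  = Str OBJECT left, right: Str; leftLen: CARDINAL END;

   PROCEDURE GetChars (s: Str; start: CARDINAL;
                       VAR (*OUT*) buf: ARRAY OF CHAR) =
     (* Copy characters s[start..] into buf, stopping at the end of
        either.  Cat nodes are walked with a loop; a recursive call is
        made only for the left child, and only when the wanted range
        straddles both children, so a right-leaning tree needs no
        recursion at all.  A smarter version would recurse into
        whichever child is smaller. *)
     VAR got := 0;
     BEGIN
       LOOP
         IF got >= NUMBER (buf) THEN EXIT END;
         TYPECASE s OF
         | NULL => EXIT               (* treat NIL as an empty string *)
         | Flat (f) =>
             WITH n = MIN (NUMBER (buf) - got,
                           MAX (0, NUMBER (f.chars^) - start)) DO
               SUBARRAY (buf, got, n) := SUBARRAY (f.chars^, start, n)
             END;
             EXIT
         | Cat (c) =>
             IF start >= c.leftLen THEN
               DEC (start, c.leftLen);   (* skip the left child...      *)
               s := c.right              (* ...and iterate, not recurse *)
             ELSE
               WITH n = MIN (NUMBER (buf) - got, c.leftLen - start) DO
                 GetChars (c.left, start, SUBARRAY (buf, got, n));
                 INC (got, n)
               END;
               start := 0;
               s := c.right
             END
         ELSE
           EXIT
         END
       END
     END GetChars;

The flattening idea is independent of this: Text.Cat could notice that
the combined length is below some threshold and build a flat result on
the spot instead of a Cat node.)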
>>
>> And of course, ignoring the needs of those who have violated the
>> TEXT abstraction, we could come up with entirely new internal
>> representations. We could even just add some new ones and support
>> mixtures of both the old and new ones. This would really just be
>> an expansion of what we already have, which has several
>> representations, with all the operations dynamically checking and
>> adapting to them.
>>
>> On the other hand, I do know from hard experience that the
>> implementation size and complexity go up surprisingly fast as the
>> number of alternative representations goes up. It's really a sort
>> of Cartesian product of operations, operands for each operation,
>> and representations for each operand.
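
(To put a number on that: with k internal representations, a binary
operation such as Equal or Cat has, in the worst case, k * k pairs of
representations to handle, and with m such operations that is on the
order of m * k^2 cases to write and to test.  Restricting the
cross-representation work to iteration and concatenation, as Darko
suggests, brings that back down to roughly m * k.)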
>>
>> Something that has not been mentioned is the _space_ overhead of the
>> existing concatenation scheme. Proponents of the pure or nearly-pure
>> OO language design philosophy (e.g. Smalltalk, Java) seem pretty
>> oblivious to how much you lose with lots of separately-allocated
>> heap objects connected by pointers.
>>
>> Aside from the extra pointers, there is allocator/GC overhead
>> per object and fragmentation loss. Then there's loss of
>> reference locality, leading to bigger working sets, which spills
>> back into time overhead. And with heap-allocated open arrays,
>> there are additional 1+NoOfDimensions words as well. So I am
>> strongly in favor of cutting down on the number of separate objects,
>> wherever reasonable.
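
(Rough numbers, purely illustrative since the exact word counts depend
on the implementation and the platform: building an N-character string
by N calls of Text.Cat leaves on the order of N concatenation nodes
behind.  If each node costs one allocator/GC header word plus two
child pointers plus a cached length, that is roughly 4 words, i.e. 16
to 32 bytes, of overhead per character of payload, before counting
fragmentation and lost locality.  A single flat array holding the same
N characters pays the header plus the 1+NoOfDimensions open-array
words once for the whole string, plus the N bytes of characters
themselves.)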
>>
>> Rodney Bates