[M3devel] This disgusting TEXT business

Sun Dec 21 00:08:57 CET 2008

The right way to do this, IMNSHO, is not to assume any particular
representation of TEXT values, but to create an implementation interface
that allows multiple text representations to be implemented, much
as Rd and Wr make few assumptions about how data is actually
stored or retrieved.

This allows:
- backward compatibility with the existing scheme
- people's favourite representations (I personally don't think UTF-8  
or UTF-16 are much chop as a single solution)
- compatibility with external representation requirements (not having  
to convert each time you call an external API)
- optimisation for particular application implementations

The complication is with allowing representations to change at runtime  
and (automatic) conversion between representations, which I don't  
think adds too much complexity for the possible benefit.
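
A rough sketch of the idea, in Python rather than Modula-3 purely for
brevity. The names TextRep, Latin1Text, Utf8Text and concat are made up
for illustration and are not an existing CM3 interface; the point is only
that clients see one small interface while storage differs underneath, and
that conversion happens only where representations have to be mixed.

```python
from abc import ABC, abstractmethod


class TextRep(ABC):
    """Hypothetical representation-neutral text interface (illustration only)."""

    @abstractmethod
    def length(self) -> int:
        """Number of characters (code points) in the text."""

    @abstractmethod
    def get_char(self, i: int) -> str:
        """Character at index i."""

    def to_str(self) -> str:
        # Generic conversion path built only on the abstract operations.
        return "".join(self.get_char(i) for i in range(self.length()))


class Latin1Text(TextRep):
    """One byte per character; can only hold U+0000..U+00FF."""

    def __init__(self, s: str):
        # Raises UnicodeEncodeError for characters outside Latin-1.
        self._data = s.encode("latin-1")

    def length(self) -> int:
        return len(self._data)

    def get_char(self, i: int) -> str:
        return chr(self._data[i])


class Utf8Text(TextRep):
    """Variable-width storage; indexing by character is not constant time."""

    def __init__(self, s: str):
        self._data = s.encode("utf-8")

    def length(self) -> int:
        # Count every byte that is not a continuation byte (10xxxxxx).
        return sum(1 for b in self._data if b & 0xC0 != 0x80)

    def get_char(self, i: int) -> str:
        # Deliberately naive: decodes the whole buffer on each call.
        return self._data.decode("utf-8")[i]

    def to_str(self) -> str:
        return self._data.decode("utf-8")


def concat(a: TextRep, b: TextRep) -> TextRep:
    """Keep the narrow representation when both operands use it,
    otherwise fall back to UTF-8 (an arbitrary choice for this sketch)."""
    joined = a.to_str() + b.to_str()
    if isinstance(a, Latin1Text) and isinstance(b, Latin1Text):
        return Latin1Text(joined)
    return Utf8Text(joined)


mixed = concat(Latin1Text("na\u00efve "), Utf8Text("\u65e5\u672c"))
narrow = concat(Latin1Text("a"), Latin1Text("b"))
```

Here mixed ends up as a Utf8Text and narrow stays a Latin1Text, so the
cost of conversion is only paid when two different representations meet.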

Also we need a proper UNICODE type that can contain only valid Unicode
characters, since a 16-bit WIDECHAR can't represent them all in one value.
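
To make the WIDECHAR limitation concrete, here is a small sketch (Python
again, with made-up helper names) that assumes WIDECHAR is a 16-bit value,
as the 8/16 bit discussion in this thread suggests. A "valid Unicode
character" is taken to mean a Unicode scalar value: any code point in
0..0x10FFFF except the surrogate range 0xD800..0xDFFF.

```python
MAX_CODE_POINT = 0x10FFFF
SURROGATES = range(0xD800, 0xE000)  # reserved for UTF-16, never characters


def is_unicode_scalar(cp: int) -> bool:
    """True if cp is a value a proper UNICODE type should accept."""
    return 0 <= cp <= MAX_CODE_POINT and cp not in SURROGATES


def utf16_units(cp: int) -> list:
    """Return the 16-bit units needed to store one scalar value."""
    if not is_unicode_scalar(cp):
        raise ValueError("not a Unicode scalar value: %#x" % cp)
    if cp < 0x10000:
        return [cp]  # fits in a single 16-bit unit
    v = cp - 0x10000  # 20 bits, split across a surrogate pair
    return [0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)]


bmp = utf16_units(0x00E9)      # LATIN SMALL LETTER E WITH ACUTE
astral = utf16_units(0x1D11E)  # MUSICAL SYMBOL G CLEF
```

The first call yields one unit, the second yields two, which is exactly
the case a single 16-bit value cannot cover.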

On 21/12/2008, at 4:27 AM, Olaf Wagner wrote:

> Hi everybody,
>
> I haven't followed all the discussions recently, but think it may help
> to comment on a few things here.
>
> IIRC the main reason for the current TEXT implementation was to become
> compatible with Java. Java strings and M3 strings should look the same
> for the embedded JVM written in M3. AFAIK this JVM has never been made
> open source nor is there any intention to do it.
> Still it may be a good idea to adhere to the TEXT representation if
> we ever intend to combine Java and M3 in one runtime.
>
> I don't think the implementation of the combined 8/16 bit TEXTs is
> very mature. I think that some improvements could be made. Has anybody
> looked into the details of the implementation? As is already pointed
> out in the old CM3 notes on the web pages, some things are slower while
> others are now faster. I agree that the implementation is rather
> suboptimal for some common use cases. We'd need a set of use cases and
> tests agreed upon to really compare implementations though. Just one
> test for cat is not enough IMO. This should be the first step for an
> improvement.
>
> I'd also not object to a separation of the 8 and 16 bit implementations,
> but the type WIDECHAR etc. should remain at least for
> compatibility. We could perhaps even add some automatic conversion
> then between TEXT and WIDETEXT types (though this would not be in the
> M3 spirit :-) If we follow this way, the standard TEXTs could be
> replaced by the old PM3 implementation again.
>
> I'd also like to point out that much of the speed of using the old
> implementation comes from using the internal TextF interface, which
> exposes the internal TEXT representation as an array of CHAR. I'm
> not sure if this is really a good idea. Replacing direct array access
> by Text.GetChar was one of the main adaptations needed when converting
> all the standard packages to the CM3 compiler for the first release.
> I'd rather object to exposing this interface again.
>
> We could even define the standard TEXT representation as UTF8 in CM3
> instead of the standard ISO Latin1 code set, which seems a bit euro-
> centric ;-) I'd support this idea, once we have proper UTF8 support
> in CM3 (which we currently haven't if I am not mistaken, see below).
>
> Quoting Dragiša Durić <dragisha at m3w.org>:
>
>> The whole business of mixed TEXTs - concat of TEXT and WIDETEXT is
>> either slow, or produces results that make subsequent operations
>> slow - is where the problem is with this implementation.
>>
>> IMO, the best solution would be to replace the internal representation
>> with UTF-8. For those who are concerned with it - make some external
>> widechar conversion routines available.
>>
>> That way - concat would be as fast as it can be made and other
>> operations would be realistic - it is how almost everybody does their
>> Unicode strings, after all. Almost everybody - excluding mobile
>> industry and Microsoft :-), AFAIK.
>>
>> Compiler changes would make transparent conversion of literals.
>> Everything else is already transparent with regard to internal
>> representation.
>
> I am not sure that it is as easy as that. I'd like to see a proper
> update of the specification before we implement such a change.
>
>> I've sent some UTF-8 routines ages ago to Olaf. IIRC, he was reluctant
>> to accept them as he did not believe C base routines were widespread
>> enough. GNU world has no such reluctance. Everything is UTF8.
>
> No, the problem was that what I got couldn't even be compiled on
> FreeBSD and Solaris. We should not rely on some external GNU libraries,
> but we need to implement the UTF8 functions in CM3. Has anybody done
> that yet?
>
>> If nobody can fill this job in next few weeks, I will probably have
>> some time this winter. No promise, of course, but list will know :-).
>
> I'd be in favour of any improvements of the TEXT implementation and
> UTF8 support in CM3!
>
>> dd
>>
>> On Sat, 2008-12-20 at 19:26 +1100, Tony Hosking wrote:
>>> Hmm, are we just victims of poor implementation?  Does anyone have
>>> the time to improve things?  It would be possible to rip out CM3 TEXT
>>> and replace with PM3, but we'd lose WIDECHAR and WideText.T with that
>>> too.  Not sure who that impacts.
>
> I'd be careful here, too. I think we need to understand the problems
> and the impacts of changes better before we rip out much code.
>
> Olaf
> -- 
> Olaf Wagner -- elego Software Solutions GmbH
>               Gustav-Meyer-Allee 25 / Gebäude 12, 13355 Berlin, Germany
> phone: +49 30 23 45 86 96  mobile: +49 177 2345 869  fax: +49 30 23 45 86 95
>   http://www.elegosoft.com | Geschäftsführer: Olaf Wagner | Sitz: Berlin
> Handelregister: Amtsgericht Charlottenburg HRB 77719 | USt-IdNr: DE163214194
>