[M3devel] Windows, Unicode file names

Wed Jun 27 22:04:59 CEST 2012

On 06/27/2012 05:19 AM, Jay K wrote:
>  > More and more is obvious how ideal structure would be: ARRAY OF CHAR, UTF8 encoded, using SRC M3 Text.Hash().
>
> I don't quite agree.
> There are two ideal approaches.
> 1)
> TEXT is like ARRAY OF CHAR and no values over 0xFF (or maybe even 0x7F)
> "WiDETEXT" is like ARRAY OF WIDECHAR, for 16bit or 32bit WIDECHAR
>
>
> 2) something that can change between them, or possibly store both, but is still mainly flat arrays
> That is, once you store a value over 0xFF, the internal represenation changes to flat array of WIDECHAR.
> Probably it stays that way -- you don't want to thrash back and forth in worst case.
> Lesser evil is probably to stick with wide represenation.
> Setting the string to empty might bounce it back narrow.
> Ditto assigning it from another narrow text, maybe.
>

This is similar to what the cm3 modification of Text does now.  The details of
what goes on inside the implementation are a bit different than you describe.
There can be mixtures of 8-bit string fragments and 16-bit string fragments, plus
other stuff hooking them together.  But the abstraction works just like this.

>
>
> What I don't yet understand in all this is how to efficiently combine thread safety, immutability, and quadratic growth.
>
> The following should be as efficient as in typical C++ libraries:
>
>
> VAR a: TEXT;

   VAR a: TEXT:= " ";

> WHILE TRUE DO
> a := a & " ";
> END;
>

In pm3 Text, this will take quadratic time and linear space.  The partial
strings will be garbage collected, as no copies of the pointers to them
are made.  GetChar is then O(1).

In cm3 Text, this is linear in both time and space, but the space usage
has a much higher constant factor than in pm3.  In pm3, the asymptotic space used
is exactly what the characters themselves require, i.e, one byte per character.
For cm3, I count 21 native words per character, plus fragmentation loss
for 3 separate heap objects per character.  That's 84 times or 168 times,
depending on word size.  Well, lots of people keep saying RAM is virtually
free these days.  I guess we really need to hope they are right.
GetChar is O(n) when the string is built linearly like this.
Best case is O(log n) when built by Cats of single characters.

My modification of cm3 Text lies between these.  It flattens strings
up to a point, then does some imperfect balancing of them higher in trees.

Frankly, I think I like going back to the pm3 implementation best.

>
> I kind of thing that immutability and quadratic growth are in conflict.

They are, to a considerable extent, as with all functional-style data structures.
But more sophisticated (i.e., complicated) implementations can mitigate somewhat.

> But not because that sounds obvious.
> Note that typical C++ libraries do have value semantics for std::string and std::vector.
>
>
> - Jay
>
>
>  > From: dragisha at m3w.org
>  > Date: Wed, 27 Jun 2012 11:52:53 +0200
>  > To: mika at async.caltech.edu
>  > CC: m3devel at elegosoft.com
>  > Subject: Re: [M3devel] Windows, Unicode file names
>  >
>  > More and more is obvious how ideal structure would be: ARRAY OF CHAR, UTF8 encoded, using SRC M3 Text.Hash().
>  >
>  > What we need is to make compler map from input encoding (whatever user chooses or is choosen for him) to internal UTF8.
>  >
>  > On Jun 26, 2012, at 8:50 PM, Mika Nystrom wrote:
>  >
>  > >
>  > > As far as I know, SRC M3 and PM3 come with a TEXT implementation that
>  > > works exactly as described below. An extra byte is used at the end with
>  > > a character VAL(0,CHAR). The Texts are simply arrays of 8-bit characters.
>  > >
>  > > One of the big advantages of the old version is that Text.Hash is really,
>  > > really fast. Especially on Alphas... it's hugely more expensive to
>  > > have hash tables (i.e., Modula-3 generic Tables) keyed on Texts under
>  > > CM3 than under the old compilers and runtimes. We're talking a factor
>  > > of five or so in speed since the Table routines are generally entirely
>  > > dominated by Text.Hash.
>  > >
>  > > Mika
>  > >
>  > > Hendrik Boom writes:
>  > >> On Mon, Jun 25, 2012 at 08:46:18PM +0000, Jay K wrote:
>  > >>>
>  > >>> Somewhat but not fully. Text.Length should fetch a stored length. As
>  > >>> I'm sure it already does.That length should always be correctly
>  > >>> maintained. Same as today.Adding one extra nul at the end doesn't
>  > >>> invalidate the data.std::string has the same properties -- c_str() can
>  > >>> on-demand append a terminal nul,but there could also be one in the
>  > >>> string itself.I understand it is a bit wierd. Maintaining a terminal
>  > >>> nul does add cost that might be wasted.And reduces the capacity by
>  > >>> one.It could be on-demand, I guess. - Jay
>  > >>
>  > >> Don't need the 'on demand'. For the benefits of C interoperability, the
>  > >> extra byte is well worth the price. What I'm worrying about is someone
>  > >> using an enbedded NUL as an end-of-string marker. I smell more bugs
>  > >> creeping in. But I guess bug are inherent in C use, so I'm not
>  > >> surprised seeing them in C interoperation.
>  > >>
>  > >> -- hendrik
>  >