[M3devel] Windows, Unicode file names

Jay K jay.krell at cornell.edu
Thu Jun 28 07:31:04 CEST 2012


  > Random access by *character* number is easy and, hopefully, implemented
 > with efficiency at least better than O(n).
 

Random access by "something, not 'character'" should be O(1).


> > I kind of thing that immutability and quadratic growth are in conflict.
>
> They are, to a considerable extent, as with all functional-style data structures.
> But more sophisticated (i.e., complicated) implementations can mitigate somewhat.

I'm hoping we can win here somehow.
In Java and C# they solve this by having, in a sense, two string types.
constant "string"s
an mutable "StringBuffer"s
Strings never grow. They are always flat.
StringBuffers grow quadratically. They are always flat. They are mutable.

I suspect we need do something similar. Somehow.

As I understand, C# and Java do expose string concatenation.
As I understand, they are similar to Modula-3 here, in that the compiler knows
about string concatenation and rewrites the code somewhat.
Thinking about it further, I suspect my example also can't/doesn't run performantly in Java or C# either.


Hopefully we can come up with some good solution to this.
I have to run.


 - Jay


----------------------------------------
> Date: Wed, 27 Jun 2012 15:04:59 -0500
> From: rodney_bates at lcwb.coop
> To: m3devel at elegosoft.com
> Subject: Re: [M3devel] Windows, Unicode file names
>
>
>
> On 06/27/2012 05:19 AM, Jay K wrote:
> > > More and more is obvious how ideal structure would be: ARRAY OF CHAR, UTF8 encoded, using SRC M3 Text.Hash().
> >
> > I don't quite agree.
> > There are two ideal approaches.
> > 1)
> > TEXT is like ARRAY OF CHAR and no values over 0xFF (or maybe even 0x7F)
> > "WiDETEXT" is like ARRAY OF WIDECHAR, for 16bit or 32bit WIDECHAR
> >
> >
> > 2) something that can change between them, or possibly store both, but is still mainly flat arrays
> > That is, once you store a value over 0xFF, the internal represenation changes to flat array of WIDECHAR.
> > Probably it stays that way -- you don't want to thrash back and forth in worst case.
> > Lesser evil is probably to stick with wide represenation.
> > Setting the string to empty might bounce it back narrow.
> > Ditto assigning it from another narrow text, maybe.
> >
>
> This is similar to what the cm3 modification of Text does now. The details of
> what goes on inside the implementation are a bit different than you describe.
> There can be mixtures of 8-bit string fragments and 16-bit string fragments, plus
> other stuff hooking them together. But the abstraction works just like this.
>
> >
> >
> > What I don't yet understand in all this is how to efficiently combine thread safety, immutability, and quadratic growth.
> >
> > The following should be as efficient as in typical C++ libraries:
> >
> >
> > VAR a: TEXT;
>
> VAR a: TEXT:= " ";
>
> > WHILE TRUE DO
> > a := a & " ";
> > END;
> >
>
> In pm3 Text, this will take quadratic time and linear space. The partial
> strings will be garbage collected, as no copies of the pointers to them
> are made. GetChar is then O(1).
>
> In cm3 Text, this is linear in both time and space, but the space usage
> has a much higher constant factor than in pm3. In pm3, the asymptotic space used
> is exactly what the characters themselves require, i.e, one byte per character.
> For cm3, I count 21 native words per character, plus fragmentation loss
> for 3 separate heap objects per character. That's 84 times or 168 times,
> depending on word size. Well, lots of people keep saying RAM is virtually
> free these days. I guess we really need to hope they are right.
> GetChar is O(n) when the string is built linearly like this.
> Best case is O(log n) when built by Cats of single characters.
>
> My modification of cm3 Text lies between these. It flattens strings
> up to a point, then does some imperfect balancing of them higher in trees.
>
> Frankly, I think I like going back to the pm3 implementation best.
>
> >
> > I kind of thing that immutability and quadratic growth are in conflict.
>
> They are, to a considerable extent, as with all functional-style data structures.
> But more sophisticated (i.e., complicated) implementations can mitigate somewhat.
>
> > But not because that sounds obvious.
> > Note that typical C++ libraries do have value semantics for std::string and std::vector.
> >
> >
> > - Jay
> >
> >
> > > From: dragisha at m3w.org
> > > Date: Wed, 27 Jun 2012 11:52:53 +0200
> > > To: mika at async.caltech.edu
> > > CC: m3devel at elegosoft.com
> > > Subject: Re: [M3devel] Windows, Unicode file names
> > >
> > > More and more is obvious how ideal structure would be: ARRAY OF CHAR, UTF8 encoded, using SRC M3 Text.Hash().
> > >
> > > What we need is to make compler map from input encoding (whatever user chooses or is choosen for him) to internal UTF8.
> > >
> > > On Jun 26, 2012, at 8:50 PM, Mika Nystrom wrote:
> > >
> > > >
> > > > As far as I know, SRC M3 and PM3 come with a TEXT implementation that
> > > > works exactly as described below. An extra byte is used at the end with
> > > > a character VAL(0,CHAR). The Texts are simply arrays of 8-bit characters.
> > > >
> > > > One of the big advantages of the old version is that Text.Hash is really,
> > > > really fast. Especially on Alphas... it's hugely more expensive to
> > > > have hash tables (i.e., Modula-3 generic Tables) keyed on Texts under
> > > > CM3 than under the old compilers and runtimes. We're talking a factor
> > > > of five or so in speed since the Table routines are generally entirely
> > > > dominated by Text.Hash.
> > > >
> > > > Mika
> > > >
> > > > Hendrik Boom writes:
> > > >> On Mon, Jun 25, 2012 at 08:46:18PM +0000, Jay K wrote:
> > > >>>
> > > >>> Somewhat but not fully. Text.Length should fetch a stored length. As
> > > >>> I'm sure it already does.That length should always be correctly
> > > >>> maintained. Same as today.Adding one extra nul at the end doesn't
> > > >>> invalidate the data.std::string has the same properties -- c_str() can
> > > >>> on-demand append a terminal nul,but there could also be one in the
> > > >>> string itself.I understand it is a bit wierd. Maintaining a terminal
> > > >>> nul does add cost that might be wasted.And reduces the capacity by
> > > >>> one.It could be on-demand, I guess. - Jay
> > > >>
> > > >> Don't need the 'on demand'. For the benefits of C interoperability, the
> > > >> extra byte is well worth the price. What I'm worrying about is someone
> > > >> using an enbedded NUL as an end-of-string marker. I smell more bugs
> > > >> creeping in. But I guess bug are inherent in C use, so I'm not
> > > >> surprised seeing them in C interoperation.
> > > >>
> > > >> -- hendrik
> > >
 		 	   		  


More information about the M3devel mailing list