[M3devel] text inefficiency? good mutable string type? arrays?
Tony Hosking
hosking at cs.purdue.edu
Wed Feb 27 03:49:50 CET 2008
On Feb 26, 2008, at 6:46 PM, Jay wrote:
> Tony, what about the need to copy? The inability to iterate through
> characters without a function call per character? I guess though,
> from what you pointed out, a read only direct pointer is viable. And
> that removes the 256 character stack usage -- which is a bit piggy
> as it is. So then..I guess the next step in some compromise here is
> that it should perhaps be viable to get a read only pointer to a
> string (maybe already the case), compute the required size of the
> new one (follows naturally), and then be able to split up allocation
> and filling in of a string, a function to allocate and get a direct
> pointer, then fill in your data, then "seal" the text making it read
> only. And I guess really you can just carefully use the "subversive"
> method here -- up to you to not pass the "unsealed" text off to
> anyone else until you are done writing to it.
Have you looked at TextConv.i3? Not sure if that has some of what you
are needing.
> Then just to deal with the varying representation but as I was
> suggesting, maybe that can be forced somehow.
> Maybe creation can specify an encoding, or the "get buffer" function
> could -- though ideally there is just one, not two.
>
> The stack vs. heap behavior is also not ideal in that it's nice for
> the performance of something to scale linearly with input, not
> suddenly slow way down when data exceeds some arbitrary threshold.
> Granted, a lot of code just starts failing, so merely slowing down
> is "progress". As well, eventually you end up swapping to disk, if
> you don't first run out of address space, so there will be limits at
> which things slow down or fail, just that they ought to be fairly
> large.
Why do you think that heap allocation is slow in general?
> I also rather hoped/assumed that compile time text constants were
> formed constant-ly by the compiler, that the compiler knows their
> runtime layout. I realize the division of labor between compiler and
> runtime can be made at varying points, with varying resulting
> flexibility, but also varying resulting efficiency.
Compile-time literals *are* formed constantly by the compiler. They
are TextLiteral.T.
>
>
> - Jay
> From: hosking at cs.purdue.edu
> To: m3devel at elegosoft.com
> Date: Tue, 26 Feb 2008 17:43:17 -0500
> Subject: Re: [M3devel] text inefficiency? good mutable string type?
> arrays?
>
> I suppose the question here is how often you exceed the 256-
> character stack-allocated buffer size? If not often then
> occasionally allocating in the heap is not a huge problem. I'm not
> too fussed about this anyway because generational GC will quickly
> reclaim things that die as young as this buffer does. I think you
> underestimate the effectiveness of modern GC algorithms. Anyway,
> optimizing here for short nm texts presumably handles the common
> case. Why work hard to optimize the uncommon case?
>
> On Feb 25, 2008, at 10:38 AM, Jay wrote:
>
> I know this area has been brewing under the surface a while.
> I'll bring it up. :)
>
> Here is some apparently typical code:
>
> PROCEDURE Escape (nm: TEXT): TEXT =
> VAR len: INTEGER; buf: ARRAY [0..255] OF CHAR;
> BEGIN
> IF (nm = NIL) THEN RETURN NIL; END;
> len := Text.Length (nm);
> IF (len + len <= NUMBER (buf))
> THEN RETURN DoEscape (nm, len, buf);
> ELSE RETURN DoEscape (nm, len, NEW (REF ARRAY OF CHAR, len +
> len)^);
> END;
> END Escape;
>
> PROCEDURE DoEscape (nm: TEXT; len: INTEGER; VAR buf: ARRAY OF
> CHAR): TEXT =
> VAR n_escapes := 0; src, dest: INTEGER; c: CHAR;
> BEGIN
> Text.SetChars (buf, nm);
> FOR i := 0 TO len-1 DO
> IF (buf[i] = BackSlash) THEN INC (n_escapes); END;
> END;
> IF (n_escapes = 0) THEN RETURN nm; END;
> src := len - 1;
> dest := src + n_escapes;
> WHILE (src > 0) DO
> c := buf[src]; DEC (src);
> buf[dest] := c; DEC (dest);
> IF (c = BackSlash) THEN buf[dest] := BackSlash; DEC (dest);
> END;
> END;
> RETURN Text.FromChars (SUBARRAY (buf, 0, len + n_escapes));
> END DoEscape;
>
>
> Look at how inefficient this is.
> For a long string it makes a heap allocation, even if in the end it
> makes no changes to the string.
> For any length string it copies it out to an array.
> Then if there are any changes, it copies it again, probably into
> heap, likely the input immediately becoming garbage.
> Heap allocation may be fast, but building up garbage and causing
> more work for the garbage collector I assume is bad.
> And if the string had no backslashes, it's all a big waste.
>
> I assume texts are read only?
>
> I know lots of systems have read only strings.
> There are pluses and minus to them. They can be single-instanced.
> Some systems with read only strings have another type, such as
> "StringBuilder" or "StringBuffer".
>
>
> So -- I don't have a specific question, but maybe a mutable string
> "class" aka "type" is needed?
> Not necessarily in the language but in m3core or libm3?
> Maybe it's already there?
>
> I just spent a few hours diddling with arrays of chars.
> Unfortunately they are not resizable.
> It was not entirely satisfactory. Besides the array I had to pass
> around a length and a start.
> Wrapping this up in one record TYPE MutableString= ... might be a
> good idea.
>
> For more efficient read only access, would it be reasonable for the
> runtime to materialize on-demand 8 bit and 16 bit representations of
> a string if a user calls some new thing like Text.GetDirectA (t:
> TEXT) : REF ARRAY OF CHAR, Text.GetDirectW (t: TEXT) : REF ARRAY OF
> WIDECHAR? Throw an exception if the string cannot be represented
> with 8 bit characters? Or use utf8?
>
> Besides, I know I'm a big predictable whiner but I like how this
> works in Windows..
> It may not have been as seamless, but it works and it really doesn't
> tend to break or slow down existing code.
> roughly:
> "foo" is an 8 bit string of type char* (or const char* or possibly
> const char[4])
> L"foo" is an 16 bit string of type wchar_t* (ditto, and aka WCHAR)
> "L" for "long" aka "wide"
>
> Functions must be written to specifically use one or the other.
> In C++ you can templatize. There is std::string and std::wstring
> that are template instantiations.
> Lots of C functions are duplicated. strcpy => wcscpy, strcat =>
> wcscat, etc.
> And really there's no point in 8 bit strings.
> If you have a blob, that's an array of bytes or something, not
> characters.
>
> It works.
>
> Utf8 is another seemingly popular route but I think it's a hack.
> I think mostly people don't touch their code and say, voila, it's
> utf8, and they only really support the same old English or possibly
> 8 bit characters (some European characters).
> Granted, to some extent, this does work, as long as you don't do
> anything with the string but strlen, strcpy, and some others and
> pass it to code that does treat it correctly. Still, variably sized
> encodings seem like a bad idea here, and 16 bits per character seem
> affordable enough. And yes, I know that Unicode is actually know 20
> bits per character and some characters take two wchar_ts but I try
> to ignore that...
> And I know there is a lot of existing code, but sometimes there is a
> need for progress too...
>
> Ok, maybe NOW I'll look at the cygwin/shobjgen problem. :)
>
>
> - Jay
>
>
> Helping your favorite cause is as easy as instant messaging. You IM,
> we give. Learn more.
>
>
> Need to know the score, the latest news, or you need your Hotmail®-
> get your "fix". Check it out.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://m3lists.elegosoft.com/pipermail/m3devel/attachments/20080226/08cce521/attachment-0002.html>
More information about the M3devel
mailing list