[M3devel] text inefficiency? good mutable string type? arrays?

Wed Feb 27 03:49:50 CET 2008

On Feb 26, 2008, at 6:46 PM, Jay wrote:

> Tony, what about the need to copy? The inability to iterate through  
> characters without a function call per character? I guess though,  
> from what you pointed out, a read only direct pointer is viable. And  
> that removes the 256 character stack usage -- which is a bit piggy  
> as it is. So then..I guess the next step in some compromise here is  
> that it should perhaps be viable to get a read only pointer to a  
> string (maybe already the case), compute the required size of the  
> new one (follows naturally), and then be able to split up allocation  
> and filling in of a string, a function to allocate and get a direct  
> pointer, then fill in your data, then "seal" the text making it read  
> only. And I guess really you can just carefully use the "subversive"  
> method here -- up to you to not pass the "unsealed" text off to  
> anyone else until you are done writing to it.

Have you looked at TextConv.i3?  Not sure if that has some of what you  
are needing.

> Then just to deal with the varying representation but as I was  
> suggesting, maybe that can be forced somehow.
> Maybe creation can specify an encoding, or the "get buffer" function  
> could -- though ideally there is just one, not two.
>
> The stack vs. heap behavior is also not ideal in that it's nice for  
> the performance of something to scale linearly with input, not  
> suddenly slow way down when data exceeds some arbitrary threshold.  
> Granted, a lot of code just starts failing, so merely slowing down  
> is "progress". As well, eventually you end up swapping to disk, if  
> you don't first run out of address space, so there will be limits at  
> which things slow down or fail, just that they ought to be fairly  
> large.

Why do you think that heap allocation is slow in general?

> I also rather hoped/assumed that compile time text constants were  
> formed constant-ly by the compiler, that the compiler knows their  
> runtime layout. I realize the division of labor between compiler and  
> runtime can be made at varying points, with varying resulting  
> flexibility, but also varying resulting efficiency.

Compile-time literals *are* formed constantly by the compiler.   They  
are TextLiteral.T.

>
>
>  - Jay
> From: hosking at cs.purdue.edu
> To: m3devel at elegosoft.com
> Date: Tue, 26 Feb 2008 17:43:17 -0500
> Subject: Re: [M3devel] text inefficiency? good mutable string type?  
> arrays?
>
> I suppose the question here is how often you exceed the 256- 
> character stack-allocated buffer size?  If not often then  
> occasionally allocating in the heap is not a huge problem.  I'm not  
> too fussed about this anyway because generational GC will quickly  
> reclaim things that die as young as this buffer does.  I think you  
> underestimate the effectiveness of modern GC algorithms.  Anyway,  
> optimizing here for short nm texts presumably handles the common  
> case.  Why work hard to optimize the uncommon case?
>
> On Feb 25, 2008, at 10:38 AM, Jay wrote:
>
> I know this area has been brewing under the surface a while.
> I'll bring it up. :)
>
> Here is some apparently typical code:
>
> PROCEDURE Escape (nm: TEXT): TEXT =
>   VAR len: INTEGER;   buf: ARRAY [0..255] OF CHAR;
>   BEGIN
>     IF (nm = NIL) THEN RETURN NIL; END;
>     len := Text.Length (nm);
>     IF (len + len <= NUMBER (buf))
>       THEN RETURN DoEscape (nm, len, buf);
>       ELSE RETURN DoEscape (nm, len, NEW (REF ARRAY OF CHAR, len +  
> len)^);
>     END;
>   END Escape;
>
> PROCEDURE DoEscape (nm: TEXT;  len: INTEGER;  VAR buf: ARRAY OF  
> CHAR): TEXT =
>   VAR n_escapes := 0;  src, dest: INTEGER;  c: CHAR;
>   BEGIN
>     Text.SetChars (buf, nm);
>     FOR i := 0 TO len-1 DO
>       IF (buf[i] = BackSlash) THEN INC (n_escapes); END;
>     END;
>     IF (n_escapes = 0) THEN RETURN nm; END;
>     src  := len - 1;
>     dest := src + n_escapes;
>     WHILE (src > 0) DO
>       c := buf[src];  DEC (src);
>       buf[dest] := c;  DEC (dest);
>       IF (c = BackSlash) THEN  buf[dest] := BackSlash;  DEC (dest);   
> END;
>     END;
>     RETURN Text.FromChars (SUBARRAY (buf, 0, len + n_escapes));
>   END DoEscape;
>
>
> Look at how inefficient this is.
> For a long string it makes a heap allocation, even if in the end it  
> makes no changes to the string.
> For any length string it copies it out to an array.
> Then if there are any changes, it copies it again, probably into  
> heap, likely the input immediately becoming garbage.
> Heap allocation may be fast, but building up garbage and causing  
> more work for the garbage collector I assume is bad.
> And if the string had no backslashes, it's all a big waste.
>
> I assume texts are read only?
>
> I know lots of systems have read only strings.
> There are pluses and minus to them. They can be single-instanced.
> Some systems with read only strings have another type, such as  
> "StringBuilder" or "StringBuffer".
>
>
> So -- I don't have a specific question, but maybe a mutable string  
> "class" aka "type" is needed?
> Not necessarily in the language but in m3core or libm3?
> Maybe it's already there?
>
> I just spent a few hours diddling with arrays of chars.
> Unfortunately they are not resizable.
> It was not entirely satisfactory. Besides the array I had to pass  
> around a length and a start.
> Wrapping this up in one record TYPE MutableString= ... might be a  
> good idea.
>
> For more efficient read only access, would it be reasonable for the  
> runtime to materialize on-demand 8 bit and 16 bit representations of  
> a string if a user calls some new thing like Text.GetDirectA (t:  
> TEXT) : REF ARRAY OF CHAR, Text.GetDirectW (t: TEXT) : REF ARRAY OF  
> WIDECHAR? Throw an exception if the string cannot be represented  
> with 8 bit characters? Or use utf8?
>
> Besides, I know I'm a big predictable whiner but I like how this  
> works in Windows..
> It may not have been as seamless, but it works and it really doesn't  
> tend to break or slow down existing code.
> roughly:
>   "foo" is an 8 bit string of type char* (or const char* or possibly  
> const char[4])
>   L"foo" is an 16 bit string of type wchar_t* (ditto, and aka WCHAR)
>   "L" for "long" aka "wide"
>
> Functions must be written to specifically use one or the other.
> In C++ you can templatize. There is std::string and std::wstring  
> that are template instantiations.
> Lots of C functions are duplicated. strcpy => wcscpy, strcat =>  
> wcscat, etc.
> And really there's no point in 8 bit strings.
> If you have a blob, that's an array of bytes or something, not  
> characters.
>
> It works.
>
> Utf8 is another seemingly popular route but I think it's a hack.
> I think mostly people don't touch their code and say, voila, it's  
> utf8, and they only really support the same old English or possibly  
> 8 bit characters (some European characters).
> Granted, to some extent, this does work, as long as you don't do  
> anything with the string but strlen, strcpy, and some others and  
> pass it to code that does treat it correctly. Still, variably sized  
> encodings seem like a bad idea here, and 16 bits per character seem  
> affordable enough. And yes, I know that Unicode is actually know 20  
> bits per character and some characters take two wchar_ts but I try  
> to ignore that...
> And I know there is a lot of existing code, but sometimes there is a  
> need for progress too...
>
> Ok, maybe NOW I'll look at the cygwin/shobjgen problem. :)
>
>
>  - Jay
>
>
> Helping your favorite cause is as easy as instant messaging. You IM,  
> we give. Learn more.
>
>
> Need to know the score, the latest news, or you need your Hotmail®- 
> get your "fix". Check it out.

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://m3lists.elegosoft.com/pipermail/m3devel/attachments/20080226/08cce521/attachment-0002.html>