[M3devel] text inefficiency? good mutable string type? arrays?

Jay jayk123 at hotmail.com
Wed Feb 27 00:46:26 CET 2008


Tony, what about the need to copy? The inability to iterate through characters without a function call per character? I guess though, from what you pointed out, a read only direct pointer is viable. And that removes the 256 character stack usage -- which is a bit piggy as it is. So then..I guess the next step in some compromise here is that it should perhaps be viable to get a read only pointer to a string (maybe already the case), compute the required size of the new one (follows naturally), and then be able to split up allocation and filling in of a string, a function to allocate and get a direct pointer, then fill in your data, then "seal" the text making it read only. And I guess really you can just carefully use the "subversive" method here -- up to you to not pass the "unsealed" text off to anyone else until you are done writing to it.
 
Then just to deal with the varying representation but as I was suggesting, maybe that can be forced somehow.
Maybe creation can specify an encoding, or the "get buffer" function could -- though ideally there is just one, not two.
 
The stack vs. heap behavior is also not ideal in that it's nice for the performance of something to scale linearly with input, not suddenly slow way down when data exceeds some arbitrary threshold. Granted, a lot of code just starts failing, so merely slowing down is "progress". As well, eventually you end up swapping to disk, if you don't first run out of address space, so there will be limits at which things slow down or fail, just that they ought to be fairly large.
 
I also rather hoped/assumed that compile time text constants were formed constant-ly by the compiler, that the compiler knows their runtime layout. I realize the division of labor between compiler and runtime can be made at varying points, with varying resulting flexibility, but also varying resulting efficiency.
 
 - Jay


From: hosking at cs.purdue.eduTo: m3devel at elegosoft.comDate: Tue, 26 Feb 2008 17:43:17 -0500Subject: Re: [M3devel] text inefficiency? good mutable string type? arrays?


I suppose the question here is how often you exceed the 256-character stack-allocated buffer size?  If not often then occasionally allocating in the heap is not a huge problem.  I'm not too fussed about this anyway because generational GC will quickly reclaim things that die as young as this buffer does.  I think you underestimate the effectiveness of modern GC algorithms.  Anyway, optimizing here for short nm texts presumably handles the common case.  Why work hard to optimize the uncommon case?


On Feb 25, 2008, at 10:38 AM, Jay wrote:

I know this area has been brewing under the surface a while.I'll bring it up. :) Here is some apparently typical code: PROCEDURE Escape (nm: TEXT): TEXT =  VAR len: INTEGER;   buf: ARRAY [0..255] OF CHAR;  BEGIN    IF (nm = NIL) THEN RETURN NIL; END;    len := Text.Length (nm);    IF (len + len <= NUMBER (buf))      THEN RETURN DoEscape (nm, len, buf);      ELSE RETURN DoEscape (nm, len, NEW (REF ARRAY OF CHAR, len + len)^);    END;  END Escape;PROCEDURE DoEscape (nm: TEXT;  len: INTEGER;  VAR buf: ARRAY OF CHAR): TEXT =  VAR n_escapes := 0;  src, dest: INTEGER;  c: CHAR;  BEGIN    Text.SetChars (buf, nm);    FOR i := 0 TO len-1 DO      IF (buf[i] = BackSlash) THEN INC (n_escapes); END;    END;    IF (n_escapes = 0) THEN RETURN nm; END;    src  := len - 1;    dest := src + n_escapes;    WHILE (src > 0) DO      c := buf[src];  DEC (src);      buf[dest] := c;  DEC (dest);      IF (c = BackSlash) THEN  buf[dest] := BackSlash;  DEC (dest);  END;    END;    RETURN Text.FromChars (SUBARRAY (buf, 0, len + n_escapes));  END DoEscape;Look at how inefficient this is.For a long string it makes a heap allocation, even if in the end it makes no changes to the string.For any length string it copies it out to an array.Then if there are any changes, it copies it again, probably into heap, likely the input immediately becoming garbage.Heap allocation may be fast, but building up garbage and causing more work for the garbage collector I assume is bad.And if the string had no backslashes, it's all a big waste. I assume texts are read only? I know lots of systems have read only strings.There are pluses and minus to them. They can be single-instanced.Some systems with read only strings have another type, such as "StringBuilder" or "StringBuffer".So -- I don't have a specific question, but maybe a mutable string "class" aka "type" is needed?Not necessarily in the language but in m3core or libm3?Maybe it's already there? I just spent a few hours diddling with arrays of chars.Unfortunately they are not resizable.It was not entirely satisfactory. Besides the array I had to pass around a length and a start.Wrapping this up in one record TYPE MutableString= ... might be a good idea. For more efficient read only access, would it be reasonable for the runtime to materialize on-demand 8 bit and 16 bit representations of a string if a user calls some new thing like Text.GetDirectA (t: TEXT) : REF ARRAY OF CHAR, Text.GetDirectW (t: TEXT) : REF ARRAY OF WIDECHAR? Throw an exception if the string cannot be represented with 8 bit characters? Or use utf8? Besides, I know I'm a big predictable whiner but I like how this works in Windows..It may not have been as seamless, but it works and it really doesn't tend to break or slow down existing code.roughly:  "foo" is an 8 bit string of type char* (or const char* or possibly const char[4])   L"foo" is an 16 bit string of type wchar_t* (ditto, and aka WCHAR)   "L" for "long" aka "wide"  Functions must be written to specifically use one or the other.In C++ you can templatize. There is std::string and std::wstring that are template instantiations.Lots of C functions are duplicated. strcpy => wcscpy, strcat => wcscat, etc.And really there's no point in 8 bit strings.If you have a blob, that's an array of bytes or something, not characters.  It works. Utf8 is another seemingly popular route but I think it's a hack.I think mostly people don't touch their code and say, voila, it's utf8, and they only really support the same old English or possibly 8 bit characters (some European characters).Granted, to some extent, this does work, as long as you don't do anything with the string but strlen, strcpy, and some others and pass it to code that does treat it correctly. Still, variably sized encodings seem like a bad idea here, and 16 bits per character seem affordable enough. And yes, I know that Unicode is actually know 20 bits per character and some characters take two wchar_ts but I try to ignore that...And I know there is a lot of existing code, but sometimes there is a need for progress too... Ok, maybe NOW I'll look at the cygwin/shobjgen problem. :)   - Jay 

Helping your favorite cause is as easy as instant messaging. You IM, we give. Learn more.
_________________________________________________________________
Need to know the score, the latest news, or you need your Hotmail®-get your "fix".
http://www.msnmobilefix.com/Default.aspx
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://m3lists.elegosoft.com/pipermail/m3devel/attachments/20080226/a9fe8b73/attachment-0002.html>


More information about the M3devel mailing list