[M3devel] text inefficiency? good mutable string type? arrays?

Tony Hosking hosking at cs.purdue.edu
Tue Feb 26 23:43:17 CET 2008


I suppose the question here is how often you exceed the 256-character  
stack-allocated buffer size?  If not often then occasionally  
allocating in the heap is not a huge problem.  I'm not too fussed  
about this anyway because generational GC will quickly reclaim things  
that die as young as this buffer does.  I think you underestimate the  
effectiveness of modern GC algorithms.  Anyway, optimizing here for  
short nm texts presumably handles the common case.  Why work hard to  
optimize the uncommon case?

On Feb 25, 2008, at 10:38 AM, Jay wrote:

> I know this area has been brewing under the surface a while.
> I'll bring it up. :)
>
> Here is some apparently typical code:
>
> PROCEDURE Escape (nm: TEXT): TEXT =
>   VAR len: INTEGER;   buf: ARRAY [0..255] OF CHAR;
>   BEGIN
>     IF (nm = NIL) THEN RETURN NIL; END;
>     len := Text.Length (nm);
>     IF (len + len <= NUMBER (buf))
>       THEN RETURN DoEscape (nm, len, buf);
>       ELSE RETURN DoEscape (nm, len, NEW (REF ARRAY OF CHAR, len +  
> len)^);
>     END;
>   END Escape;
>
> PROCEDURE DoEscape (nm: TEXT;  len: INTEGER;  VAR buf: ARRAY OF  
> CHAR): TEXT =
>   VAR n_escapes := 0;  src, dest: INTEGER;  c: CHAR;
>   BEGIN
>     Text.SetChars (buf, nm);
>     FOR i := 0 TO len-1 DO
>       IF (buf[i] = BackSlash) THEN INC (n_escapes); END;
>     END;
>     IF (n_escapes = 0) THEN RETURN nm; END;
>     src  := len - 1;
>     dest := src + n_escapes;
>     WHILE (src > 0) DO
>       c := buf[src];  DEC (src);
>       buf[dest] := c;  DEC (dest);
>       IF (c = BackSlash) THEN  buf[dest] := BackSlash;  DEC (dest);   
> END;
>     END;
>     RETURN Text.FromChars (SUBARRAY (buf, 0, len + n_escapes));
>   END DoEscape;
>
>
> Look at how inefficient this is.
> For a long string it makes a heap allocation, even if in the end it  
> makes no changes to the string.
> For any length string it copies it out to an array.
> Then if there are any changes, it copies it again, probably into  
> heap, likely the input immediately becoming garbage.
> Heap allocation may be fast, but building up garbage and causing  
> more work for the garbage collector I assume is bad.
> And if the string had no backslashes, it's all a big waste.
>
> I assume texts are read only?
>
> I know lots of systems have read only strings.
> There are pluses and minus to them. They can be single-instanced.
> Some systems with read only strings have another type, such as  
> "StringBuilder" or "StringBuffer".
>
>
> So -- I don't have a specific question, but maybe a mutable string  
> "class" aka "type" is needed?
> Not necessarily in the language but in m3core or libm3?
> Maybe it's already there?
>
> I just spent a few hours diddling with arrays of chars.
> Unfortunately they are not resizable.
> It was not entirely satisfactory. Besides the array I had to pass  
> around a length and a start.
> Wrapping this up in one record TYPE MutableString= ... might be a  
> good idea.
>
> For more efficient read only access, would it be reasonable for the  
> runtime to materialize on-demand 8 bit and 16 bit representations of  
> a string if a user calls some new thing like Text.GetDirectA (t:  
> TEXT) : REF ARRAY OF CHAR, Text.GetDirectW (t: TEXT) : REF ARRAY OF  
> WIDECHAR? Throw an exception if the string cannot be represented  
> with 8 bit characters? Or use utf8?
>
> Besides, I know I'm a big predictable whiner but I like how this  
> works in Windows..
> It may not have been as seamless, but it works and it really doesn't  
> tend to break or slow down existing code.
> roughly:
>   "foo" is an 8 bit string of type char* (or const char* or possibly  
> const char[4])
>   L"foo" is an 16 bit string of type wchar_t* (ditto, and aka WCHAR)
>   "L" for "long" aka "wide"
>
> Functions must be written to specifically use one or the other.
> In C++ you can templatize. There is std::string and std::wstring  
> that are template instantiations.
> Lots of C functions are duplicated. strcpy => wcscpy, strcat =>  
> wcscat, etc.
> And really there's no point in 8 bit strings.
> If you have a blob, that's an array of bytes or something, not  
> characters.
>
> It works.
>
> Utf8 is another seemingly popular route but I think it's a hack.
> I think mostly people don't touch their code and say, voila, it's  
> utf8, and they only really support the same old English or possibly  
> 8 bit characters (some European characters).
> Granted, to some extent, this does work, as long as you don't do  
> anything with the string but strlen, strcpy, and some others and  
> pass it to code that does treat it correctly. Still, variably sized  
> encodings seem like a bad idea here, and 16 bits per character seem  
> affordable enough. And yes, I know that Unicode is actually know 20  
> bits per character and some characters take two wchar_ts but I try  
> to ignore that...
> And I know there is a lot of existing code, but sometimes there is a  
> need for progress too...
>
> Ok, maybe NOW I'll look at the cygwin/shobjgen problem. :)
>
>
>  - Jay
>
>
> Helping your favorite cause is as easy as instant messaging. You IM,  
> we give. Learn more.

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://m3lists.elegosoft.com/pipermail/m3devel/attachments/20080226/bad6139a/attachment-0002.html>


More information about the M3devel mailing list