[M3devel] text inefficiency? good mutable string type? arrays?

Jay jayk123 at hotmail.com
Mon Feb 25 16:38:38 CET 2008


I know this area has been brewing under the surface a while.
I'll bring it up. :)
 
Here is some apparently typical code:
 
PROCEDURE Escape (nm: TEXT): TEXT =  VAR len: INTEGER;   buf: ARRAY [0..255] OF CHAR;  BEGIN    IF (nm = NIL) THEN RETURN NIL; END;    len := Text.Length (nm);    IF (len + len <= NUMBER (buf))      THEN RETURN DoEscape (nm, len, buf);      ELSE RETURN DoEscape (nm, len, NEW (REF ARRAY OF CHAR, len + len)^);    END;  END Escape;
PROCEDURE DoEscape (nm: TEXT;  len: INTEGER;  VAR buf: ARRAY OF CHAR): TEXT =  VAR n_escapes := 0;  src, dest: INTEGER;  c: CHAR;  BEGIN    Text.SetChars (buf, nm);    FOR i := 0 TO len-1 DO      IF (buf[i] = BackSlash) THEN INC (n_escapes); END;    END;    IF (n_escapes = 0) THEN RETURN nm; END;    src  := len - 1;    dest := src + n_escapes;    WHILE (src > 0) DO      c := buf[src];  DEC (src);      buf[dest] := c;  DEC (dest);      IF (c = BackSlash) THEN  buf[dest] := BackSlash;  DEC (dest);  END;    END;    RETURN Text.FromChars (SUBARRAY (buf, 0, len + n_escapes));  END DoEscape;Look at how inefficient this is.
For a long string it makes a heap allocation, even if in the end it makes no changes to the string.
For any length string it copies it out to an array.
Then if there are any changes, it copies it again, probably into heap, likely the input immediately becoming garbage.
Heap allocation may be fast, but building up garbage and causing more work for the garbage collector I assume is bad.
And if the string had no backslashes, it's all a big waste.
 
I assume texts are read only?
 
I know lots of systems have read only strings.
There are pluses and minus to them. They can be single-instanced.
Some systems with read only strings have another type, such as "StringBuilder" or "StringBuffer".
So -- I don't have a specific question, but maybe a mutable string "class" aka "type" is needed?Not necessarily in the language but in m3core or libm3?
Maybe it's already there?
 
I just spent a few hours diddling with arrays of chars.
Unfortunately they are not resizable.
It was not entirely satisfactory. Besides the array I had to pass around a length and a start.
Wrapping this up in one record TYPE MutableString= ... might be a good idea.
 
For more efficient read only access, would it be reasonable for the runtime to materialize on-demand 8 bit and 16 bit representations of a string if a user calls some new thing like Text.GetDirectA (t: TEXT) : REF ARRAY OF CHAR, Text.GetDirectW (t: TEXT) : REF ARRAY OF WIDECHAR? Throw an exception if the string cannot be represented with 8 bit characters? Or use utf8?
 
Besides, I know I'm a big predictable whiner but I like how this works in Windows..
It may not have been as seamless, but it works and it really doesn't tend to break or slow down existing code.
roughly:
  "foo" is an 8 bit string of type char* (or const char* or possibly const char[4]) 
  L"foo" is an 16 bit string of type wchar_t* (ditto, and aka WCHAR) 
  "L" for "long" aka "wide" 
 
Functions must be written to specifically use one or the other.
In C++ you can templatize. There is std::string and std::wstring that are template instantiations.
Lots of C functions are duplicated. strcpy => wcscpy, strcat => wcscat, etc.
And really there's no point in 8 bit strings.
If you have a blob, that's an array of bytes or something, not characters. 
 
It works.
 
Utf8 is another seemingly popular route but I think it's a hack.
I think mostly people don't touch their code and say, voila, it's utf8, and they only really support the same old English or possibly 8 bit characters (some European characters).
Granted, to some extent, this does work, as long as you don't do anything with the string but strlen, strcpy, and some others and pass it to code that does treat it correctly. Still, variably sized encodings seem like a bad idea here, and 16 bits per character seem affordable enough. And yes, I know that Unicode is actually know 20 bits per character and some characters take two wchar_ts but I try to ignore that...
And I know there is a lot of existing code, but sometimes there is a need for progress too...
 
Ok, maybe NOW I'll look at the cygwin/shobjgen problem. :)
 
 
 - Jay
 
_________________________________________________________________
Helping your favorite cause is as easy as instant messaging. You IM, we give.
http://im.live.com/Messenger/IM/Home/?source=text_hotmail_join
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://m3lists.elegosoft.com/pipermail/m3devel/attachments/20080225/0af67b8f/attachment-0002.html>


More information about the M3devel mailing list