[M3devel] This disgusting TEXT business

Sun Dec 21 16:40:15 CET 2008

On Sun, Dec 21, 2008 at 08:08:57AM +0900, Darko wrote:
> The right way to do this, IMNSHO is to not assume any particular  
> representation of TEXT values and create an implementation interface  
> that allows implementations of multiple text representations, much like 
> Rd and Wr don't make many assumptions about how data is actually stored 
> or retrieved.

Such an interface may be needed for UTF-8 alone anyway, because
within UTF-8 there is in some cases more than one way to store
what looks like the same data to a human user.

In Subversion, from the beginning everyone agreed that the internal
encoding for all strings would be UTF-8. Most Subversion APIs expect
data in UTF-8. Strings (e.g. filenames) in the repository are stored
in UTF-8, etc. Great! Will work in all countries! Right?

Yes, but not on all operating systems if you're not careful!
It did not occur to anyone at the time that there are characters
which in Unicode have more than one representation (sequence of
code points), and therefore more than one UTF-8 byte sequence. For
example, a u with umlaut can be encoded as two code points or as a
single code point:

  2 code points: [u | the previous character has an umlaut ]
  (U+0075 U+0308, which is 3 bytes in UTF-8: 75 CC 88)
  This is called "normal form decomposed" (NFD).

  1 code point: [u umlaut] (i.e. ü if you can see this on your terminal :)
  (U+00FC, which is 2 bytes in UTF-8: C3 BC)
  This is called "normal form composed" (NFC).
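
To make the two forms concrete, here is a small sketch. Python and its
standard unicodedata module are only my choice for illustration (this
is not anything CM3 provides), and describe() is a made-up helper name.
It lists the code points and the UTF-8 bytes of a string:

```python
import unicodedata

def describe(s):
    """Return (code points, UTF-8 bytes in hex) for the string s."""
    code_points = ["U+%04X" % ord(c) for c in s]
    utf8_hex = " ".join("%02X" % b for b in s.encode("utf-8"))
    return code_points, utf8_hex

# Start from one form and ask for the other one.
composed = unicodedata.normalize("NFC", "u\u0308")    # u + combining diaeresis
decomposed = unicodedata.normalize("NFD", "\u00fc")   # precomposed u umlaut

print(describe(composed))
print(describe(decomposed))
```

Both strings render as the same character, but they differ in code
point count as well as in byte count.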

If you want to be portable, as CM3 and Subversion both try to be,
you have to consider that some operating systems may return your
filenames in a different encoding than you stored them in:

--------
          Accepts   Gives back
MacOS X     *          NFD(*)
Linux       *        <input>
Windows     *        <input>
Others      ?           ?

*) There are some remarks to be made regarding full or partial
  NFD here, but the essential thing is: If you send in NFC, don't
  expect it back!
-------- quoted from:
http://svn.collab.net/repos/svn/trunk/notes/unicode-composition-for-filenames
which is worth a read for more details if you're interested.

In Subversion, this is a real problem for Mac users, because
two filenames which only differ in their NFC/NFD encoding
look exactly the same to the user (a u umlaut is printed),
while the byte streams do not match ("We're sorry, but your
file x does not exist in the repository!", where x looks just
like a file that is clearly visible in the repository listing :)
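
A rough sketch of that failure, again in Python purely for
illustration. The filename and the same_name() helper are hypothetical;
the point is only that a plain comparison fails while a comparison
after normalizing both sides succeeds:

```python
import unicodedata

def same_name(a, b, form="NFC"):
    """Compare two names after bringing both into one normalization form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)

stored = "M\u00fcller.txt"      # NFC, e.g. as committed from Linux
returned = "Mu\u0308ller.txt"   # NFD, e.g. as a Mac might hand it back

print(stored == returned)            # naive comparison
print(same_name(stored, returned))   # comparison after normalization
```

The first comparison is what produces the "file does not exist"
surprise; the second is what the user actually expects.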

Subversion's problem now is that there are repositories out
there using filenames in either NFC, NFD, or mixed, and there
is no good way to reconcile the mess while staying backwards
compatible with existing clients, servers, working copies and
repositories. So Mac users are told to only use ASCII characters
in their filenames to prevent the problem (many users, especially
users who are not programmers, who store their photos or their
entire home directory or whatever in Subversion, are not happy
about this).

This problem may not matter as much in case of CM3, but anyone
implementing UTF-8 support for CM3 should be aware of this issue
and not repeat the mistake the Subversion developers made at the
time! With UTF-8, do not rely on a filename coming back from the
OS in the same normalization form in which you passed it in.

CM3 should pick either NFD or NFC as internal UTF-8 encoding, for
filenames only, or for all strings, whichever makes more sense.
And then stick to it, converting input/output as needed.
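
One possible shape for "converting input/output as needed", sketched
in Python under my own assumptions: NFC is picked as the internal form
arbitrarily, and to_internal(), index_listing() and lookup() are
hypothetical names, not an existing CM3 or Subversion API. The idea is
to key everything on the internal form, but remember the raw name the
OS reported so that it can be handed back to the OS unchanged:

```python
import unicodedata

INTERNAL_FORM = "NFC"  # assumption: pick one form and never mix

def to_internal(name):
    """Convert any incoming name to the internal normalization form."""
    return unicodedata.normalize(INTERNAL_FORM, name)

def index_listing(raw_names):
    """Map internal-form names to the raw names the OS reported."""
    index = {}
    for raw in raw_names:
        # Keep the first raw spelling seen for each internal name.
        index.setdefault(to_internal(raw), raw)
    return index

def lookup(index, wanted):
    """Find the raw OS name for a wanted name, whatever form it is in."""
    return index.get(to_internal(wanted))

# A directory listing as an NFD-returning OS might report it.
listing = ["Mu\u0308ller.txt", "README"]
index = index_listing(listing)
print(lookup(index, "M\u00fcller.txt") is not None)
```

Whether two raw names that collapse to the same internal name should
be an error is a policy question this sketch does not answer.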

Abstracting this problem away using a nice interface would probably
be the cleanest solution.

Stefan