[M3devel] This disgusting TEXT business

Jay jay.krell at cornell.edu
Sun Dec 21 18:51:14 CET 2008


This is an issue with Unicode, not UTF8, right?
ie: a 16 or 20 or 32 bit encoding has the same problem, right?
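
A small sketch in Python (my choice of language, nothing in this thread uses it) suggests the answer is yes. The two spellings of u-umlaut are different code point sequences, so they differ in UTF-8, UTF-16 and UTF-32 alike, and only normalization makes them compare equal. The helper name encoded_forms_differ is made up for illustration. Note that in UTF-8 the precomposed form takes two bytes and the decomposed form three; the "1" and "2" in the message quoted below match the number of code points, not bytes.

```python
import unicodedata

# Two spellings of the same user-visible character.
composed = "\u00fc"     # U+00FC LATIN SMALL LETTER U WITH DIAERESIS (precomposed)
decomposed = "u\u0308"  # U+0075 followed by U+0308 COMBINING DIAERESIS


def encoded_forms_differ(a, b, encodings=("utf-8", "utf-16-le", "utf-32-le")):
    """Hypothetical helper: report, per encoding, whether a and b encode differently."""
    return {enc: a.encode(enc) != b.encode(enc) for enc in encodings}


# The mismatch shows up in every encoding, because it exists at the
# code point level, before any bytes are produced.
differs = encoded_forms_differ(composed, decomposed)

# Normalizing to a single form (NFC here) removes the difference.
same_after_nfc = unicodedata.normalize("NFC", decomposed) == composed

# Byte lengths in UTF-8 for the two spellings.
utf8_lengths = (len(composed.encode("utf-8")), len(decomposed.encode("utf-8")))
```

So a 16-bit or 32-bit encoding has exactly the same problem; widening the code unit does not merge the two spellings.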
 
 
search the web for "unicode precomposed":
 
  http://en.wikipedia.org/wiki/Precomposed_character
  http://wikisource.org/wiki/Unicode_precomposed_characters
 
or "unicode precomposed apple":
 
  http://developer.apple.com/jp/qa/qa2001/qa1235.html  
  http://developer.apple.com/qa/qa2001/qa1235.html 
 
 
"When working within Mac OS you will find yourself using a mixture of precomposed and decomposed Unicode. For example, HFS Plus converts all file names to decomposed Unicode, while Macintosh keyboards generally produce precomposed Unicode. This isn't a problem as long as you use system-provided APIs to process text. Apple's APIs correctly handle both precomposed and decomposed Unicode.
However, you may need to convert to precomposed Unicode when you interact with other platforms. For example, the following are all valid reasons why you might want to convert to precomposed Unicode.
  - If you implement a network protocol which is defined to use precomposed Unicode.
  - When creating a cross-platform file (or volume) whose specification dictates precomposed Unicode.
  - If you incorporate a large body of cross-platform code into your application, where that code is expecting precomposed Unicode."
 
  http://www.unicode.org/unicode/reports/tr15/index.html  
  I need to read these..  
 
  - Jay

> Date: Sun, 21 Dec 2008 15:40:15 +0000
> From: stsp at elego.de
> To: darko at darko.org
> CC: m3devel at elegosoft.com
> Subject: Re: [M3devel] This disgusting TEXT business
>
> On Sun, Dec 21, 2008 at 08:08:57AM +0900, Darko wrote:
> > The right way to do this, IMNSHO is to not assume any particular
> > representation of TEXT values and create an implementation interface
> > that allows implementations of multiple text representations, much like
> > Rd and Wr don't make many assumptions about how data is actually stored
> > or retrieved.
>
> Such an interface may be needed for UTF-8 alone already, anyway,
> because within UTF-8 there is in some cases more than one way
> to store what amounts to the same data to a human user.
>
> In Subversion, from the beginning everyone agreed that the internal
> encoding for all strings would be UTF-8. Most Subversion APIs expect
> data in UTF-8. Strings (e.g. filenames) in the repository are stored
> in UTF-8, etc. Great! Will work in all countries! Right?
>
> Yes, but not on all operating systems if you're not careful!
> It did not occur to anyone at the time that there are characters
> which in UTF-8 have more than one representation (codepoints) in a
> byte stream. For example, an u with umlaut can be encoded as two
> bytes or a single byte:
>
> 2 bytes: [u | the previous character has an umlaut ]
> This is called "normal form decomposed".
>
> 1 byte [u umlaut] (i.e. ü if you can see this on your terminal :)
> This is called "normal form composed".
>
> If you want to be portable, as CM3 and Subversion both try to be,
> you have to consider that some operating systems may return your
> filenames in a different encoding than you stored it in:
>
> --------
>            Accepts   Gives back
> MacOS X    *         NFD(*)
> Linux      *         <input>
> Windows    *         <input>
> Others     ?         ?
>
>
> *) There are some remarks to be made regarding full or partial
> NFD here, but the essential thing is: If you send in NFC, don't
> expect it back!
> -------- quoted from:
> http://svn.collab.net/repos/svn/trunk/notes/unicode-composition-for-filenames
> which is worth a read for more details if you're interested.
>
> In Subversion, this is a real problem for Mac users, because
> two filenames which only differ in their NFC/NFD encoding
> look exactly the same to the user (an u umlaut is printed),
> while the byte streams do not match ("We're sorry, but your
> file x does not exist in the repository!", where x looks just
> like a file that is clearly visible in the repository listing :)
>
> Subversion's problem now is that there are repositories out
> there using filenames in either NFC, NFD, or mixed, and there
> is no good way to reconcile the mess while staying backwards
> compatible with existing clients, servers, working copies and
> repositories. So Mac users are told to only use ASCII characters
> in their filenames to prevent the problem (many users, especially
> users who are not programmers, who store their photos or their
> entire home directory or whatever in Subversion, are not happy
> about this).
>
> This problem may not matter as much in case of CM3, but anyone
> implementing UTF-8 support for CM3 should be aware of this issue
> and not repeat the mistake the Subversion developers made at the
> time! With UTF-8, do not rely on a filename to retain its encoding
> as you passed it to the OS when requesting the filename from the
> OS again.
>
> CM3 should pick either NFD or NFC as internal UTF-8 encoding, for
> filenames only, or for all strings, whichever makes more sense.
> And then stick to it, converting input/output as needed.
>
> Abstracting this problem away using a nice interface would probably
> be the cleanest solution.
>
> Stefan
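
The "pick one form and stick to it" advice above can be sketched roughly as follows. This is only an illustration in Python, not how CM3 or Subversion do it; the function names canonical_name, same_filename and find_entry are hypothetical, and NFC as the internal form is an arbitrary assumption (NFD would work just as well, as long as it is used consistently).

```python
import unicodedata

INTERNAL_FORM = "NFC"  # assumption: any one form works if applied everywhere


def canonical_name(name, form=INTERNAL_FORM):
    """Convert a filename to the single internal normalization form."""
    return unicodedata.normalize(form, name)


def same_filename(a, b):
    """Compare two filenames after normalizing both, not byte for byte."""
    return canonical_name(a) == canonical_name(b)


def find_entry(requested, listing):
    """Return the listing entry matching `requested` up to normalization, or None."""
    wanted = canonical_name(requested)
    for entry in listing:
        if canonical_name(entry) == wanted:
            return entry
    return None


# A name stored in NFC, and the same name as an NFD-returning
# file system might hand it back.
stored = "M\u00fcller.txt"
returned = "Mu\u0308ller.txt"

naive_match = stored == returned             # raw comparison fails
normalized_match = same_filename(stored, returned)
found = find_entry(returned, [stored, "other.txt"])
```

The point of the sketch is only that every name gets normalized at the boundary, on the way in and on the way out, so a raw comparison of what the OS returns against what was stored never happens.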