<html>
<head>
<style><!--
.hmmessage P
{
margin:0px;
padding:0px
}
body.hmmessage
{
font-size: 12pt;
font-family:Calibri
}
--></style></head>
<body class='hmmessage'><div dir='ltr'><div> &gt; More radically, what current code will break if CHAR is expanded to UTF-32? </div><div> &gt; The language definition would allow that (there is nothing that says BITSIZE(CHAR) == 8). </div><div><br> Philosophy: </div><br> I think you have to be careful about where you decide something <br> is an abstraction, versus where you keep things unchanged <br> because a lot of code depends on them AND they are adequate, and<br> then introduce new things for new meanings. <BR> <BR> If something is an abstraction, you need to be sure that <br> - all operations people might want to do are supported <br> - breaking through the abstraction boundary is difficult/impossible, <br> such that you maintain the ability to change the implementation later <br> without breaking things <BR> <BR> Abstractions can have value: you change the implementation <br> and existing code gains some new feature as a result, <br> such as the ability to work with Unicode.<BR><br> For example, INTEGER is abstract enough, I guess, that we can widen it. <br> It isn't clearly abstract enough that we can make overflow <br> raise exceptions, because the existing widely used implementation does not. <BR><br> The size of INTEGER is plain for its clients to see, and it is easy for them<br> to be (accidentally) dependent on a particular implementation, but we have likely<br> gotten over that by now, by having a good mix of implementations in use.<br> Eventually we will probably see that code won't work if BITSIZE(INTEGER) = 32.<BR> <BR><br> Another example is that in C++ std::vector&lt;T&gt;::iterator is very much <br> like T*. In fact, it supports an identical feature set, except that it can be a distinct <br> type that mixes only with itself and not with T*. <br> In some implementations, it was in fact T* and there was code that mixed them. <br> The implementation was later changed to be a unique type and a bunch <br> of code stopped compiling. 
This is an example where the implementation wasn't opaque <br> enough. Now presumably it is, so further changes won't cause such problems. <br> (You can convert from iterator to pointer just by "&amp;*" and std::vector is guaranteed<br> contiguous, so the breakage was trivial to fix.)<BR> <BR> In C, and I thought in Modula-3, "char" / "CHAR" means "byte". Exactly 8 bits.<br> I know there might be some weirdo Cray environments where all integer types are really 64-bit doubles, <br> but millions of lines of C/C++ code assume char is a byte. Memory is composed of chars, files <br> are composed of chars. Java and C# "fixed" this: char is 16 bits there, and there is a new type "byte" <br> or "int8" or "uint8", but for C, and I thought Modula-3, we are stuck with char == 8-bit byte and that is ok.<br> (The signedness of char remains reasonably abstract and I think most code is ok either way, but<br> I have seen code that depends on it either way.)<BR> <BR> <BR>Does X have an implied little/big endian byte order for 16-bit characters?<BR>If it is host, then we should use host.<BR>Windows uses host, which is pretty much always little (except Xbox 360, and maybe some CE targets?)<BR>We would NOT maintain two forks, swapping and not, no matter what.<BR>We would have a function "SwapWideCharToLittleEndian" or such, written in C,<BR>that would probe the host endianness and swap if needed.<BR>The probe would be something like:<BR>int is_little_endian(void) { union { char a[sizeof(int)]; int b; } c = {{1}}; return c.b == 1; }<BR> <BR><br> - Jay<br> <BR><div><hr id="stopSpelling">Date: Mon, 16 Dec 2013 17:42:41 +0100<br>From: estellnb@elstel.rivido.de<br>To: hosking@cs.purdue.edu; jay.krell@cornell.edu<br>CC: m3devel@elegosoft.com; rodney_bates@lcwb.coop<br>Subject: Re: [M3devel] cm3 does not support Scan.LongInt<br><br>
<div class="ecxmoz-cite-prefix">On 16.12.13 16:00, Tony
Hosking wrote:<br>
</div>
<blockquote cite="mid:48D9B4D9-0732-4C96-BB77-598987C22D85@cs.purdue.edu">
<div>Jumping in late to this whole conversation (please forgive
any confusion)...</div>
<div><br>
</div>
<div>I hesitate to define ANY M3 builtin type in terms of C/C++
standards.</div>
<div>Regarding WIDECHAR, realize that its definition, like CHAR,
should be in terms of an enumeration containing some (minimal)
number of elements.</div>
<div>The standard says that CHAR contains at least 256 elements.</div>
<div>In M3 enumerations all have a direct mapping to INTEGER.</div>
<div>So, I assume that WIDECHAR would be UTF-32, and TEXT could be
encoded as UTF-8.</div>
<div>More radically, what current code will break if CHAR is
expanded to UTF-32?</div>
<div>The language definition would allow that (there is nothing
that says BITSIZE(CHAR) == 8).</div>
<div><br>
</div>
</blockquote>
Well, if so, I could rewrite some code to define WCHAR as BITS 16 FOR
WIDECHAR.<br>
Perhaps that would be the way to go.<br>
However, as Rodney M. Bates has said, the current WIDECHAR is not BITS 16
FOR UCHAR.<br>
It uses LE encoding rather than host-order encoding, a fact which one
could be quite<br>
happy about when it comes to extending Trestle/X11 for widechar
support. So even that<br>
would fail when it came to interfacing with X11 (or otherwise one
would have to maintain<br>
two branches of code all the time, one that does byte swapping and
one that does not,<br>
depending on the host order AND the internally used wchar order,
which could then <br>
differ as well).<br></div> </div></body>
</html>