[M3devel] 64bit INTEGERs, WIDECHAR: language specified or configuration/target dependent?

Elmar Stellnberger estellnb at elstel.org
Wed May 27 17:20:35 CEST 2015


> We use things with varying levels of abstraction. & concrete details are filled in at varying stages. C "long" is NOT specified as having an exact size and does NOT necessarily hold a pointer, and this is ok.
> C99 is pretty complete, messy.
> For each of 8, 16, 32, 64,
> It gives "least", "fast", and "exact" types. And pointer sized. Signed and unsigned. 32 types!


* long in C is not guaranteed to hold a pointer by the language definition
True, it is not in the language definition (I believe the C language definition is simply too old to have anticipated all the developments from 8/16-bit up to 64-bit machines),
but it has evolved into a de facto standard at least on all x86 up to x86_64 targets, as far as I know.
Jay, do you know of a single arch / target where sizeof(long) != sizeof(void*)?
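For anyone who wants to verify that assumption on a concrete toolchain, here is a minimal sketch (C11; the file name and the use of static_assert rather than a runtime check are just illustrative, nothing any tool relies on):

    /* check_long_vs_pointer.c -- does long hold a pointer on this target? */
    #include <assert.h>   /* static_assert (C11) */
    #include <stdio.h>

    /* Compilation fails wherever the de facto rule does not hold. */
    static_assert(sizeof(long) == sizeof(void *),
                  "sizeof(long) != sizeof(void*) on this target");

    int main(void)
    {
        printf("sizeof(long)  = %zu\n", sizeof(long));
        printf("sizeof(void*) = %zu\n", sizeof(void *));
        return 0;
    }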


> 
> Whether types like integer have a language-specified or target-dependent
> range is a tough language design question.  I have tended to favor a
> fixed, language-specified range, but there are pros and cons.  I do
> think all the time about end-of-range cases and native word size dependencies.
> It takes a great deal of care, and I know of no way to design a language
> that doesn't, to some extent, trade one set of problems for another.
> Signed/unsigned creates similar language dilemmas.

* language-specified or target-dependent
As I have already suggested, I believe there is some justification for having
both kinds of type: a language-specified type and a target-dependent one.

The advantages of a language-specified type are rather clear when it
comes to plain INTEGERs for numeric/counting/calculation purposes: once
the size is fixed, you can rely on a certain value range being supported.
Using the native word size, on the other hand, should yield better performance,
at least in the general case. If the question were strictly either/or, I would
clearly favor language-specified types: rewriting certain parts of a program to
gain more performance on a particular arch is not as bad as a broken program.
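As a concrete illustration of getting both kinds of type at once, the C99 <stdint.h> families Jay lists above can be sketched like this (a toy program; the sizes printed are whatever the target happens to choose, and nothing here is specific to Modula-3):

    /* stdint_flavours.c -- "exact", "least", "fast" and pointer-sized, per C99 */
    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        int32_t       exact;     /* exactly 32 bits: language-specified range */
        int_least32_t least;     /* smallest type with at least 32 bits       */
        int_fast32_t  fast;      /* "fastest" type with at least 32 bits      */
        intptr_t      ptrsized;  /* wide enough to hold a pointer (optional)  */

        printf("exact=%zu least=%zu fast=%zu ptr=%zu (bytes)\n",
               sizeof exact, sizeof least, sizeof fast, sizeof ptrsized);
        return 0;
    }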

However, if we look into the details of at least the 8, 16, 32 and 64-bit
arches, the picture is more differentiated:

When the address range was extended from 8 to 16 bit and from 16 to 32 bit,
automatically extending the value ranges of integers along with it was genuinely wanted.
65536 (16 bit) is not that high a limit, and we would even have liked to
extend most array indices to 32 bit: we gained something with the value
range, while using whole words was also somewhat faster.
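Just to make the limit concrete, here is a tiny sketch of how little headroom a 16-bit index gives you (plain C for illustration; the behaviour shown is the well-defined unsigned wraparound):

    /* index16.c -- a 16-bit array index tops out at 65535 and then wraps */
    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        uint16_t i = 65535;               /* largest possible 16-bit index */
        i++;                              /* wraps around to 0             */
        printf("65535 + 1 as uint16_t = %u\n", (unsigned)i);  /* prints 0 */
        return 0;
    }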

However, things turned out very differently when the extension from
32 to 64 bit was at stake. For perhaps 95% of all application purposes there
was no further gain in automatically extending the value ranges from 2^32
to 2^64, while the memory hierarchy has increasingly become a bottleneck
in recent times. Additionally, doubling the size of all integers would
initially have roughly doubled our memory needs, which would have been a
potential drawback for introducing the AMD64 arch. Just think of a machine
with 4GB of RAM: it cannot be fully addressed with 32 bits (in practice only
~3GB can), while making all INTs 64 bit would have shrunk the effectively
usable memory to something like 2GB. Luckily the decision was taken
differently this time:

* keep all ints at most 32 bit and just extend pointers to 64 bit
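To put a rough number on the footprint argument, here is a sketch assuming the LP64 layout that x86_64 Unix systems actually adopted (the struct is purely hypothetical); keeping the counters at 32 bit while only the pointer grows leaves the record noticeably smaller than an all-64-bit version:

    /* footprint.c -- widening everything vs. widening only the pointers (LP64 assumed) */
    #include <stdint.h>
    #include <stdio.h>

    struct node32 {                 /* ints stay 32 bit, only the pointer is 64 bit */
        int32_t key, count, flags, pad;
        struct node32 *next;
    };                              /* typically 24 bytes under LP64 */

    struct node64 {                 /* every integer widened to 64 bit */
        int64_t key, count, flags, pad;
        struct node64 *next;
    };                              /* typically 40 bytes under LP64 */

    int main(void)
    {
        printf("int=%zu long=%zu void*=%zu\n",
               sizeof(int), sizeof(long), sizeof(void *));
        printf("node32=%zu bytes, node64=%zu bytes\n",
               sizeof(struct node32), sizeof(struct node64));
        return 0;
    }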
 
Elmar

P.S.:  My next email will start a discussion of all these issues.




On 22.05.2015 at 19:55, Rodney M. Bates wrote:

> This is a worthwhile discussion, but it has very little to do with using
> llvm as a back end.  In the llvm-IR, sizes of integers, pointers, etc.
> are constant numbers.  The frontend, whether Clang, CM3, or any other, makes
> the decisions about mapping language types like long, INTEGER, pointers,
> etc. to a size, target-dependently or otherwise, according to the language's
> rules.  LLvm does not make these decisions.  Its target dependencies are
> mostly in the line of different code generators for different instruction sets.
> 
> On 05/22/2015 05:53 AM, Elmar Stellnberger wrote:
>> 
>> On 22.05.2015 at 12:16, dirk muysers wrote:
>> 
>>> >> What about the said platform dependencies you have discovered?
>>> Not me (I never seriously considered using it), but many people on the llvm
>>> forums pointed to the fact. One example among
>>> many:
>>> 
>>> Does your C code ever use the 'long' type? If so, the LLVM IR will be
>>> different depending on whether it's targeting linux-32 or linux-64. Do
>>> you ever use size_t? Same problem. Do you ever use a union containing
>>> both pointers and integers? See above. In principle, it's possible to
>>> write platform-independent IR, or even C code that compiles to
>>> platform-independent IR. In practice, especially if you include any
>>> system headers, it's remarkably hard.
>>> (Jeffrey Yasskin jyasskin at google.com)
>> 
>> As for me, I am a very conscientious programmer when it comes to
>> distinguishing between long, long long and int. I only use long if my
>> code requires a data item to be exactly as large as a pointer (in special
>> cases also to tap the power of 64-bit machines, e.g. as a 32/64-bit base
>> type for arbitrary-length integers; however not without taking special
>> provisions to handle the difference in data size). Usually, aligning the
>> pointers at the beginning of the next structure would also solve such an
>> issue when it comes to reusing existing code where data sizes may not be
>> changed from long to either int or long long without special consideration.
>> Those who use glib additionally have g[u]int32/64, which they can use
>> instead of int / long long, though that should at least never make a
>> difference on Intel x86 based systems. So when it comes to using int or
>> long long, I mostly rely on them being either 32 or 64 bit.
>> I know that most programmers do not care and just always use long, which
>> I consider a particularly bad practice. Even in the Linux kernel they
>> have declared "typedef long time_t" instead of "typedef long long time_t",
>> which will create a Y2K-style mess all over again in 2038 for all 32-bit
>> machines still in use then. A rather bad decision which will need to be
>> changed sooner or later, even without llvm.
>> 
>> Now let us think of Modula-3. I believe cm3 had a long type the last time
>> I looked at it. However, an equivalent of long long which also exists on
>> 32-bit platforms would be an absolute requirement in order not to break
>> things for llvm! Many thanks for notifying us about this issue, Dirk.
>> 
> 
> Whether types like integer have a language-specified or target-dependent
> range is a tough language design question.  I have tended to favor a
> fixed, language-specified range, but there are pros and cons.  I do
> think all the time about end-of-range cases and native word size dependencies.
> It takes a great deal of care, and I know of no way to design a language
> that doesn't, to some extent, trade one set of problems for another.
> Signed/unsigned creates similar language dilemmas.
> 
>> As far as I can see, a Modula-3 programmer will need a good grounding in
>> portable programming anyway, as we did not even uphold a guarantee for
>> WIDECHAR to be either 16 or 32 bit.
>> 
> 
> The evolving nature of first UCS and then Unicode standards has left
> many language designers knocked off balance.  Critical Mass first
> introduced WIDECHAR as 16-bit when that was what everybody thought
> was enough.  Then things changed, and it wasn't anymore.  Right now,
> it's a configuration parameter (must be the same for the entire link
> closure) in Modula-3.  I personally favor making it full Unicode
> by default, in the next release, as this is where the world is now.
> This is hopefully a simpler problem than INTEGER, etc., because, as of
> now, the Unicode committee has emphatically assured us that the range will
> *never* increase.  We can hope.




On 23.05.2015 at 03:58, Jay wrote:

> We use things with varying levels of abstraction. & concrete details are filled in at varying stages. C "long" is NOT specified as having an exact size and does NOT necessarily hold a pointer, and this is ok.
> 
> 
> C99 is pretty complete, messy.
> For each of 8, 16, 32, 64,
> It gives "least", "fast", and "exact" types. And pointer sized. Signed and unsigned. 32 types!
> 
> 
> INTEGER is always pointer sized.
> LONGINT is on all platforms.
> 
> 
> C# has this feature too. IntPtr & UIntPtr. But C# is dubious -- array indexes are always 32 bit signed. 
> 
> 
> - Jay








