[M3devel] proposal/insistence for fixed size integer types in Ctypes.i3

Sun Jun 1 04:32:27 CEST 2008

Currently the various Utypes.i3 files introduce types like

  uint8_t = unsigned_char;
  uint16_t = unsigned_short;
  uint32_t = unsigned_int;
  uint64_t = unsigned_long_long;

  int8_t = signed_char;
  int16_t = short;
  int32_t = int;
  int64_t = long_long;

Sometimes there is an underscore after the u (u_int8_t and so on).

There is quite some variation in which, if any, of these types are provided.
When they are provided, they are always the same, with one exception I will detail.
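
For comparison, the same correspondences can be checked from the C side. This is only a sketch, assuming a C99 or later compiler with <stdint.h> and a mainstream ABI where char/short/int/long long are 8/16/32/64 bits; the function name ctypes_mapping_holds is hypothetical.

```c
#include <assert.h>   /* assert, for callers that want to check the result */
#include <limits.h>   /* CHAR_BIT */
#include <stdint.h>   /* uint8_t ... int64_t */

/* Returns 1 when each fixed-size type has the same size as the C type
   that the Utypes.i3 definitions above pair it with, 0 otherwise.
   Assumption: this holds on common 32 bit and 64 bit platforms, but
   the C standard itself does not promise it. */
int ctypes_mapping_holds(void)
{
    return CHAR_BIT == 8
        && sizeof(uint8_t)  == sizeof(unsigned char)
        && sizeof(uint16_t) == sizeof(unsigned short)
        && sizeof(uint32_t) == sizeof(unsigned int)
        && sizeof(uint64_t) == sizeof(unsigned long long)
        && sizeof(int8_t)   == sizeof(signed char)
        && sizeof(int16_t)  == sizeof(short)
        && sizeof(int32_t)  == sizeof(int)
        && sizeof(int64_t)  == sizeof(long long);
}
```

A platform where this returns 0 would need different right-hand sides in Ctypes.i3, which is the case for keeping the definitions in one place.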

Arguably they are provided only for defining other types and function signatures
within m3-libs/m3core/src/unix.

I strongly strongly strongly propose that at least the above 8 types go in
Ctypes, and the definitions in Utypes be removed.

If there was more commonality in Utypes, I'd "forward" them for compatibility,
but there is little commonality. Code depending on these types would have to
be forked a lot. As I said, the types are always the same, if they are defined,
but they are often not defined.

One variation I am open to is introducing a new .i3 file.
But in general I like to colocate stuff rather than pick apart everything
and decide an ideal location. There are tradeoffs either way,
though most people only see the tradeoffs in the way I do it.
The tradeoffs the other way are having to track down module after module,
interface after interface, where to get stuff from, rather than having
a "one stop shop", or "fewer shops to stop".

I am also willing to have u_* types and CAPITALIZED types:

  uint8_t = unsigned_char;
  uint16_t = unsigned_short;
  uint32_t = unsigned_int;
  uint64_t = unsigned_long_long;

  int8_t = signed_char;
  int16_t = short;
  int32_t = int;
  int64_t = long_long;

  u_int8_t = uint8_t;
  u_int16_t = uint16_t;
  u_int32_t = uint32_t;
  u_int64_t = uint64_t;

  UINT8 = uint8_t;
  UINT16 = uint16_t;
  UINT32 = uint32_t;
  UINT64 = uint64_t;

  INT8 = int8_t;
  INT16 = int16_t;
  INT32 = int32_t;
  INT64 = int64_t;

All built-in Modula-3 types are capitalized, as all Modula-3 keywords are.
And capitalized type names are a style widely used in the Windows headers.
(Windows and Modula-3 share a common heritage -- Digital -- though I don't know
from where the style of capitalized types originates.)

The names "int8", "int16" are also obvious candidates, but I feel that some
amount of typographical convention should be used to demark types.
Some amount of "Hungarian", if you will.
Obviously there are vehement opposing opinions on this.
"Hungarian" is often too precise and precludes changing types without
changing names, as well as producing unpronounceable names.
A "weak" form however seems reasonable and useful.

These types represent a certain point of view.
It is a common point of view, but not universal.

There are roughly three or four perspectives here:

1)
 char, short, int, long are abstractly defined and all code should live with it.
 char is at least 8 bits, and of unspecified signedness
  (limits.h defines CHAR_BIT, the number of bits in char;
   for specified signedness, use signed char or unsigned char;
   I think char actually has three options for its signedness -- signed, unsigned, or "half unsigned")
 short is at least 16 bits, signed
 int is at least 16 bits, signed
 long is at least 32 bits, signed

 There are not necessarily integral types that can hold pointers.
 size_t and ptrdiff_t perhaps, but unclear.
 size_t can hold the size of anything, but I think "anything" is "any variable"
   and not necessarily "the entire address space".

 ptrdiff_t can hold the result of subtracting pointers, but it is only
  valid to subtract pointers that point into the same array or just past it.

   It is common, for example, but not universal, for the "address space"
   to be divided between "user mode" and "kernel mode", often with a 50/50 split,
   so therefore size_t could be one bit smaller than a pointer, at least.
   Of course that's an "unnatural" size, but theoretically possible.
   (This kernel/user 50/50 split is usually exactly how 32 bit and I assume
   64 bit Windows works, though 32 bit Windows can also have a 3 gig / 1 gig split,
   and 32 bit Windows code running on 64 bit Windows kernel can get a
   full 4 gig address space.)

 As well, the representation of signed integers is left unspecified.
  The range of "int" need only go down to -32767, not necessarily -32768.
  Signed magnitude and one's complement are valid representations.
  Overflowing a signed integer causes undefined behavior.
  Unsigned numbers do not have this latitude: their arithmetic wraps modulo 2^n.

 While this is the "most correct" view, according to my understanding of the C standard,
   implementations do nail down details way beyond this, and a lot of
   code depends on these details.

 While I may have some of those details slightly wrong, you get the point.
 You CAN write code within this interface, but a lot of code violates it, sometimes
   by accident, sometimes for important practical reasons.
 Some amount of code assumes an int is at least or exactly 32 bits.
 Some amount of code assumes int or long can hold a pointer, though
  int probably not so much, and long in a rapidly shrinking
  proportion of code, due to Win64.
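
The guarantees in perspective 1 can be written down directly. The following is a sketch, assuming a C11 compiler (for static_assert); the function names wrap_increment and checked_add are hypothetical. It checks the minimum ranges at compile time, shows that unsigned arithmetic wraps, and tests for signed overflow before it can happen.

```c
#include <assert.h>   /* static_assert (C11) */
#include <limits.h>   /* CHAR_BIT, SHRT_MAX, INT_MAX, INT_MIN, LONG_MAX */

/* Minimum ranges that every conforming implementation must provide. */
static_assert(CHAR_BIT >= 8, "char has fewer than 8 bits");
static_assert(SHRT_MAX >= 32767, "short is narrower than 16 bits");
static_assert(INT_MAX >= 32767 && INT_MIN <= -32767, "int is narrower than 16 bits");
static_assert(LONG_MAX >= 2147483647L, "long is narrower than 32 bits");

/* Unsigned arithmetic is defined to wrap modulo 2^n, so this is always safe. */
unsigned int wrap_increment(unsigned int x)
{
    return x + 1u;
}

/* Signed overflow is undefined behavior, so the range is tested before
   the addition rather than after. Returns 1 and stores the sum on
   success, 0 if the sum would not fit in an int. */
int checked_add(int a, int b, int *sum)
{
    if ((b > 0 && a > INT_MAX - b) || (b < 0 && a < INT_MIN - b))
        return 0;
    *sum = a + b;
    return 1;
}
```

wrap_increment(UINT_MAX) yields 0 on every implementation, while checked_add(INT_MAX, 1, &s) reports failure instead of overflowing.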

2)
 char, short, int, long are somewhat abstractly defined
 char is exactly 8 bits
   varying perspectives on its presumed signedness
 short is exactly 16 bits
 int is exactly 32 bits
 long there are a few perspectives on: it is exactly 32 bits ("Windows"), or
   it is exactly the size of a pointer ("Unix"), or it is at least
   the size of a pointer

 As well, two's complement is the only representation of signed numbers
   in use, and code depends on this.

 (I recently read that we can thank the IBM S/360 or such, in the 1960's,
  for introducing such modern-day architectural features that everyone
  takes for granted as an 8 bit byte and two's complement signed numbers.)

If you need an integer with a particular exact size, either use char/short/int directly,
or run them through "autoconf", or sniff "limits.h".
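
Sniffing limits.h usually looks something like the sketch below. The typedef names my_uint16, my_int32 and my_uint32 and the function my_uint32_bits are hypothetical, and the static_assert lines assume a C11 compiler; older code used a negative-array-size trick for the same check.

```c
#include <assert.h>   /* static_assert (C11) */
#include <limits.h>   /* USHRT_MAX, UINT_MAX, ULONG_MAX, CHAR_BIT */

/* Pick a 16 bit unsigned type from the maxima the implementation reports. */
#if USHRT_MAX == 0xFFFF
typedef unsigned short my_uint16;
#elif UINT_MAX == 0xFFFF
typedef unsigned int my_uint16;
#else
#error "no 16 bit unsigned type found"
#endif

/* Pick 32 bit types the same way: int first, then long. */
#if UINT_MAX == 0xFFFFFFFF
typedef int my_int32;
typedef unsigned int my_uint32;
#elif ULONG_MAX == 0xFFFFFFFF
typedef long my_int32;
typedef unsigned long my_uint32;
#else
#error "no 32 bit type found"
#endif

/* Fail the build rather than silently use the wrong width. */
static_assert(sizeof(my_uint16) * CHAR_BIT == 16, "my_uint16 is not 16 bits");
static_assert(sizeof(my_uint32) * CHAR_BIT == 32, "my_uint32 is not 32 bits");

/* Reports the width that was selected, for callers that want to log it. */
unsigned int my_uint32_bits(void)
{
    return (unsigned int)(sizeof(my_uint32) * CHAR_BIT);
}
```

Every project that does this ends up with its own spelling of the same types, which is part of why standard names are attractive.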

3) This is my recently acquired perspective, but it isn't new.

 Given that #1 is "correct but rare", and that #2 is
  full of "exact":

  char, short, int, long are funny names with not particularly
   useful specifications. #2 is a little sleazy (less so if autoconfed/limits.h)
   Unless you are really adhering to the strict spec, don't use them.
   If you are in fact indexing a "small" array, they might suffice,
   but is it worth it? worth having these types?

   Theory: 16 bit machines are irrelevant and 32 bit integers
    are perfectly efficient on 64 bit machines, and 64 bit integers
    are universally available (?) and reasonably efficient (?),
    so feel free to use them if there is a need.

  As well, 4gig remains a large capacity in most contexts, so feel
    free to use explicitly 32 bit integers.

  However file sizes and offsets should really always be 64 bits.
    Any code still requiring 32 bit file offsets/sizes is unfortunate.
    That includes PE32+ imho, the file format for .exes/.dlls on Win64.

  Be clear and unsleazy and adopt new names that represent well
    their specification and actual use.

  int<n>_t is exactly n bits in size and signed
  uint<n>_t is exactly n bits in size and unsigned
  some names are chosen for unsigned and signed integers with
   the exact size of a pointer
  For n=8,16,32 all four types exist, and probably 64.
  And pointer-sized types exist.

 If you really feel your capacity limits should scale with address space size, or need
to store a pointer in an integer, use size_t or uintptr_t or intptr_t, etc.
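
In C terms, perspective 3 is roughly what <stdint.h> provides. A minimal sketch, assuming a C99 or later compiler on a platform that supplies uintptr_t (it is formally optional); the names pointer_round_trips, file_offset and offset_past_4gig are hypothetical.

```c
#include <assert.h>   /* assert, for callers that want to check the results */
#include <stdint.h>   /* int64_t, uintptr_t */

/* Store a pointer in an integer and get the same pointer back.
   uintptr_t exists for exactly this; int and long make no such promise. */
int pointer_round_trips(void)
{
    int object = 42;
    uintptr_t bits = (uintptr_t)(void *)&object;
    int *back = (int *)(void *)bits;
    return back == &object && *back == 42;
}

/* A file offset kept at 64 bits regardless of the host word size. */
typedef int64_t file_offset;

/* One byte past the 4 gig mark, which a 32 bit offset cannot represent. */
file_offset offset_past_4gig(void)
{
    return (file_offset)4 * 1024 * 1024 * 1024 + 1;
}
```

The name states the width, so the code reads the same on 32 bit and 64 bit hosts.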

Modula-3's position here adds that INTEGER is the exact
  size of a pointer and signed. It is identical to ptrdiff_t
  or intptr_t. CARDINAL is the exact size but omits the bottom "half"
  of the range, and does not, I believe, extend the top "half".

Now, I also realize, that m3-libs/m3core/src/unix is a fairly mechanical
translation of /usr/include, and /usr/include does not necessarily
take perspective #3. So the "funny" names are useful for a human
mechanical translation. But the precise names can still be used instead.

Here is an exception I said I would detail:

irix-5.2/utypes.i3:
  int64_t      = RECORD val := ARRAY[0..1] OF int32_t {0,0}; END;
  uint64_t     = int64_t;

This is different in at least two ways that I see.
 - default initialization to zero
 - 32 bit alignment instead of 64 bit alignment

I tend to assume that the alignment is actually wrong,
however all the uses in Usignal appear unaffected, as they are always preceded
by a mix of int64_t and an even number of int32_t.
Either way, it is easy enough to preserve this for compatibility.
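
The alignment point can be illustrated with a C analogue of the IRIX record. This is a sketch only, assuming a C11 compiler (for _Alignof); the names pair64, with_pair and with_int64 are hypothetical and are not taken from Usignal.

```c
#include <assert.h>   /* assert, for callers that want to check the results */
#include <stddef.h>   /* offsetof, size_t */
#include <stdint.h>   /* int32_t, int64_t */

/* Analogue of RECORD val := ARRAY[0..1] OF int32_t END:
   64 bits of storage, but only the alignment of int32_t. */
typedef struct { int32_t val[2]; } pair64;

/* The same 64 bit field preceded by an even number of 32 bit fields. */
struct with_pair  { int32_t a; int32_t b; pair64  x; };
struct with_int64 { int32_t a; int32_t b; int64_t x; };

size_t pair_offset(void)    { return offsetof(struct with_pair, x); }
size_t int64_offset(void)   { return offsetof(struct with_int64, x); }
size_t pair_alignment(void) { return _Alignof(pair64); }
```

With two int32_t fields in front, x lands at offset 8 in both structs, so the weaker alignment of the record form makes no difference to the layout. After an odd number of int32_t fields the two layouts can diverge on ABIs that align int64_t to 8 bytes.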

I would like to continue, where easy and clear, to reduce the "size" of m3-libs/m3core/src/unix.
Making these types portably available helps that.
For example -- Uin.m3 need not be duplicated at all.
But then it either must use the presently more portable unsigned_short and unsigned,
or uint16_t and uint32_t should be made always available, either by adding them
to all the various Utypes.i3, or the one Ctypes.i3, or a new place.

Darwin currently has four Upthread.i3 files (one is dead), but needs either only two, or one
with the sizes abstracted out. I don't know if PPC64_DARWIN will need its own yet,
I don't have one of these machines yet.

I would like to go ahead with this stuff *today*.
It takes some exertion of patience for me to stop and send this first. :)

 - Jay