[M3devel] INTEGER

Tue Apr 27 16:40:47 CEST 2010

A general discussion:  (More to come on Mika's specific proposal)

The problem with using subranges and only one integer type lies in the
size of intermediate results of operators.  This is sometimes
important for programmers to manage carefully.  All the more so in the
age of network insecurity, where system crackers exploit boundary
cases that would never occur in legitimate use.

In original Modula-3, this is handled by making all intermediate
results have type INTEGER, even if the operands have subrange types.
This is a good compromise among conflicting needs, with these
properties:

1) It is a simple rule for both compilers and programmers, and it
    satisfies the language properties that every expression has a
    unique type and it does not depend on how the expression is used.

2) Assuming the implementors choose the native word size for INTEGER,
    it is the most efficient arithmetic available.

3) It's the most liberal rule that doesn't have an efficiency penalty.

4) But it doesn't work when a value range greater than the native word
    size is needed for correctness.

The problem arises when the application requirements make 4) relevant.

Making INTEGER wider than the native word size necessarily sacrifices
either 1) or 2).  If all arithmetic were done in, say, double word
size, there would be an efficiency penalty on it all.  Not huge, but
probably a factor of two for the best case of machine and operator.
Integer arithmetic is especially sensitive to this kind of slowdown,
since it is used in so many very basic ways.

Any attempt at defining some system that does intermediate results
conditionally in different sizes inevitably gets messy.  Having two
(or more) integer base types is about as simple as it can get.  It
satisfies 1) and gives a lot more slack before 4) becomes relevant.
It gives programmers control over the efficiency/range tradeoff in
a way that is about as simple to define as possible.

Having different intermediate result sizes with the same base type
gets a lot messier.  To give satisfactory results, it would probably
have to make types of operator expressions depend on how they are
used, violating a fundamental principle of Modula-3.  At best, I think
the rules would be a lot more complicated that just having two integer
base types.

But anyone is welcome to prove me wrong with a proposal.  Just be sure
it completely defines all the type analysis and intermediate result
size cases for the subranges, operators, and constant expressions, and
allows programmers to control which size is used.  And then see if it
is simpler than separate integer types.

Of course, there is always the library approach, like BigInteger  It's
much more flexible and much less efficient in general.  Not to mention
not as syntactically sweet.  Two integer sizes seems like a good
intermediate solution.

Mika Nystrom wrote:
> hendrik at topoi.pooq.com writes:
>> On Fri, Apr 23, 2010 at 11:52:14AM -0700, Mika Nystrom wrote:
>>> Is the infinite extent of the TEXT type a serious implementation difficulty? :-)
>> Yes, actually.  It requires heap storage allocation and indirection.  That's just what 
>> it costs, so we pay it, since fixed-width strings aren't all that useful.
>>
>> But integer ranges of a size that matches the application (instead of the hardware)
>> are quite feasible and useful; I'd rather not pay the proce of dynamic heap allocation 
>> for each integer.
>>
>> -- hendrik
> 
> Ok here I have a proposal, modeled on how Modula-3 does arrays.
> 
> The type specifier LONGINT to be valid in every context where an open
> array is valid.
> 
> A subrange of LONGINT to be treated like a non-open array.
> 
> NEW(REF LONGINT, ..) would take two "size" specifiers, a minimum and a maximum,
> of type LONGINT.
> 
> For L, U, LONGINT literals,
> 
> NEW(REF LONGINT, L, U)
> 
> would produce the same thing as 
> 
> NEW(REF [L..U])
> 
> NEW(REF LONGINT, ..) is needed for dynamically sized LONGINTs.
> 
> And then overload the arithmetic operations, the way arrays and REF RECORDs are, so
> that for b and c REF LONGINT,
> 
> b + c = b^ + c = b + c^ = b^ + c^
> 
> Assignments carry an implicit range check.
> 
> Assignment rules the same as INTEGER and its subranges.
> 
> Use VAL to convert...?
> 
> In short, expressions involving LONGINT have type LONGINT, but no "bare" LONGINT can
> ever be created.  
> 
> However note the following, perfectly legal.
> 
>    WITH a = b + c,
>         r = NEW(REF LONGINT, a, a) DO
>      r^ := a
>    END
> 
> Also legal, note parameters:
> 
> PROCEDURE Collatz(READONLY a : LONGINT; VAR b : LONGINT) =
>   BEGIN
>     CASE a MOD 2L OF
>       0L => b := a DIV 2L
>     | 
>       1L => b := 3L * a + 1L
>     END 
>   END Collatz;
> 
> VALUE a : LONGINT would be discouraged.  Returning LONGINT
> would *not* be legal.  ARRAY OF LONGINT would mean the LONGINTs
> are limited to the same range, and used in NEW would take three
> arguments, one of type INTEGER for the array size and two of type
> LONGINT to specify the LONGINTs' bounds.
> 
> Perhaps...
> 
> PROCEDURE Collatz2(READONLY a : LONGINT; VAR b : LONGINT) 
>   RAISES { Range } =
>   BEGIN
>     CASE a MOD 2L OF
>       0L => b := a DIV 2L
>     | 
>       1L => 
>       WITH new = 3L * a + 1L DO (* is new actually stored anywhere!? *)
>         IF new > LAST(b) THEN 
>           RAISE Range
>         ELSE
>           b := new
>         END
>       END
>     END 
>   END Collatz2;
> 
> My proposal gives you both heap-allocated dynamically sized LONGINT and
> stack-allocated fixed-sized longints and allows you to mix them at will.
> Collatz2 does seem a little bit disturbing, however.  Can we rely on
> alloca "under the covers"?  (Architectures that don't provide alloca
> could heap-allocate new in that example, *if* they need to.)
> 
> Another thing that occurs to me is that while + and - are straightforward,
> *, DIV, and MOD are things you may well want to use very special and
> specific algorithms for in a given application.  It makes sense to let the
> user replace the code for these operations.  And now it begins to look
> like C++.  Sigh.
> 
> Hendrik, I do think WITH causes similar problems even for strictly
> range-limited numbers like the ones you are proposing?  (And so does
> expression evaluation, of course...)
> 
> Now is it even reasonable to have this built in?  What if the user wants
> lazy evaluation?  Are non-lazy LONGINTs very useful?
> 
> Of course, I have to admit I am using a hardware simulation environment
> where non-lazy LONGINTs would indeed be very useful.  
> 
>      Mika
> 
>