[M3devel] further reducing cloned headers wrt pthread?

Thu Feb 5 00:04:47 CET 2009

On 5 Feb 2009, at 09:11, Jay wrote:

>
>>> I am very leery of this proposal -- the code will be inherently
>>> opaque and unmaintainable. I don't see any advantage to it.
>
>
> The entire proposal or the optimizations?

The C allocation of M3 objects, without going through NEW.  There are  
a whole slew of reasons why I'd like to avoid that.

> The original unoptimized proposal seems like a small change mostly.
> I checked and the indirection/heap allocation is already there
> for cond and mutex, but not for pthread_t itself.
> Factoring out the size I think is a small change.

Yes, possibly.  Let me look at it.

> On the other hand, we can also optimize it, pretty much
> locking in the platform-specificity. It's a tough decision to me.
> I don't mind the deoptimizations of const-to-var, or adding
> some function calls, but heap allocs imho are among the things
> to definitely avoid if not needed. These are untraced as well,
> so the argument that Modula-3 heap alloc is efficient doesn't apply.

Right, just that you lose some compiler knowledge of where allocations  
occur.  I have some work I am doing where I analyse code for  
allocation sites.

> One caveat that bothers me though is, like with sem_t,
> I don't want to have types that are declared "incorrectly".
> I'd like types that you can only have references to.
> Probably in that case "give up", declare them as ADDRESS,
> losing the type safety -- pthread_cond_foo could take a mutex_t
> and no compilation error.

Sure.

> The idea of making them all ADDRESS and adding C functions to
> alloc/cleanup is also good imho. That allows for one of the optimized
> forms -- not where the space is at the end of the Thread.T, but where
> the ADDRESS field is the data itself.

Possibly.

> I got hung up on pthread_attr_t here because it was efficiently
> stack allocated and this proposal would have really deoptimized that.
> The C code I showed avoids that though.
> Albeit only in the face of creating a thread -- an extra heap
> allocation per thread create is probably not a big deal.

Hmm.

> Clearly I'm ambivalent.

OK, let me think on it.

> Later,
>
> - Jay
>
> ----------------------------------------
>> From: jay.krell at cornell.edu
>> To: hosking at cs.purdue.edu
>> Date: Wed, 4 Feb 2009 09:42:12 +0000
>> CC: m3devel at elegosoft.com
>> Subject: Re: [M3devel] further reducing cloned headers wrt pthread?
>>
>>
>> It gains something, but maybe it isn't enough
>> to be worthwhile. The issue is in the subjectivity.
>>
>>
>> It would remove e.g. the following system-dependent lines:
>>
>>
>> Linux:
>> pthread_t = ADDRESS;
>> pthread_cond_t = RECORD data: ARRAY[1..6] OF LONGINT; END;
>> pthread_key_t = uint32_t;
>>
>>
>> Linux/32:
>> pthread_attr_t = ARRAY[1..9] OF INTEGER;
>> pthread_mutex_t = ARRAY[1..6] OF INTEGER;
>>
>>
>> Linux/64:
>> pthread_attr_t = ARRAY[1..7] OF INTEGER;
>> pthread_mutex_t = ARRAY[1..5] OF INTEGER;
>>
>>
>> FreeBSD:
>> pthread_t = ADDRESS;
>> pthread_attr_t = ADDRESS;
>> pthread_mutex_t = ADDRESS;
>> pthread_cond_t = ADDRESS;
>> pthread_key_t = int;
>>
>>
>> HP-UX:
>> (* trick from darwin-generic/Upthread.i3 *)
>> X32 = ORD(BITSIZE(INTEGER) = 32);
>> X64 = ORD(BITSIZE(INTEGER) = 64);
>> pthread_t = int32_t; (* opaque *)
>> pthread_attr_t = int32_t; (* opaque *)
>> pthread_mutex_t = RECORD opaque: ARRAY [1..11 * X64 + 22 * X32] OF INTEGER; END;
>>   (* 88 opaque bytes with size_t alignment *)
>> pthread_cond_t = RECORD opaque: ARRAY [1..7 * X64 + 14 * X32] OF INTEGER; END;
>>   (* 56 opaque bytes with size_t alignment *)
>> pthread_key_t = int32_t; (* opaque *)
>>
>>
>> Cygwin:
>> pthread_t = ADDRESS; (* opaque *)
>> pthread_attr_t = ADDRESS; (* opaque *)
>> pthread_mutex_t = ADDRESS; (* opaque *)
>> pthread_cond_t = ADDRESS; (* opaque *)
>> pthread_key_t = ADDRESS; (* opaque *)
>>
>>
>> Solaris:
>> pthread_t = int32_t; (* opaque *)
>> pthread_attr_t = int32_t; (* opaque *)
>> pthread_mutex_t = RECORD opaque: ARRAY [1..4] OF LONGINT; END;
>>   (* 32 bytes with 64 bit alignment *)
>> pthread_cond_t = RECORD opaque: ARRAY [1..2] OF LONGINT; END;
>>   (* 16 bytes with 64 bit alignment *)
>> pthread_key_t = int32_t; (* opaque *)
>>
>>
>> Darwin: (only ppc32 currently)
>> pthread_t = INTEGER; (* opaque *)
>> pthread_attr_t = RECORD opaque: ARRAY [1..10] OF INTEGER; END;
>> pthread_mutex_t = RECORD opaque: ARRAY [1..11] OF INTEGER; END;
>> pthread_cond_t = RECORD opaque: ARRAY [1..7] OF INTEGER; END;
>> pthread_key_t = INTEGER; (* opaque *)
>>
>>
>> (plus AIX, Irix, VMS, Tru64.)
>>
>>
>> Another approach would be to make them all ADDRESS and introduce a
>> portable C layer of "varying thickness", using the same logic.
>> It would look just like the native pthreads, but there'd be extra
>> allocate/cleanup calls -- to do the heap alloc/cleanup when the
>> underlying types are larger than addresses.
>> The two layers would be clear and simple, the cost would be the same,
>> but there would be the conceptual cost of two simple layers instead
>> of just one slightly complicated layer.
>>
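
A minimal C sketch of what such a layer might look like for the mutex case
only. The m3_mutex_* names are hypothetical and do not come from the tree;
the Modula-3 side would declare each handle as ADDRESS and would never need
to know sizeof(pthread_mutex_t).

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical wrapper names; Modula-3 would see each handle as ADDRESS. */

/* Allocate and initialize a mutex; returns NULL on failure. */
void *m3_mutex_new(void)
{
    pthread_mutex_t *m = malloc(sizeof *m);
    if (m == NULL)
        return NULL;
    if (pthread_mutex_init(m, NULL) != 0) {
        free(m);
        return NULL;
    }
    return m;
}

/* Destroy and free a mutex created by m3_mutex_new; NULL is ignored. */
void m3_mutex_delete(void *handle)
{
    pthread_mutex_t *m = handle;
    if (m == NULL)
        return;
    pthread_mutex_destroy(m);
    free(m);
}

/* Thin pass-throughs: the handle is the address of the real mutex. */
int m3_mutex_lock(void *handle)   { return pthread_mutex_lock(handle); }
int m3_mutex_unlock(void *handle) { return pthread_mutex_unlock(handle); }
```

On a platform where the native type is already pointer-sized, the new/delete
pair could shrink to init/destroy with the ADDRESS field holding the data
itself, which is the "varying thickness" part.
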
>>
>> Another approach is maybe to make them all addresses on new platforms
>> and introduce the C layer only on new platforms. Again, about the only
>> change in the Modula-3 code is extra alloc/cleanup calls.
>>
>>
>> And again, some/all of the code already has the indirection/heap  
>> allocation unconditionally.
>>
>>
>> And again, maybe not worth it. I show all the system-dependent code,
>> attempting to portray it in its worst light by showing all of it, but
>> maybe it's really not a lot.
>>
>>
>> For the attr type, we can do something specific to its use.
>> There is just one use, and we can address it with the following  
>> function written in C..
>> eh..I'll send a diff later tonight/this week I think.
>>
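
The diff referred to above is not part of this message. Purely as an
illustration of the idea (keep pthread_attr_t on the C stack so Modula-3 never
sees its size), a helper could look roughly like this. m3_thread_create and
m3_example_start are hypothetical names, and the stack_size parameter is an
assumption about what the single use needs.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical helper: the attr object lives only on this C stack frame.
 * stack_size == 0 keeps the platform default.
 * Returns 0 on success or a pthread error number. */
int m3_thread_create(pthread_t *thread, size_t stack_size,
                     void *(*start)(void *), void *arg)
{
    pthread_attr_t attr;
    int err = pthread_attr_init(&attr);
    if (err != 0)
        return err;
    if (stack_size != 0)
        err = pthread_attr_setstacksize(&attr, stack_size);
    if (err == 0)
        err = pthread_create(thread, &attr, start, arg);
    pthread_attr_destroy(&attr);
    return err;
}

/* Trivial start routine, for illustration only: returns its argument. */
void *m3_example_start(void *arg)
{
    return arg;
}
```

No heap allocation is involved, so this form would not deoptimize the
existing stack-allocated attr.
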
>>
>> pthread_t and pthread_key_t always happen to be address-sized or
>> smaller. Maybe just declare them both to be address and assert their
>> size in some C code. That might waste a few bytes esp. on 64 bit
>> platforms, or it might merely fill in the padding-for-alignment.
>>
>> For example, we have:
>>
>>
>> TYPE
>> Activation = UNTRACED REF RECORD
>> (* global doubly-linked, circular list of all active threads *)
>> next, prev: Activation := NIL; (* LL = activeMu *)
>> (* thread handle *)
>> handle: pthread_t; (* LL = activeMu *)
>> (* base of thread stack for use by GC *)
>> stackbase: ADDRESS := NIL; (* LL = activeMu *)
>>
>>
>> so on 64 bit platforms where pthread_t is a 32bit integer, it is
>> taking up 64 bits anyway.
>> There are two static pthread_key_ts, so making them address would
>> waste 8 bytes on some/many 64bit platforms.
>>
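
The size assertion suggested above could be a compile-time check in whatever
C file accompanies the interface. A sketch, assuming a C11 compiler (older
compilers would need the negative-array-size trick instead):

```c
#include <assert.h>
#include <pthread.h>

/* Fails to compile on any platform where the Modula-3 declaration
 * "pthread_t = ADDRESS" or "pthread_key_t = ADDRESS" would be too small. */
_Static_assert(sizeof(pthread_t) <= sizeof(void *),
               "pthread_t does not fit in an ADDRESS");
_Static_assert(sizeof(pthread_key_t) <= sizeof(void *),
               "pthread_key_t does not fit in an ADDRESS");
```

This only checks size. Where the native type is smaller than an address, the
C wrappers would still have to convert between the two representations rather
than pass the ADDRESS storage through directly.
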
>>
>> Leaving only cond and mutex.
>> Some of the platforms declare more types such as rwlock,  
>> rwlockattr, but they are never used.
>> rwlock is a useful type though.
>>
>>
>> - Jay
>>
>>
>> ----------------------------------------
>>> From: hosking at cs.purdue.edu
>>> To: jay.krell at cornell.edu
>>> Date: Wed, 4 Feb 2009 12:53:54 +1100
>>> CC: m3devel at elegosoft.com
>>> Subject: Re: [M3devel] further reducing cloned headers wrt pthread?
>>>
>>> I am very leery of this proposal -- the code will be inherently
>>> opaque and unmaintainable. I don't see any advantage to it.
>>>
>>> On 4 Feb 2009, at 11:06, Jay wrote:
>>>
>>>>
>>>> There are a few possibilities:
>>>>
>>>>
>>>> Roughly:
>>>>
>>>> Where there is
>>>>
>>>> INTERFACE Upthread;
>>>>
>>>> TYPE
>>>> pthread_t = ... system specific ...
>>>> pthread_cond_t = ... system specific ...
>>>> pthread_mutex_t = ... system specific ...
>>>>
>>>> PROCEDURE pthread_thread_init_or_whatever(VAR pthread_t);
>>>> PROCEDURE pthread_mutex_init_or_whatever(VAR pthread_mutex_t);
>>>> PROCEDURE pthread_cond_init_or_whatever(VAR pthread_cond_t);
>>>>
>>>> MODULE PThread;
>>>> VAR
>>>> a: pthread_t;
>>>> b: pthread_cond_t;
>>>> c: pthread_mutex_t;
>>>>
>>>> PROCEDURE Foo() =
>>>> BEGIN
>>>> Upthread.pthread_thread_init_or_whatever(a);
>>>> Upthread.pthread_cond_init_or_whatever(b);
>>>> Upthread.pthread_mutex_init_or_whatever(c);
>>>> END Foo;
>>>>
>>>> change to:
>>>>
>>>> INTERFACE Upthread;
>>>>
>>>> TYPE
>>>> pthread_t = RECORD END; (* or whatever is correct for an opaque,
>>>>                            preferably unique, type *)
>>>> pthread_cond_t = RECORD END; (* ditto *)
>>>> pthread_mutex_t = RECORD END; (* ditto *)
>>>>
>>>> PROCEDURE pthread_thread_init_or_whatever(VAR pthread_t);
>>>> PROCEDURE pthread_mutex_init_or_whatever(VAR pthread_mutex_t);
>>>> PROCEDURE pthread_cond_init_or_whatever(VAR pthread_cond_t);
>>>>
>>>>
>>>> INTERFACE PThreadC.i3
>>>>
>>>> PROCEDURE GetA(): UNTRACED REF Upthread.pthread_t;
>>>> PROCEDURE GetB(): UNTRACED REF Upthread.pthread_cond_t;
>>>> PROCEDURE GetC(): UNTRACED REF Upthread.pthread_mutex_t;
>>>>
>>>> or possibly extern VAR
>>>>
>>>> PThreadC.c
>>>>
>>>> static pthread_t a; /* pthread_t has no static initializer */
>>>> static pthread_cond_t b = PTHREAD_COND_INITIALIZER;
>>>> static pthread_mutex_t c = PTHREAD_MUTEX_INITIALIZER;
>>>>
>>>> pthread_t* GetA(void) { return &a; }
>>>>
>>>> pthread_cond_t* GetB(void) { return &b; }
>>>>
>>>> pthread_mutex_t* GetC(void) { return &c; }
>>>>
>>>> MODULE PThread;
>>>> VAR
>>>> a := PThreadC.GetA();
>>>> b := PThreadC.GetB();
>>>> c := PThreadC.GetC();
>>>>
>>>> PROCEDURE Foo() =
>>>> BEGIN
>>>> Upthread.pthread_thread_init_or_whatever(a^);
>>>> Upthread.pthread_cond_init_or_whatever(b^);
>>>> Upthread.pthread_mutex_init_or_whatever(c^);
>>>> END Foo;
>>>>
>>>> or, again, possibly they are variables and it goes a little
>>>> smaller/quicker:
>>>>
>>>> FROM PThreadC IMPORT a, b, c;
>>>>
>>>>
>>>> PROCEDURE Foo() =
>>>> BEGIN
>>>> Upthread.pthread_thread_init_or_whatever(a);
>>>> Upthread.pthread_cond_init_or_whatever(b);
>>>> Upthread.pthread_mutex_init_or_whatever(c);
>>>> END Foo;
>>>>
>>>> I think that is pretty cut and dry, no controversy.
>>>>
>>>> What is less clear is what to do with non-statically allocated
>>>> variables.
>>>>
>>>> Let's say:
>>>>
>>>> MODULE PThread;
>>>>
>>>> TYPE T = RECORD
>>>> a:int;
>>>> b:pthread_t;
>>>> END;
>>>>
>>>> PROCEDURE CreateT():T=
>>>> VAR
>>>> t := NEW(T);
>>>> BEGIN
>>>> Upthread.init_or_whatever(t.b);
>>>> RETURN t;
>>>> END;
>>>>
>>>> PROCEDURE DisposeT(t:T)=
>>>> BEGIN
>>>> IF t = NIL THEN RETURN END;
>>>> Upthread.pthread_cleanup_or_whatever(t.b);
>>>> DISPOSE(t);
>>>> END;
>>>>
>>>> The desire is something that does not know the size of pthread_t,
>>>> something like:
>>>>
>>>> TYPE T = RECORD
>>>> a:int;
>>>> b:UNTRACED REF pthread_t;
>>>> END;
>>>>
>>>>
>>>> PROCEDURE CreateT():T=
>>>> VAR
>>>> t := NEW(T);
>>>> BEGIN
>>>> t.b := LOOPHOLE(UNTRACED REF pthread_t, NEW(UNTRACED REF ARRAY OF
>>>> CHAR, Upthread.pthread_t_size));
>>>> (* Though I really wanted t.b :=
>>>> RTAllocator.MallocZeroed(Upthread.pthread_t_size); *)
>>>> Upthread.init_or_whatever(t.b^);
>>>> RETURN t;
>>>> END;
>>>>
>>>> PROCEDURE DisposeT(t:T)=
>>>> BEGIN
>>>> IF t = NIL THEN RETURN END;
>>>> Upthread.pthread_cleanup_or_whatever(t.b^);
>>>> DISPOSE(t.b);
>>>> DISPOSE(t);
>>>> END;
>>>>
>>>>
>>>> However that incurs an extra heap allocation, which is not great.
>>>> In at least one place, the pointer-indirection-and-heap-allocation
>>>> is already there
>>>> so this isn't a deoptimization. However "reoptimizing" it might be
>>>> nice.
>>>>
>>>>
>>>> What I would prefer is a pattern I often use in C -- merging
>>>> allocations, something like this,
>>>> /assuming/ t is untraced, which I grant it might not be.
>>>>
>>>>
>>>> And ensuring that BYTESIZE(T) is properly aligned:
>>>>
>>>>
>>>> PROCEDURE CreateT():UNTRACED REF T=
>>>> VAR
>>>> p : ADDRESS;
>>>> t : UNTRACED REF T;
>>>> BEGIN
>>>> (* Again I would prefer RTAllocator.MallocZeroed *)
>>>> p := NEW(UNTRACED REF ARRAY OF CHAR, Upthread.pthread_t_size +
>>>> BYTESIZE(T));
>>>> t := LOOPHOLE(UNTRACED REF T, p);
>>>> t.b := LOOPHOLE(UNTRACED REF Upthread.pthread_t, p + BYTESIZE(T));
>>>> Upthread.init_or_whatever(t.b^);
>>>> RETURN t;
>>>> END;
>>>>
>>>>
>>>> That is -- opaque types, size not known at compile-time, but size
>>>> known at runtime, and
>>>> do not incur an extra heap allocation for lack of knowing sizes at
>>>> compile-time.
>>>>
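
For comparison, the same merged-allocation pattern written directly in C.
This is only a sketch under stated assumptions: T, create_t, dispose_t and
m3_mutex_size are hypothetical names, a mutex stands in for the opaque
payload, C11 is assumed for _Alignof(max_align_t), and the header is rounded
up so the trailing storage is suitably aligned (the "properly aligned"
condition mentioned above).

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical record: a fixed header followed, in the same heap block,
 * by storage whose size the caller learns only at run time. */
typedef struct {
    int a;
    void *b; /* points just past the rounded-up header */
} T;

/* Size exported to a caller that does not know pthread_mutex_t. */
size_t m3_mutex_size(void) { return sizeof(pthread_mutex_t); }

/* Round n up to a multiple of align (align must be a power of two). */
static size_t round_up(size_t n, size_t align)
{
    return (n + align - 1) & ~(align - 1);
}

/* One zeroed allocation holds both the header and the mutex. */
T *create_t(void)
{
    size_t header = round_up(sizeof(T), _Alignof(max_align_t));
    char *p = calloc(1, header + m3_mutex_size());
    T *t;
    if (p == NULL)
        return NULL;
    t = (T *)p;
    t->b = p + header;
    if (pthread_mutex_init(t->b, NULL) != 0) {
        free(p);
        return NULL;
    }
    return t;
}

/* One free releases both; NULL is ignored. */
void dispose_t(T *t)
{
    if (t == NULL)
        return;
    pthread_mutex_destroy(t->b);
    free(t);
}
```

The pointer field b costs one word, but it keeps every later use independent
of the header size.
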
>>>>
>>>> For the statically allocated variables I think there is no
>>>> controversy.
>>>> There might be a tiny bit of overhead in the use, but it'd be very
>>>> small, and possibly
>>>> even removable in the future. I'd rather avoid the variables, as
>>>> all writable
>>>> data is to be avoided. Read only pages are better and all that,
>>>> but ok..
>>>>
>>>>
>>>> However the value is mainly realized only if statically and
>>>> dynamically allocated variables are handled.
>>>>
>>>> The result of this would be further reduction in
>>>> platform-specificity when cloning C headers into Modula-3
>>>> interfaces, i.e. less work to bring up new platforms.
>>>>
>>>>
>>>> - Jay
>>>>
>>>>
>>>> ----------------------------------------
>>>>> From: hosking at cs.purdue.edu
>>>>> To: jay.krell at cornell.edu
>>>>> Date: Wed, 4 Feb 2009 09:54:01 +1100
>>>>> CC: m3devel at elegosoft.com
>>>>> Subject: Re: [M3devel] further reducing cloned headers wrt pthread?
>>>>>
>>>>> I suggest you come up with a proposal for us to look over before  
>>>>> you
>>>>> change the code base for this.
>>>>>
>>>>> On 4 Feb 2009, at 09:05, Jay wrote:
>>>>>
>>>>>>
>>>>>>> Hmm, yes, you are right that there is a possible alignment
>>>>>>> issue. I am used to pthread_mutex_t being a simple reference.
>>>>>>> But surely in C the type of the pthread_mutex_t struct would have
>>>>>>> appropriate alignment padding anyway so as to allow allocation
>>>>>>> using malloc(sizeof(pthread_mutex_t))? So, it all should just
>>>>>>> work, right?
>>>>>>
>>>>>>
>>>>>> I think "the other way around" and same conclusion.
>>>>>> malloc should return something "maximally aligned" so that
>>>>>>
>>>>>> pthread_mutex_t* x = (pthread_mutex_t*)
>>>>>> malloc(sizeof(pthread_mutex_t));
>>>>>>
>>>>>>
>>>>>> works. pthread_mutex_t doesn't need the padding, malloc does,  
>>>>>> so to
>>>>>> speak.
>>>>>>
>>>>>>
>>>>>> Just as long as we don't have
>>>>>>
>>>>>>
>>>>>> TYPE Foo = RECORD
>>>>>> a: pthread_mutex_t;
>>>>>> b: pthread_mutex_t;
>>>>>> c: pthread_t;
>>>>>> d: pthread_t;
>>>>>> e: pthread_cond_t;
>>>>>> f: pthread_cond_t;
>>>>>> END;
>>>>>>
>>>>>>
>>>>>> and such, ok.
>>>>>>
>>>>>>
>>>>>> malloc on NT returns something with 2 * sizeof(void*) alignment.
>>>>>> I think on Win9x only 4-byte alignment, thus there is
>>>>>> _aligned_malloc for dealing with SSE stuff.
>>>>>> Something like that.
>>>>>>
>>>>>>
>>>>>> I didn't realize untraced allocations were basically just  
>>>>>> malloc but
>>>>>> indeed they are.
>>>>>>
>>>>>>
>>>>>> I'm still mulling over the possible deoptimizations here.
>>>>>> I'm reluctant to increase heap allocations.
>>>>>>
>>>>>>
>>>>>>
>>>>>> - Jay
>>>>>
>>>