[M3devel] M3 programming problem : GC efficiency / per-thread storage areas?

Fri Oct 17 10:32:28 CEST 2008

Ok I am sorry I am slow to pick up on this.

I take it the problem is actually the Upthread.getspecific routine,
which itself calls something get_curthread somewhere inside pthreads,
which in turn involves a context switch to the supervisor---the identity
of the current thread is just not accessible anywhere in user space.
Also explains why this program runs faster with my old PM3, which uses
longjmp threads.

The only way to avoid it (really) is to pass a pointer to the
Thread.T of the currently executing thread in the activation record
of *every* procedure, so that allocators can find it when necessary....
but that is very expensive in terms of stack memory.

Or I can just make a structure like that that I pass around where
I need it in my own program.  Thread-specific and user-managed.

I believe I have just answered all my own questions, but I hope
Tony will correct me if my answers are incorrect.

    Mika

Tony Hosking writes:
>I suspect part of the overhead of allocation in the new code is the  
>need for thread-local allocation buffers, which means we need to  
>access thread-local state.  We really need an efficient way to do  
>that, but pthreads thread-local accesses may be what is killing you.
>
>On 17 Oct 2008, at 00:30, Mika Nystrom wrote:
>
>> Hi Tony,
>>
>> I figured you would chime in!
>>
>> Yes, @M3noincremental seems to make things consistently a tad bit
>> slower (but a very small difference), on both FreeBSD and Linux.
>> @M3nogc makes a bigger difference, of course.
>>
>> Unfortunately I seem to have lost the code that did a lot of memory
>> allocations.  My tricks (as described in the email---and others!)
>> have removed most of the troublesome memory allocations, but now
>> I'm stuck with the mutex instead...
>>
>>      Mika
>>
>> Tony Hosking writes:
>>> Have you tried running @M3noincremental?
>>>
>>> On 16 Oct 2008, at 23:32, Mika Nystrom wrote:
>>>
>>>> Hello Modula-3 people,
>>>>
>>>> As I mentioned in an earlier email about printing structures (thanks
>>>> Darko), I'm in the midst of coding an interpreter embedded in
>>>> Modula-3.  It's a Scheme interpreter, loosely based on Peter  
>>>> Norvig's
>>>> JScheme for Java (well it was at first strongly based, but more and
>>>> more loosely, if you know what I mean...)
>>>>
>>>> I expected that the performance of the interpreter would be much
>>>> better in Modula-3 than in Java, and I have been testing on two
>>>> different systems.  One is my ancient FreeBSD-4.11 with an old PM3,
>>>> and the other is CM3 on a recent Debian system.  What I am finding
>>>> is that it is indeed much faster than JScheme on FreeBSD/PM3  
>>>> (getting
>>>> close to ten times as fast on some tasks at this point), but on
>>>> Linux/CM3 it is much closer in speed to JScheme than I would like.
>>>>
>>>> When I started, with code that was essentially equivalent to  
>>>> JScheme,
>>>> I found that it was a bit slower than JScheme on Linux/CM3 and
>>>> possibly 2x as fast on FreeBSD/PM3.  On Linux/CM3, it appears to
>>>> spend most of its time in (surprise, surprise!) memory allocation
>>>> and garbage collection.  The speedup I have achieved between the
>>>> first implementation and now was due to the use of Modula-3  
>>>> constructs
>>>> that are superior to Java's, such as the use of arrays of RECORDs
>>>> to make small stacks rather than linked lists.  (I get readable
>>>> code with much fewer memory allocations and GC work.)
>>>>
>>>> Now, since this is an interpreter, I as the implementer have limited
>>>> control over how much memory is allocated and freed, and where it is
>>>> needed.  However, I can sometimes fall back on C-style memory
>>>> management,
>>>> but I would like to do it in a safe way.  For instance, I have
>>>> special-cased
>>>> evaluation of Scheme primitives, as follows.
>>>>
>>>> Under the "normal" implementation, a list of things to evaluate is
>>>> built up, passed to an evaluation function, and then the GC is left
>>>> to sweep up the mess.  The problem is that there are various tricky
>>> routes by which references can escape the evaluator, so you can't
>>>> just assume that what you put in is going to be dead right after
>>>> an eval and free it.  Instead, I set a flag in the evaluator, which
>>>> is TRUE if it is OK to free the list after the eval and FALSE if
>>>> it's unclear (in which case the problem is left up to the GC).
>>>>
>>>> For the vast majority of Scheme primitives, one can indeed free the
>>>> list right after the eval.  Now of course I am not interested
>>>> in unsafe code, so what I do is this:
>>>>
>>>> TYPE Pair = OBJECT first, rest : REFANY; END;
>>>>
>>>> VAR
>>>> mu := NEW(MUTEX);
>>>> free : Pair := NIL;
>>>>
>>>> PROCEDURE GetPair() : Pair =
>>>> BEGIN
>>>>   LOCK mu DO
>>>>     IF free # NIL THEN
>>>>       TRY
>>>>         RETURN free
>>>>       FINALLY
>>>>         free := free.rest
>>>>       END
>>>>     END
>>>>   END;
>>>>   RETURN NEW(Pair)
>>>> END GetPair;
>>>>
>>>> PROCEDURE ReturnPair(cons : Pair) =
>>>> BEGIN
>>>>   cons.first := NIL;
>>>>   LOCK mu DO
>>>>     cons.rest := free;
>>>>     free := cons
>>>>   END
>>>> END ReturnPair;
>>>>
>>>> my eval code looks like
>>>>
>>>> VAR okToFree : BOOLEAN; BEGIN
>>>>
>>>>  args := GetPair(); ...
>>>>  result := EvalPrimitive(args, (*VAR OUT*) okToFree);
>>>>
>>>>  IF okToFree THEN ReturnPair(args) END;
>>>>  RETURN result
>>>> END
>>>>
>>>> and this does work well.  In fact it speeds up the Linux
>>>> implementation
>>>> by almost 100% to recycle the lists like this *just* for the
>>>> evaluation of Scheme primitives.
>>>>
>>>> But it's still ugly, isn't it?  There's a mutex, and a global
>>>> variable.  And yes, the time spent messing with the mutex is
>>>> noticeable, and I haven't even made the code multi-threaded yet
>>>> (and that is coming!)
>>>>
>>>> So I'm thinking, what I really want is a structure that is attached
>>>> to my current Thread.T.  I want to be able to access just a single
>>>> pointer (like the free list) but be sure it is unique to my current
>>>> thread.  No locking would be necessary if I could do this.
>>>>
>>>> Does anyone have an elegant solution that does something like this?
>>>> Thread-specific "static" variables?  Just one REFANY would be enough
>>>> for a lot of uses...  seems to me this should be a frequently
>>>> occurring problem?
>>>>
>>>>    Best regards,
>>>>      Mika
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>