[M3devel] M3 programming problem : GC efficiency / per-thread storage areas?

Fri Oct 17 10:03:18 CEST 2008

Ok this suggests that using thread local state to get around the
problem won't help either.

Can I ask a question... I am looking at ThreadPThread.m3...

Why do you have to lock the slotMu in Self()?

PROCEDURE Self (): T =
  (* If not the initial thread and not created by Fork, returns NIL *)
  (* LL = 0 *)
  VAR
    me := GetActivation();
    t: T;
  BEGIN
    IF me = NIL THEN RETURN NIL END;
    WITH r = Upthread.mutex_lock(slotMu) DO <*ASSERT r=0*> END;
      t := slots[me.slot];
    WITH r = Upthread.mutex_unlock(slotMu) DO <*ASSERT r=0*> END;
    IF (t.act # me) THEN Die(ThisLine(), "thread with bad slot!") END;
    RETURN t;
  END Self;

Is it just because of AssignSlots?  If so.. it's actually a very rare
event that there would ever be a conflict, no?  (Only when "slots" is
extended?)

Can data be stored in an "Activation"?  Not TRACED data, obviously, hmm...

     Mika

Tony Hosking writes:
>I suspect part of the overhead of allocation in the new code is the  
>need for thread-local allocation buffers, which means we need to  
>access thread-local state.  We really need an efficient way to do  
>that, but pthreads thread-local accesses may be what is killing you.
>
>On 17 Oct 2008, at 00:30, Mika Nystrom wrote:
>
>> Hi Tony,
>>
>> I figured you would chime in!
>>
>> Yes, @M3noincremental seems to make things consistently a tad bit
>> slower (but a very small difference), on both FreeBSD and Linux.
>> @M3nogc makes a bigger difference, of course.
>>
>> Unfortunately I seem to have lost the code that did a lot of memory
>> allocations.  My tricks (as described in the email---and others!)
>> have removed most of the troublesome memory allocations, but now
>> I'm stuck with the mutex instead...
>>
>>      Mika
>>
>> Tony Hosking writes:
>>> Have you tried running @M3noincremental?
>>>
>>> On 16 Oct 2008, at 23:32, Mika Nystrom wrote:
>>>
>>>> Hello Modula-3 people,
>>>>
>>>> As I mentioned in an earlier email about printing structures (thanks
>>>> Darko), I'm in the midst of coding an interpreter embedded in
>>>> Modula-3.  It's a Scheme interpreter, loosely based on Peter  
>>>> Norvig's
>>>> JScheme for Java (well it was at first strongly based, but more and
>>>> more loosely, if you know what I mean...)
>>>>
>>>> I expected that the performance of the interpreter would be much
>>>> better in Modula-3 than in Java, and I have been testing on two
>>>> different systems.  One is my ancient FreeBSD-4.11 with an old PM3,
>>>> and the other is CM3 on a recent Debian system.  What I am finding
>>>> is that it is indeed much faster than JScheme on FreeBSD/PM3  
>>>> (getting
>>>> close to ten times as fast on some tasks at this point), but on
>>>> Linux/CM3 it is much closer in speed to JScheme than I would like.
>>>>
>>>> When I started, with code that was essentially equivalent to  
>>>> JScheme,
>>>> I found that it was a bit slower than JScheme on Linux/CM3 and
>>>> possibly 2x as fast on FreeBSD/PM3.  On Linux/CM3, it appears to
>>>> spend most of its time in (surprise, surprise!) memory allocation
>>>> and garbage collection.  The speedup I have achieved between the
>>>> first implementation and now was due to the use of Modula-3  
>>>> constructs
>>>> that are superior to Java's, such as the use of arrays of RECORDs
>>>> to make small stacks rather than linked lists.  (I get readable
>>>> code with much fewer memory allocations and GC work.)
>>>>
>>>> Now, since this is an interpreter, I as the implementer have limited
>>>> control over how much memory is allocated and freed, and where it is
>>>> needed.  However, I can sometimes fall back on C-style memory
>>>> management,
>>>> but I would like to do it in a safe way.  For instance, I have
>>>> special-cased
>>>> evaluation of Scheme primitives, as follows.
>>>>
>>>> Under the "normal" implementation, a list of things to evaluate is
>>>> built up, passed to an evaluation function, and then the GC is left
>>>> to sweep up the mess.  The problem is that there are various tricky
>>> routes by which references can escape the evaluator, so you can't
>>>> just assume that what you put in is going to be dead right after
>>>> an eval and free it.  Instead, I set a flag in the evaluator, which
>>>> is TRUE if it is OK to free the list after the eval and FALSE if
>>>> it's unclear (in which case the problem is left up to the GC).
>>>>
>>>> For the vast majority of Scheme primitives, one can indeed free the
>>>> list right after the eval.  Now of course I am not interested
>>>> in unsafe code, so what I do is this:
>>>>
>>>> TYPE Pair = OBJECT first, rest : REFANY; END;
>>>>
>>>> VAR
>>>> mu := NEW(MUTEX);
>>>> free : Pair := NIL;
>>>>
>>>> PROCEDURE GetPair() : Pair =
>>>> BEGIN
>>>>   LOCK mu DO
>>>>     IF free # NIL THEN
>>>>       TRY
>>>>         RETURN free
>>>>       FINALLY
>>>>         free := free.rest
>>>>       END
>>>>     END
>>>>   END;
>>>>   RETURN NEW(Pair)
>>>> END GetPair;
>>>>
>>>> PROCEDURE ReturnPair(cons : Pair) =
>>>> BEGIN
>>>>   cons.first := NIL;
>>>>   LOCK mu DO
>>>>     cons.rest := free;
>>>>     free := cons
>>>>   END
>>>> END ReturnPair;
>>>>
>>>> my eval code looks like
>>>>
>>>> VAR okToFree : BOOLEAN; BEGIN
>>>>
>>>>  args := GetPair(); ...
>>>>  result := EvalPrimitive(args, (*VAR OUT*) okToFree);
>>>>
>>>>  IF okToFree THEN ReturnPair(args) END;
>>>>  RETURN result
>>>> END
>>>>
>>>> and this does work well.  In fact it speeds up the Linux
>>>> implementation
>>>> by almost 100% to recycle the lists like this *just* for the
>>>> evaluation of Scheme primitives.
>>>>
>>>> But it's still ugly, isn't it?  There's a mutex, and a global
>>>> variable.  And yes, the time spent messing with the mutex is
>>>> noticeable, and I haven't even made the code multi-threaded yet
>>>> (and that is coming!)
>>>>
>>>> So I'm thinking, what I really want is a structure that is attached
>>>> to my current Thread.T.  I want to be able to access just a single
>>>> pointer (like the free list) but be sure it is unique to my current
>>>> thread.  No locking would be necessary if I could do this.
>>>>
>>>> Does anyone have an elegant solution that does something like this?
>>>> Thread-specific "static" variables?  Just one REFANY would be enough
>>>> for a lot of uses...  seems to me this should be a frequently
>>>> occurring problem?
>>>>
>>>>    Best regards,
>>>>      Mika
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>