[M3devel] M3 programming problem : GC efficiency / per-thread storage areas?

Tue Oct 21 16:54:58 CEST 2008

I have one more question that I forgot to ask before.  Did you  
evaluate performance with -O3 optimization in the backend?

Generally, I have the following in my m3_backend specs so that turning  
on optimization results in -O3 (and lots of lovely inlining):

proc m3_backend (source, object, optimize, debug) is
   local args =
   [
     "-m32",
     "-quiet",
     source,
     "-o",
     object,
     % -fPIC really is needed here, despite man gcc saying it is the default.
     % This is because man gcc is about Apple's gcc but m3cg is
     % built from FSF source.
     "-fPIC",
     "-fno-reorder-blocks"
   ]
   if optimize  args += "-O3"  end
   if debug     args += "-gstabs"  end
   if M3_PROFILING args += "-p" end
   return try_exec (m3back, args)
end

On 17 Oct 2008, at 09:32, Mika Nystrom wrote:

> Ok I am sorry I am slow to pick up on this.
>
> I take it the problem is actually the Upthread.getspecific routine,
> which itself calls something like get_curthread somewhere inside pthreads,
> which in turn involves a context switch to the supervisor---the identity
> of the current thread is just not accessible anywhere in user space.
> This also explains why this program runs faster with my old PM3, which
> uses longjmp threads.
>
> The only way to avoid it (really) is to pass a pointer to the
> Thread.T of the currently executing thread in the activation record
> of *every* procedure, so that allocators can find it when necessary....
> but that is very expensive in terms of stack memory.
>
> Or I can just make a structure like that that I pass around where
> I need it in my own program.  Thread-specific and user-managed.
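
A minimal sketch of that last idea, assuming one interpreter context per
thread.  The names PairPool and Pool and the explicit pool parameter are
hypothetical and do not come from the code quoted in this thread; the free
list simply moves out of a global guarded by a MUTEX into a record that
each thread owns and passes down to the evaluator:

```modula3
MODULE PairPool EXPORTS Main;

TYPE
  Pair = OBJECT first, rest: REFANY END;

  (* Hypothetical per-thread context.  Each thread creates its own Pool
     and never shares it, so no MUTEX is needed around the free list. *)
  Pool = REF RECORD free: Pair := NIL END;

PROCEDURE GetPair (pool: Pool): Pair =
  VAR p := pool.free;
  BEGIN
    IF p = NIL THEN RETURN NEW(Pair) END;
    pool.free := p.rest;   (* implicit NARROW from REFANY to Pair *)
    p.rest := NIL;         (* do not leak the free list to the caller *)
    RETURN p
  END GetPair;

PROCEDURE ReturnPair (pool: Pool; cons: Pair) =
  BEGIN
    cons.first := NIL;
    cons.rest := pool.free;
    pool.free := cons
  END ReturnPair;

VAR
  pool := NEW(Pool);
  a, b: Pair;
BEGIN
  a := GetPair(pool);      (* free list is empty, so this allocates *)
  ReturnPair(pool, a);
  b := GetPair(pool);      (* recycled, no allocation *)
  <* ASSERT a = b *>
END PairPool.
```

The cost is an extra parameter on every procedure that may need a Pair,
which is the trade-off described above, only confined to the interpreter
rather than imposed on every activation record.
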
>
> I believe I have just answered all my own questions, but I hope
> Tony will correct me if my answers are incorrect.
>
>    Mika
>
> Tony Hosking writes:
>> I suspect part of the overhead of allocation in the new code is the
>> need for thread-local allocation buffers, which means we need to
>> access thread-local state.  We really need an efficient way to do
>> that, but pthreads thread-local accesses may be what is killing you.
>>
>> On 17 Oct 2008, at 00:30, Mika Nystrom wrote:
>>
>>> Hi Tony,
>>>
>>> I figured you would chime in!
>>>
>>> Yes, @M3noincremental seems to make things consistently a tad bit
>>> slower (but a very small difference), on both FreeBSD and Linux.
>>> @M3nogc makes a bigger difference, of course.
>>>
>>> Unfortunately I seem to have lost the code that did a lot of memory
>>> allocations.  My tricks (as described in the email---and others!)
>>> have removed most of the troublesome memory allocations, but now
>>> I'm stuck with the mutex instead...
>>>
>>>     Mika
>>>
>>> Tony Hosking writes:
>>>> Have you tried running @M3noincremental?
>>>>
>>>> On 16 Oct 2008, at 23:32, Mika Nystrom wrote:
>>>>
>>>>> Hello Modula-3 people,
>>>>>
>>>>> As I mentioned in an earlier email about printing structures (thanks
>>>>> Darko), I'm in the midst of coding an interpreter embedded in
>>>>> Modula-3.  It's a Scheme interpreter, loosely based on Peter Norvig's
>>>>> JScheme for Java (well it was at first strongly based, but more and
>>>>> more loosely, if you know what I mean...)
>>>>>
>>>>> I expected that the performance of the interpreter would be much
>>>>> better in Modula-3 than in Java, and I have been testing on two
>>>>> different systems.  One is my ancient FreeBSD-4.11 with an old PM3,
>>>>> and the other is CM3 on a recent Debian system.  What I am finding
>>>>> is that it is indeed much faster than JScheme on FreeBSD/PM3 (getting
>>>>> close to ten times as fast on some tasks at this point), but on
>>>>> Linux/CM3 it is much closer in speed to JScheme than I would like.
>>>>>
>>>>> When I started, with code that was essentially equivalent to JScheme,
>>>>> I found that it was a bit slower than JScheme on Linux/CM3 and
>>>>> possibly 2x as fast on FreeBSD/PM3.  On Linux/CM3, it appears to
>>>>> spend most of its time in (surprise, surprise!) memory allocation
>>>>> and garbage collection.  The speedup I have achieved between the
>>>>> first implementation and now was due to the use of Modula-3 constructs
>>>>> that are superior to Java's, such as the use of arrays of RECORDs
>>>>> to make small stacks rather than linked lists.  (I get readable
>>>>> code with far fewer memory allocations and much less GC work.)
>>>>>
>>>>> Now, since this is an interpreter, I as the implementer have limited
>>>>> control over how much memory is allocated and freed, and where it is
>>>>> needed.  However, I can sometimes fall back on C-style memory
>>>>> management, but I would like to do it in a safe way.  For instance,
>>>>> I have special-cased evaluation of Scheme primitives, as follows.
>>>>>
>>>>> Under the "normal" implementation, a list of things to evaluate is
>>>>> built up, passed to an evaluation function, and then the GC is left
>>>>> to sweep up the mess.  The problem is that there are various tricky
>>>>> routes by which references can escape the evaluator, so you can't
>>>>> just assume that what you put in is going to be dead right after
>>>>> an eval and free it.  Instead, I set a flag in the evaluator, which
>>>>> is TRUE if it is OK to free the list after the eval and FALSE if
>>>>> it's unclear (in which case the problem is left up to the GC).
>>>>>
>>>>> For the vast majority of Scheme primitives, one can indeed free the
>>>>> list right after the eval.  Now of course I am not interested
>>>>> in unsafe code, so what I do is this:
>>>>>
>>>>> TYPE Pair = OBJECT first, rest : REFANY; END;
>>>>>
>>>>> VAR
>>>>> mu := NEW(MUTEX);
>>>>> free : Pair := NIL;
>>>>>
>>>>> PROCEDURE GetPair() : Pair =
>>>>> BEGIN
>>>>>  LOCK mu DO
>>>>>    IF free # NIL THEN
>>>>>      TRY
>>>>>        RETURN free
>>>>>      FINALLY
>>>>>        free := free.rest
>>>>>      END
>>>>>    END
>>>>>  END;
>>>>>  RETURN NEW(Pair)
>>>>> END GetPair;
>>>>>
>>>>> PROCEDURE ReturnPair(cons : Pair) =
>>>>> BEGIN
>>>>>  cons.first := NIL;
>>>>>  LOCK mu DO
>>>>>    cons.rest := free;
>>>>>    free := cons
>>>>>  END
>>>>> END ReturnPair;
>>>>>
>>>>> my eval code looks like
>>>>>
>>>>> VAR okToFree : BOOLEAN; BEGIN
>>>>>
>>>>> args := GetPair(); ...
>>>>> result := EvalPrimitive(args, (*VAR OUT*) okToFree);
>>>>>
>>>>> IF okToFree THEN ReturnPair(args) END;
>>>>> RETURN result
>>>>> END
>>>>>
>>>>> and this does work well.  In fact it speeds up the Linux
>>>>> implementation by almost 100% to recycle the lists like this *just*
>>>>> for the evaluation of Scheme primitives.
>>>>>
>>>>> But it's still ugly, isn't it?  There's a mutex, and a global
>>>>> variable.  And yes, the time spent messing with the mutex is
>>>>> noticeable, and I haven't even made the code multi-threaded yet
>>>>> (and that is coming!)
>>>>>
>>>>> So I'm thinking, what I really want is a structure that is attached
>>>>> to my current Thread.T.  I want to be able to access just a single
>>>>> pointer (like the free list) but be sure it is unique to my current
>>>>> thread.  No locking would be necessary if I could do this.
>>>>>
>>>>> Does anyone have an elegant solution that does something like this?
>>>>> Thread-specific "static" variables?  Just one REFANY would be enough
>>>>> for a lot of uses...  seems to me this should be a frequently
>>>>> occurring problem?
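
One way to approximate a thread-specific variable using only the standard
Thread interface is to keep the state in the Thread.Closure subtype handed
to Thread.Fork, since apply receives that object as self.  The following is
a sketch under that assumption; Worker, Run and the free field are
hypothetical names, and the state still has to be passed on explicitly to
whatever the thread calls:

```modula3
MODULE ClosureState EXPORTS Main;

IMPORT Thread;

TYPE
  Pair = OBJECT first, rest: REFANY END;

  (* Hypothetical closure subtype.  Each forked thread gets its own
     instance, so the free field is touched by one thread only. *)
  Worker = Thread.Closure OBJECT
    free: Pair := NIL;
  OVERRIDES
    apply := Run;
  END;

PROCEDURE Run (self: Worker): REFANY =
  VAR p: Pair;
  BEGIN
    (* Take a Pair from this thread's own list, without locking. *)
    IF self.free # NIL THEN
      p := self.free;
      self.free := p.rest      (* implicit NARROW from REFANY to Pair *)
    ELSE
      p := NEW(Pair)
    END;

    (* ... evaluate using p, handing self to callees that need the list ... *)

    (* Give the Pair back to this thread's list. *)
    p.first := NIL;
    p.rest := self.free;
    self.free := p;
    RETURN NIL
  END Run;

VAR
  t1 := Thread.Fork(NEW(Worker));
  t2 := Thread.Fork(NEW(Worker));
BEGIN
  EVAL Thread.Join(t1);
  EVAL Thread.Join(t2)
END ClosureState.
```

This does not give a global "current thread" lookup; it only avoids the
lookup by making the per-thread object the root from which the thread's
work starts.
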
>>>>>
>>>>>   Best regards,
>>>>>     Mika
>>>>>