[M3devel] M3 programming problem : GC efficiency / per-thread storage areas?

Tony Hosking hosking at cs.purdue.edu
Tue Oct 21 17:17:24 CEST 2008


Also, turn off assertions.

On 21 Oct 2008, at 15:54, Tony Hosking wrote:

> I have one more question that I forgot to ask before.  Did you  
> evaluate performance with -O3 optimization in the backend?
>
> Generally, I have the following in my m3_backend specs so that  
> turning on optimization results in -O3 (and lots of lovely inlining):
>
> proc m3_backend (source, object, optimize, debug) is
>  local args =
>  [
>    "-m32",
>    "-quiet",
>    source,
>    "-o",
>    object,
>    % fPIC really is needed here, despite man gcc saying it is
>    % the default.  This is because man gcc is about Apple's gcc
>    % but m3cg is built from FSF source.
>    "-fPIC",
>    "-fno-reorder-blocks"
>  ]
>  if optimize  args += "-O3"  end
>  if debug     args += "-gstabs"  end
>  if M3_PROFILING args += "-p" end
>  return try_exec (m3back, args)
> end
>
>
> On 17 Oct 2008, at 09:32, Mika Nystrom wrote:
>
>> Ok I am sorry I am slow to pick up on this.
>>
>> I take it the problem is actually the Upthread.getspecific routine,
>> which itself calls something like get_curthread somewhere inside
>> pthreads, which in turn involves a context switch to the supervisor:
>> the identity of the current thread is just not accessible anywhere
>> in user space.  That also explains why this program runs faster with
>> my old PM3, which uses longjmp threads.
>>
>> The only way to avoid it (really) is to pass a pointer to the
>> Thread.T of the currently executing thread in the activation record
>> of *every* procedure, so that allocators can find it when  
>> necessary....
>> but that is very expensive in terms of stack memory.
>>
>> Or I can just make a structure like that that I pass around where
>> I need it in my own program.  Thread-specific and user-managed.
>>
>> I believe I have just answered all my own questions, but I hope
>> Tony will correct me if my answers are incorrect.
>>
>>   Mika
>>
>> Tony Hosking writes:
>>> I suspect part of the overhead of allocation in the new code is the
>>> need for thread-local allocation buffers, which means we need to
>>> access thread-local state.  We really need an efficient way to do
>>> that, but pthreads thread-local accesses may be what is killing you.
>>>
>>> On 17 Oct 2008, at 00:30, Mika Nystrom wrote:
>>>
>>>> Hi Tony,
>>>>
>>>> I figured you would chime in!
>>>>
>>>> Yes, @M3noincremental seems to make things consistently a tad
>>>> slower (a very small difference), on both FreeBSD and Linux.
>>>> @M3nogc makes a bigger difference, of course.
>>>>
>>>> Unfortunately I seem to have lost the code that did a lot of memory
>>>> allocations.  My tricks (as described in the email---and others!)
>>>> have removed most of the troublesome memory allocations, but now
>>>> I'm stuck with the mutex instead...
>>>>
>>>>    Mika
>>>>
>>>> Tony Hosking writes:
>>>>> Have you tried running @M3noincremental?
>>>>>
>>>>> On 16 Oct 2008, at 23:32, Mika Nystrom wrote:
>>>>>
>>>>>> Hello Modula-3 people,
>>>>>>
>>>>>> As I mentioned in an earlier email about printing structures  
>>>>>> (thanks
>>>>>> Darko), I'm in the midst of coding an interpreter embedded in
>>>>>> Modula-3.  It's a Scheme interpreter, loosely based on Peter
>>>>>> Norvig's JScheme for Java (well, it was strongly based at
>>>>>> first, but it has become looser and looser, if you know what I
>>>>>> mean...)
>>>>>>
>>>>>> I expected that the performance of the interpreter would be much
>>>>>> better in Modula-3 than in Java, and I have been testing on two
>>>>>> different systems.  One is my ancient FreeBSD-4.11 with an old
>>>>>> PM3, and the other is CM3 on a recent Debian system.  What I am
>>>>>> finding is that it is indeed much faster than JScheme on
>>>>>> FreeBSD/PM3 (getting close to ten times as fast on some tasks at
>>>>>> this point), but on Linux/CM3 it is much closer in speed to
>>>>>> JScheme than I would like.
>>>>>>
>>>>>> When I started, with code that was essentially equivalent to
>>>>>> JScheme, I found that it was a bit slower than JScheme on
>>>>>> Linux/CM3 and possibly 2x as fast on FreeBSD/PM3.  On Linux/CM3,
>>>>>> it appears to spend most of its time in (surprise, surprise!)
>>>>>> memory allocation and garbage collection.  The speedup I have
>>>>>> achieved between the first implementation and now was due to the
>>>>>> use of Modula-3 constructs that are superior to Java's, such as
>>>>>> the use of arrays of RECORDs to make small stacks rather than
>>>>>> linked lists.  (I get readable code with far fewer memory
>>>>>> allocations and less GC work.)
>>>>>>
>>>>>> Now, since this is an interpreter, I as the implementer have
>>>>>> limited control over how much memory is allocated and freed, and
>>>>>> where it is needed.  However, I can sometimes fall back on
>>>>>> C-style memory management, but I would like to do it in a safe
>>>>>> way.  For instance, I have special-cased the evaluation of
>>>>>> Scheme primitives, as follows.
>>>>>>
>>>>>> Under the "normal" implementation, a list of things to evaluate
>>>>>> is built up, passed to an evaluation function, and then the GC
>>>>>> is left to sweep up the mess.  The problem is that there are
>>>>>> various tricky routes by which references can escape the
>>>>>> evaluator, so you can't just assume that what you put in is
>>>>>> going to be dead right after an eval and free it.  Instead, I
>>>>>> set a flag in the evaluator, which is TRUE if it is OK to free
>>>>>> the list after the eval and FALSE if it's unclear (in which
>>>>>> case the problem is left up to the GC).
>>>>>>
>>>>>> For the vast majority of Scheme primitives, one can indeed free
>>>>>> the list right after the eval.  Now of course I am not
>>>>>> interested in unsafe code, so what I do is this:
>>>>>>
>>>>>> TYPE Pair = OBJECT first, rest : REFANY; END;
>>>>>>
>>>>>> VAR
>>>>>>   mu          := NEW(MUTEX);
>>>>>>   free : Pair := NIL;
>>>>>>
>>>>>> PROCEDURE GetPair() : Pair =
>>>>>>   BEGIN
>>>>>>     LOCK mu DO
>>>>>>       IF free # NIL THEN
>>>>>>         TRY
>>>>>>           RETURN free
>>>>>>         FINALLY
>>>>>>           free := free.rest
>>>>>>         END
>>>>>>       END
>>>>>>     END;
>>>>>>     RETURN NEW(Pair)
>>>>>>   END GetPair;
>>>>>>
>>>>>> PROCEDURE ReturnPair(cons : Pair) =
>>>>>>   BEGIN
>>>>>>     cons.first := NIL;
>>>>>>     LOCK mu DO
>>>>>>       cons.rest := free;
>>>>>>       free := cons
>>>>>>     END
>>>>>>   END ReturnPair;
>>>>>>
>>>>>> My eval code looks like:
>>>>>>
>>>>>> VAR okToFree : BOOLEAN;
>>>>>> BEGIN
>>>>>>   args := GetPair(); ...
>>>>>>   result := EvalPrimitive(args, (*VAR OUT*) okToFree);
>>>>>>
>>>>>>   IF okToFree THEN ReturnPair(args) END;
>>>>>>   RETURN result
>>>>>> END
>>>>>>
>>>>>> and this does work well.  In fact, recycling the lists like this
>>>>>> *just* for the evaluation of Scheme primitives speeds up the
>>>>>> Linux implementation by almost 100%.
>>>>>>
>>>>>> But it's still ugly, isn't it?  There's a mutex, and a global
>>>>>> variable.  And yes, the time spent messing with the mutex is
>>>>>> noticeable, and I haven't even made the code multi-threaded yet
>>>>>> (and that is coming!)
>>>>>>
>>>>>> So I'm thinking, what I really want is a structure that is
>>>>>> attached to my current Thread.T.  I want to be able to access
>>>>>> just a single pointer (like the free list) but be sure it is
>>>>>> unique to my current thread.  No locking would be necessary if I
>>>>>> could do this.
>>>>>>
>>>>>> Does anyone have an elegant solution that does something like
>>>>>> this?  Thread-specific "static" variables?  Just one REFANY
>>>>>> would be enough for a lot of uses...  It seems to me this should
>>>>>> be a frequently occurring problem?
>>>>>>
>>>>>>  Best regards,
>>>>>>    Mika




More information about the M3devel mailing list