[M3devel] M3 programming problem : GC efficiency / per-thread storage areas?

Mika Nystrom mika at async.caltech.edu
Fri Oct 17 08:32:15 CEST 2008


Well, I was thinking of something even simpler.  A Thread.T is an
OBJECT.  It's garbage collected just like any other object, is it
not?  

Why can't the thing that makes new threads simply include a single
globally visible field in every Thread.T, of type REFANY?  Call it "data".

Then you can always manipulate Thread.Self().data as you see fit
without any need for locks.  There can be no problem with this as
long as it is always manipulated from within that thread.
Of course this can be trivially encapsulated by not revealing "data"
itself and always accessing it through Thread.Self().data.
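
Concretely, I imagine something like the following; the "ThreadData"
name is invented, and it assumes Thread.T really were revealed to
carry such a field, which today it is not:

  INTERFACE ThreadData;

  PROCEDURE Get() : REFANY;
  (* the calling thread's private slot, NIL if never set *)

  PROCEDURE Set(r : REFANY);
  (* overwrite the calling thread's private slot *)

  END ThreadData.

  MODULE ThreadData;

  IMPORT Thread;

  (* relies on the hypothetical "data : REFANY" field on Thread.T *)

  PROCEDURE Get() : REFANY =
    BEGIN
      (* no lock: only the calling thread ever touches its own slot *)
      RETURN Thread.Self().data
    END Get;

  PROCEDURE Set(r : REFANY) =
    BEGIN
      Thread.Self().data := r
    END Set;

  BEGIN
  END ThreadData.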

You would not normally access this from any other thread.  It's indeed
only meant to be used in the idiom

  x := Allocate();
  TRY
    DoSomething(x)
  FINALLY
    Free(x)
  END

It's also not really a "Free" but just returning the object to a free
list (there can be no unsafe behavior here).
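
On top of that hypothetical ThreadData, the pair pool from my code
quoted below would lose both the mutex and the global; roughly:

  TYPE Pair = OBJECT first, rest : REFANY; END;

  PROCEDURE Allocate() : Pair =
    BEGIN
      TYPECASE ThreadData.Get() OF
      | NULL => RETURN NEW(Pair)
      | Pair (p) =>
          ThreadData.Set(p.rest);  (* pop this thread's free list *)
          p.rest := NIL;
          RETURN p
      ELSE
          RETURN NEW(Pair)         (* slot already used for something else *)
      END
    END Allocate;

  PROCEDURE Free(p : Pair) =
    (* call only from the thread that called Allocate *)
    BEGIN
      p.first := NIL;
      p.rest := ThreadData.Get();  (* push onto this thread's free list *)
      ThreadData.Set(p)
    END Free;

Each thread keeps its own list, so there is nothing for another
thread to race with.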

As a "nicer" interface, one could register routines with a public
interface, asking it to manufacture some kind of thread globals.
For maximum sanity, they would be visible inside the MODULE that
requested them, but I'm not sure how to accomplish this.  And of
course there's not much point in any of this unless it can be made
efficient; otherwise a mutex plus a true global will work just as well.
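
The registration interface I have in mind would be roughly this
(again, all of the names are invented):

  INTERFACE ThreadLocal;

  TYPE Initializer = PROCEDURE () : REFANY;

  PROCEDURE Register(init : Initializer) : CARDINAL;
  (* called once by each interested MODULE; returns a slot number *)

  PROCEDURE Get(slot : CARDINAL) : REFANY;
  (* the calling thread's value for "slot", created on first use in
     that thread by the registered initializer *)

  END ThreadLocal.

Underneath it still needs some real per-thread storage (the "data"
field above, or a table keyed on Thread.Self() behind a lock), so by
itself it doesn't answer the efficiency question.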

What I'm talking about could, I guess, be done by hacking up Thread.Fork()
to return a subtype of Thread.T, though that won't work for the first
thread.  With this method you could have arbitrary fields (and
methods) attached to a Thread.T.  How to collect everything you need
is a different story...
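
Just to show the shape of it, with an invented MyThread subtype, and
ignoring how Fork would be persuaded to allocate one:

  TYPE MyThread = Thread.T OBJECT data : REFANY END;

  PROCEDURE MyData() : REFANY =
    BEGIN
      TYPECASE Thread.Self() OF
      | MyThread (t) => RETURN t.data
      ELSE RETURN NIL  (* the first thread never went through our Fork *)
      END
    END MyData;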

I'm not asking for a new language feature... I really was just wondering
if anyone had tried anything like this before, and now I am rambling a
bit.
 
     Mika

Jay writes:
>
>Making this per-thread is a classic, worthwhile improvement.
>
>You need to worry about what happens with many threads, and being sure to
>clean up when a thread dies, and allowing for a free to come in from any thread.
>
>A good way to mitigate all those problems is to use a small fixed-size cache
>instead of per-thread storage, including an array of mutexes.
>
>If "thread ids" have adequate distribution, just use their lower bits as an array index. If not, have a glo
>bal counter that gets assigned into the thread on first use per-thread.
>
>The cache could also be more than one element.
>
>How do you manage okToFree?
>
>Windows has __declspec(thread), which is an optimized form of
>TlsGetValue/TlsSetValue, but it doesn't work with dynamically loaded .dlls
>before Vista, and isn't __declspec(fiber) like maybe it should be.
> 
> - Jay
>
>----------------------------------------
>> To: hosking at cs.purdue.edu
>> Date: Thu, 16 Oct 2008 16:30:01 -0700
>> From: mika at async.caltech.edu
>> CC: m3devel at elegosoft.com; mika at camembert.async.caltech.edu
>> Subject: Re: [M3devel] M3 programming problem : GC efficiency / per-thread storage areas?
>> 
>> Hi Tony,
>> 
>> I figured you would chime in!
>> 
>> Yes, @M3noincremental seems to make things consistently a tad bit
>> slower (but a very small difference), on both FreeBSD and Linux.
>> @M3nogc makes a bigger difference, of course.
>> 
>> Unfortunately I seem to have lost the code that did a lot of memory
>> allocations.  My tricks (as described in the email---and others!)
>> have removed most of the troublesome memory allocations, but now
>> I'm stuck with the mutex instead...
>> 
>>       Mika
>> 
>> Tony Hosking writes:
>>>Have you tried running @M3noincremental?
>>>
>>>On 16 Oct 2008, at 23:32, Mika Nystrom wrote:
>>>
>>>> Hello Modula-3 people,
>>>>
>>>> As I mentioned in an earlier email about printing structures (thanks
>>>> Darko), I'm in the midst of coding an interpreter embedded in
>>>> Modula-3.  It's a Scheme interpreter, loosely based on Peter Norvig's
>>>> JScheme for Java (well it was at first strongly based, but more and
>>>> more loosely, if you know what I mean...)
>>>>
>>>> I expected that the performance of the interpreter would be much
>>>> better in Modula-3 than in Java, and I have been testing on two
>>>> different systems.  One is my ancient FreeBSD-4.11 with an old PM3,
>>>> and the other is CM3 on a recent Debian system.  What I am finding
>>>> is that it is indeed much faster than JScheme on FreeBSD/PM3 (getting
>>>> close to ten times as fast on some tasks at this point), but on
>>>> Linux/CM3 it is much closer in speed to JScheme than I would like.
>>>>
>>>> When I started, with code that was essentially equivalent to JScheme,
>>>> I found that it was a bit slower than JScheme on Linux/CM3 and
>>>> possibly 2x as fast on FreeBSD/PM3.  On Linux/CM3, it appears to
>>>> spend most of its time in (surprise, surprise!) memory allocation
>>>> and garbage collection.  The speedup I have achieved between the
>>>> first implementation and now was due to the use of Modula-3 constructs
>>>> that are superior to Java's, such as the use of arrays of RECORDs
>>>> to make small stacks rather than linked lists.  (I get readable
>>>> code with far fewer memory allocations and much less GC work.)
>>>>
>>>> Now, since this is an interpreter, I as the implementer have limited
>>>> control over how much memory is allocated and freed, and where it is
>>>> needed.  However, I can sometimes fall back on C-style memory management,
>>>> but I would like to do it in a safe way.  For instance, I have special-cased
>>>> evaluation of Scheme primitives, as follows.
>>>>
>>>> Under the "normal" implementation, a list of things to evaluate is
>>>> built up, passed to an evaluation function, and then the GC is left
>>>> to sweep up the mess.  The problem is that there are various tricky
>>>> routes by which references can escape the evaluator, so you can't
>>>> just assume that what you put in is going to be dead right after
>>>> an eval and free it.  Instead, I set a flag in the evaluator, which
>>>> is TRUE if it is OK to free the list after the eval and FALSE if
>>>> it's unclear (in which case the problem is left up to the GC).
>>>>
>>>> For the vast majority of Scheme primitives, one can indeed free the
>>>> list right after the eval.  Now of course I am not interested
>>>> in unsafe code, so what I do is this:
>>>>
>>>> TYPE Pair = OBJECT first, rest : REFANY; END;
>>>>
>>>> VAR
>>>>  mu := NEW(MUTEX);
>>>>  free : Pair := NIL;
>>>>
>>>> PROCEDURE GetPair() : Pair =
>>>>  BEGIN
>>>>    LOCK mu DO
>>>>      IF free # NIL THEN
>>>>        TRY
>>>>          RETURN free
>>>>        FINALLY
>>>>          free := free.rest
>>>>        END
>>>>      END
>>>>    END;
>>>>    RETURN NEW(Pair)
>>>>  END GetPair;
>>>>
>>>> PROCEDURE ReturnPair(cons : Pair) =
>>>>  BEGIN
>>>>    cons.first := NIL;
>>>>    LOCK mu DO
>>>>      cons.rest := free;
>>>>      free := cons
>>>>    END
>>>>  END ReturnPair;
>>>>
>>>> My eval code looks like this:
>>>>
>>>> VAR okToFree : BOOLEAN; BEGIN
>>>>
>>>>   args := GetPair(); ...
>>>>   result := EvalPrimitive(args, (*VAR OUT*) okToFree);
>>>>
>>>>   IF okToFree THEN ReturnPair(args) END;
>>>>   RETURN result
>>>> END
>>>>
>>>> and this does work well.  In fact it speeds up the Linux implementation
>>>> by almost 100% to recycle the lists like this *just* for the
>>>> evaluation of Scheme primitives.
>>>>
>>>> But it's still ugly, isn't it?  There's a mutex, and a global
>>>> variable.  And yes, the time spent messing with the mutex is
>>>> noticeable, and I haven't even made the code multi-threaded yet
>>>> (and that is coming!)
>>>>
>>>> So I'm thinking, what I really want is a structure that is attached
>>>> to my current Thread.T.  I want to be able to access just a single
>>>> pointer (like the free list) but be sure it is unique to my current
>>>> thread.  No locking would be necessary if I could do this.
>>>>
>>>> Does anyone have an elegant solution that does something like this?
>>>> Thread-specific "static" variables?  Just one REFANY would be enough
>>>> for a lot of uses...  seems to me this should be a frequently
>>>> occurring problem?
>>>>
>>>>     Best regards,
>>>>       Mika