[M3devel] userthreads vs. pthreads performance?

Tony Hosking hosking at cs.purdue.edu
Mon Mar 29 16:42:17 CEST 2010


Guys,

You are looking in the wrong place.  The great thing about having a language-defined thread model is that we get to implement the thread primitives in the language run-time, where we can take advantage of language-specific semantics.  There is no requirement that a Modula-3 mutex even map to a pthread_mutex_t.  Java monitors are implemented in modern JVMs so that lock/unlock in the common case doesn't require any calls or any atomic operations!  We can do much the same for Modula-3.  My group at Purdue has done a fair amount of work on this in the context of Java, and when I can find the time I hope to do the same for Modula-3.  The only requirement for M3 is that we have proper support for atomic ops in the language upon which to build the synchronisation primitives.  We are close to having this now.  Let me sketch a design:

The common case is that a mutex is only ever manipulated by one thread (i.e., never shared), in which case it suffices to "bias" the mutex to that thread.  Locking/unlocking is then simply a matter of checking that the bias belongs to the current thread and setting a bit in the mutex; no atomic operation is needed.  If another thread comes along and needs to lock the mutex then it must first revoke the bias of the owning thread.  Revocation can be expensive (we assume it occurs infrequently), and in our case it probably means stopping the thread holding the bias, revoking the bias, then restarting it.
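
To make that fast path concrete, here is a rough C sketch of the biased case.  It assumes a one-word lock holding the biasing thread's id plus a locked bit; the names (m3_lockword_t, revoke_bias_and_lock) and the exact layout are purely illustrative, not anything that exists in the runtime today:

  #include <stdint.h>

  typedef uintptr_t m3_lockword_t;           /* bias owner id | LOCKED bit             */
  #define LOCKED ((m3_lockword_t)1)          /* owner ids assumed to have low bit clear */

  extern void revoke_bias_and_lock(volatile m3_lockword_t *w, uintptr_t self);

  void biased_lock(volatile m3_lockword_t *w, uintptr_t self)
  {
    if ((*w & ~LOCKED) == self)
      *w = self | LOCKED;                    /* plain store: only the bias owner runs this     */
    else
      revoke_bias_and_lock(w, self);         /* rare, expensive: stop owner, revoke, restart   */
  }

  void biased_unlock(volatile m3_lockword_t *w, uintptr_t self)
  {
    *w = self;                               /* clear the LOCKED bit, keep the bias            */
  }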

Another case is when a mutex is locked/unlocked by multiple threads but there is never contention (i.e., no thread tries to acquire while another thread holds it).  In this case we never need a wait queue for the mutex, so we can simply store the lock owner in the mutex and test it using atomic ops.  Spinning is often useful here to avoid inflating when contention does arise: if the thread holding the lock gets out of the mutex quickly then the spinner can move in quickly.  After some number of spins we generally need to inflate the lock and allocate a wait queue (to avoid hogging the processor).
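
A similarly rough sketch of the unbiased, uncontended case using C11 atomics; SPIN_LIMIT, the zero-means-unlocked encoding, and inflate_and_lock are placeholders:

  #include <stdatomic.h>
  #include <stdint.h>

  #define SPIN_LIMIT 64

  extern void inflate_and_lock(_Atomic uintptr_t *w, uintptr_t self);

  void thin_lock(_Atomic uintptr_t *w, uintptr_t self)
  {
    for (int i = 0; i < SPIN_LIMIT; i++) {
      uintptr_t expected = 0;                /* 0 = unlocked, no owner           */
      if (atomic_compare_exchange_weak(w, &expected, self))
        return;                              /* we now own the thin lock         */
    }
    inflate_and_lock(w, self);               /* allocate a wait queue and block  */
  }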

Finally, there is the case where many threads are contending on the inflated lock (with its wait queue).  The only question now is when to deflate.  Our current heuristic is to deflate when the last thread releasing the lock notices that there are no other threads waiting.  This seems to work well in practice, but of course there are pathological cases.
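
And a sketch of that deflation heuristic on release; struct inflated_lock, queue_empty, wake_one and free_inflated are again just placeholders for the wait-queue machinery, and a real implementation has to handle the race with a thread arriving just as we deflate:

  #include <stdatomic.h>
  #include <stdint.h>

  struct inflated_lock;                      /* wait queue + pthread condition, elided        */
  extern int  queue_empty(struct inflated_lock *l);
  extern void wake_one(struct inflated_lock *l);
  extern void free_inflated(struct inflated_lock *l);

  void inflated_unlock(struct inflated_lock *l, _Atomic uintptr_t *w)
  {
    if (queue_empty(l)) {
      atomic_store(w, 0);                    /* no waiters: deflate to a plain thin lock word */
      free_inflated(l);
    } else {
      wake_one(l);                           /* hand the lock to a waiting thread             */
    }
  }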

Note that in no case have I mentioned the need for a pthread_mutex (though pthread locks/conditions are used to manage threads that must block on a Java wait queue).

We ought to be able to do much the same in Modula-3.

Antony Hosking | Associate Professor | Computer Science | Purdue University
305 N. University Street | West Lafayette | IN 47907 | USA
Office +1 765 494 6001 | Mobile +1 765 427 5484




On 29 Mar 2010, at 04:24, Jay K wrote:

>  > We can optimize this somewhat on most systems with __thread. I had that in briefly
>  
>  
> I did a little bit of testing here.
>  
>  
> __thread takes a few forms, depending on which of -fPIC and -shared you use.
>  
> Once you throw in -fPIC and -shared, I have found __thread to be significantly slower on Solaris/sparc and Linux/powerpc32, slower or a wash on Linux/amd64, and twice as fast as pthread_getspecific on Linux/x86.
> It doesn't appear to be supported at all on Darwin, though pthread_getspecific is very fast there (albeit not inlined).
> I didn't test *BSD.
> My testing was not very scientific.
> However, -fPIC and/or -shared imply a function call to access __thread variables.
> That's probably the big factor. Without -fPIC/-shared, there is no function call.
>  
>  
> If you are going to access them multiple times in a function then there is probably an optimization to be had -- caching their address. If the variables are larger than a pointer, there is probably also an optimization to be had.
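>
> A tiny sketch of that address-caching idea (the variable and function names are made up); under -fPIC/-shared each direct access may go through a call such as __tls_get_addr on ELF targets, so taking the address once per function amortizes it:
>
>   static __thread int counters[4];         /* illustrative thread-local data       */
>
>   void bump_all(void)
>   {
>     int *c = counters;                     /* one TLS address computation ...      */
>     for (int i = 0; i < 4; i++)
>       c[i]++;                              /* ... instead of one per access        */
>   }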
>  
>  
> We could certainly do that first thing -- PushFrame could return the address and PopFrame would be much faster.
> However, another angle here is to eliminate PushFrame/PopFrame by using libunwind. I think we should look into libunwind for the next release. The thread locals will (mostly) remain, but accesses to them will greatly decline.
>  
>  
> We compile everything -fPIC and -shared.
> Some systems (libtool) compile things once that way and once not, providing pairs of libraries, depending on intended use.
>  
>  
> I should point out also that userthreads have been greatly deoptimized in the current tree (by me, Tony approved), because they used to inline PushFrame/PopFrame, but they don't any longer.
>  
>  
> (Historically on NT, __declspec(thread) only worked in code in the .exe or statically loaded by the .exe -- that is, not in .dlls loaded with LoadLibrary. However, that limitation was removed in Vista. I expect __declspec(thread) is much faster than TlsGetValue, but I assume we'll support pre-Vista for a while longer, so it's not interesting..)
>  
>  
>  - Jay
> 
>  
> From: jay.krell at cornell.edu
> To: dragisha at m3w.org; mika at async.async.caltech.edu
> CC: m3devel at elegosoft.com
> Subject: RE: [M3devel] userthreads vs. pthreads performance?
> Date: Mon, 29 Mar 2010 03:15:41 +0000
> 
>  > Getting thread locals should not require a kernel call
>  
> Indeed, on Linux/x86 it does not, looks pretty ok:
>  
> 00000380 <__pthread_getspecific>:
>  380:   55                      push   %ebp
>  381:   89 e5                   mov    %esp,%ebp
>  383:   8b 55 08                mov    0x8(%ebp),%edx
>  386:   81 fa ff 03 00 00       cmp    $0x3ff,%edx
>  38c:   76 04                   jbe    392 <__pthread_getspecific+0x12>
>  38e:   5d                      pop    %ebp
>  38f:   31 c0                   xor    %eax,%eax
>  391:   c3                      ret
> 
>  392:   89 d0                   mov    %edx,%eax
>  394:   c1 e8 05                shr    $0x5,%eax
>  397:   8d 0c 85 1c 01 00 00    lea    0x11c(,%eax,4),%ecx
>  39e:   65 8b 01                mov    %gs:(%ecx),%eax
>  3a1:   85 c0                   test   %eax,%eax
>  3a3:   74 e9                   je     38e <__pthread_getspecific+0xe>
>  3a5:   8b 04 d5 00 00 00 00    mov    0x0(,%edx,8),%eax
>  3ac:   85 c0                   test   %eax,%eax
>  3ae:   74 de                   je     38e <__pthread_getspecific+0xe>
>  3b0:   65 8b 01                mov    %gs:(%ecx),%eax
>  3b3:   83 e2 1f                and    $0x1f,%edx
>  3b6:   8b 04 90                mov    (%eax,%edx,4),%eax
>  3b9:   5d                      pop    %ebp
>  3ba:   c3                      ret
> 
> 
> > Entering an uncontended pthread mutex should not be expensive
> 
> Linux/x86:
>  
> 00001020 <__pthread_self>:
>     1020:       55                      push   %ebp
>     1021:       89 e5                   mov    %esp,%ebp
>     1023:       65 a1 50 00 00 00       mov    %gs:0x50,%eax
>     1029:       5d                      pop    %ebp
>     102a:       c3                      ret
>     102b:       90                      nop
>     102c:       8d 74 26 00             lea    0x0(%esi),%esi
>  
>  
> Pretty lame: five instructions where only two are needed.
>  
> 
> 000004f0 <__pthread_mutex_lock>:
> 
> ...too much to read through, but I think there is no kernel call...
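>
> For what it's worth, here is a minimal futex-style lock in the spirit of Drepper's "Futexes Are Tricky" paper, just to illustrate why the uncontended path needs no kernel call. This is only a sketch, not glibc's actual implementation:
>
>   #include <stdatomic.h>
>   #include <linux/futex.h>
>   #include <sys/syscall.h>
>   #include <unistd.h>
>
>   /* lock word states: 0 = unlocked, 1 = locked, 2 = locked with possible waiters */
>
>   void futex_lock(_Atomic int *f)
>   {
>     int c = 0;
>     if (atomic_compare_exchange_strong(f, &c, 1))
>       return;                              /* uncontended: one CAS, no syscall    */
>     if (c != 2)
>       c = atomic_exchange(f, 2);           /* announce that we may wait           */
>     while (c != 0) {
>       syscall(SYS_futex, f, FUTEX_WAIT, 2, NULL, NULL, 0);
>       c = atomic_exchange(f, 2);
>     }
>   }
>
>   void futex_unlock(_Atomic int *f)
>   {
>     if (atomic_exchange(f, 0) == 2)        /* wake only if someone may be waiting */
>       syscall(SYS_futex, f, FUTEX_WAKE, 1, NULL, NULL, 0);
>   }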
>  
>  - Jay
>  
> From: jay.krell at cornell.edu
> To: dragisha at m3w.org; mika at async.async.caltech.edu
> CC: m3devel at elegosoft.com
> Subject: RE: [M3devel] userthreads vs. pthreads performance?
> Date: Sun, 28 Mar 2010 20:46:01 +0000
> 
> O(1) scheduling is not a new idea. Just look at NT and probably Solaris and probably all the other non-free systems (AIX, Irix, HP-UX, Tru64, VMS, etc.)
> 
> Getting thread locals should not require a kernel call. It doesn't on NT. We can optimize this somewhat on most systems with __thread. I had that in briefly.
> 
> Entering an uncontended pthread mutex should not be expensive -- at least no kernel call, but granted a call and atomic op. Two calls because of the C layer.
> But user threads pay for a call too of course.
> 
> Maybe I should profile some of this..
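>
> Something like this would do for a first cut (an entirely unscientific micro-benchmark of the uncontended lock/unlock pair):
>
>   #include <pthread.h>
>   #include <stdio.h>
>   #include <time.h>
>
>   int main(void)
>   {
>     enum { N = 10 * 1000 * 1000 };
>     pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
>     struct timespec t0, t1;
>
>     clock_gettime(CLOCK_MONOTONIC, &t0);
>     for (int i = 0; i < N; i++) {
>       pthread_mutex_lock(&m);
>       pthread_mutex_unlock(&m);
>     }
>     clock_gettime(CLOCK_MONOTONIC, &t1);
>
>     double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
>     printf("%.1f ns per uncontended lock/unlock pair\n", ns / N);
>     return 0;
>   }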
> 
> - Jay
> 
> > From: dragisha at m3w.org
> > To: mika at async.async.caltech.edu
> > Date: Sun, 28 Mar 2010 21:14:57 +0200
> > CC: m3devel at elegosoft.com
> > Subject: Re: [M3devel] userthreads vs. pthreads performance?
> > 
> > I remember reading (a long time ago) about how these (futexes) are
> > efficient on Linux... Can I have your test code to try?
> > 
> > On Sun, 2010-03-28 at 12:11 -0700, Mika Nystrom wrote:
> > > Well I have run programs on PPC_DARWIN and FreeBSD<X> and seen these sorts of things...
> > > 
> > > Dragiša Durić writes:
> > > >Which platform?
> > > >
> > > >On Sun, 2010-03-28 at 11:57 -0700, Mika Nystrom wrote:
> > > >> Yep, sounds right. 
> > > >> 
> > > >> I was profiling some other thread-using code that slowed down
> > > >> enormously
> > > >> because of pthreads and it turned out the program was spending ~95%
> > > >> of its time in accessing the thread locals via one of the pthread_
> > > >> functions.
> > > >> (The overhead of entering the kernel.)
> > > >-- 
> > > >Dragiša Durić <dragisha at m3w.org>
> > -- 
> > Dragiša Durić <dragisha at m3w.org>
> > 
