[M3devel] userthreads vs. pthreads performance?

Mon Mar 29 10:24:47 CEST 2010

 > We can optimize this somewhat on most systems with __thread. I had that in briefly

I did a little bit of testing here.

__thread takes a few forms, depending on which of -fPIC and -shared you use.

Once you do throw in -fPIC and -shared, I have found __thread to be significantly slower than pthread_getspecific on Solaris/sparc and Linux/powerpc32, slower or a wash on Linux/amd64, and twice as fast as pthread_getspecific on Linux/x86.

It doesn't appear to be supported at all on Darwin, though pthread_getspecific is very fast there (albeit not inlined).

I didn't test *BSD.

My testing was not very scientific.

However -fPIC and/or -shared imply a function call to access __thread variables.

 That's probably the big factor. Without -fPIC/-shared, there is no function call.
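A rough sketch of the kind of micro-benchmark that could reproduce this comparison. It is not the test code from this thread; every name in it is hypothetical, and it relies on GCC/Clang extensions (__thread, noinline, an asm barrier). The numbers only mean something when it is built with the flags under investigation (with and without -fPIC/-shared).

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stdint.h>
#include <time.h>

/* All names are hypothetical; this is not the test code from the thread. */
static __thread uintptr_t tls_value = 1;  /* compiler-managed thread local */
static pthread_key_t tls_key;             /* library-managed thread local */
static pthread_once_t key_once = PTHREAD_ONCE_INIT;
static volatile uintptr_t sink;           /* keeps the results observable */

static void make_key(void) {
    int rc = pthread_key_create(&tls_key, NULL);
    assert(rc == 0);
    (void)rc;
}

/* noinline, so the thread-local access is paid for on every call. */
__attribute__((noinline)) static uintptr_t read_thread_keyword(void) {
    return tls_value;
}

__attribute__((noinline)) static uintptr_t read_getspecific(void) {
    return (uintptr_t)pthread_getspecific(tls_key);
}

static double now_sec(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec / 1e9;
}

/* Seconds spent doing `iters` reads through __thread. */
double time_thread_keyword(long iters) {
    uintptr_t sum = 0;
    double start = now_sec();
    for (long i = 0; i < iters; i++) {
        sum += read_thread_keyword();
        __asm__ volatile("" ::: "memory");  /* stop the call being hoisted */
    }
    double elapsed = now_sec() - start;
    sink = sum;
    return elapsed;
}

/* Seconds spent doing `iters` reads through pthread_getspecific. */
double time_getspecific(long iters) {
    pthread_once(&key_once, make_key);
    pthread_setspecific(tls_key, (void *)(uintptr_t)1);
    uintptr_t sum = 0;
    double start = now_sec();
    for (long i = 0; i < iters; i++) {
        sum += read_getspecific();
        __asm__ volatile("" ::: "memory");
    }
    double elapsed = now_sec() - start;
    sink = sum;
    return elapsed;
}
```

Calling both functions with the same large iteration count and comparing the results gives a rough ratio. Both loops pay for one extra function call per iteration, so the absolute times overstate the cost of each access.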

If you are going to access them multiple times in a function, then there is probably an optimization to be had: caching their address. If the variables are larger than a pointer, there is probably an optimization to be had there as well.

We could do that first thing for certain -- PushFrame could return the address and PopFrame would be much faster.
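A minimal sketch of that idea, assuming a simple linked list of frames. The Frame layout and the function names are made up for illustration and are not the actual runtime code. The push does the one thread-local access and returns the address; the pop reuses it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical frame record; the real runtime's layout differs. */
typedef struct Frame {
    struct Frame *next;
    int tag;
} Frame;

static __thread Frame *frame_top;  /* per-thread head of the frame list */

/* Link `f` in and return the address of the thread-local head, so the
   caller pays for the thread-local lookup only once. */
Frame **push_frame(Frame *f) {
    Frame **top = &frame_top;      /* the only thread-local access */
    f->next = *top;
    *top = f;
    return top;
}

/* Unlink using the cached address; no thread-local access here. */
void pop_frame(Frame **top, Frame *f) {
    assert(*top == f);             /* frames must be popped in LIFO order */
    *top = f->next;
}

/* Convenience accessor for the current thread's innermost frame. */
Frame *current_frame(void) {
    return frame_top;
}
```

The cached address is only valid on the thread that called push_frame, so it has to stay in a local variable of the function that pushed the frame.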

However another angle here is to eliminate PushFrame/PopFrame, by using libunwind. I think we should look into libunwind for the next release. The thread locals will (mostly) remain but accesses to them greatly decline.

We compile everything -fPIC and -shared.

Some systems (libtool) compile things once that way and once not, providing pairs of libraries, depending on intended use.

I should point out also that userthreads have been greatly deoptimized in the current tree (by me, Tony approved), because they used to inline PushFrame/PopFrame, but they don't any longer.

(Historically on NT, __declspec(thread) only worked in code in the .exe or statically loaded by the .exe -- that is, not in .dlls loaded with LoadLibrary. However that limitation was removed in Vista. I expect __declspec(thread) is much faster than TlsGetValue, but I assume we'll support pre-Vista for a while longer, so it's not interesting yet.)

 - Jay

From: jay.krell at cornell.edu
To: dragisha at m3w.org; mika at async.async.caltech.edu
CC: m3devel at elegosoft.com
Subject: RE: [M3devel] userthreads vs. pthreads performance?
Date: Mon, 29 Mar 2010 03:15:41 +0000

 > Getting thread locals should not require a kernel call

Indeed, on Linux/x86 it does not, looks pretty ok:

00000380 <__pthread_getspecific>:
 380:   55                      push   %ebp
 381:   89 e5                   mov    %esp,%ebp
 383:   8b 55 08                mov    0x8(%ebp),%edx
 386:   81 fa ff 03 00 00       cmp    $0x3ff,%edx
 38c:   76 04                   jbe    392 <__pthread_getspecific+0x12>
 38e:   5d                      pop    %ebp
 38f:   31 c0                   xor    %eax,%eax
 391:   c3                      ret

 392:   89 d0                   mov    %edx,%eax
 394:   c1 e8 05                shr    $0x5,%eax
 397:   8d 0c 85 1c 01 00 00    lea    0x11c(,%eax,4),%ecx
 39e:   65 8b 01                mov    %gs:(%ecx),%eax
 3a1:   85 c0                   test   %eax,%eax
 3a3:   74 e9                   je     38e <__pthread_getspecific+0xe>
 3a5:   8b 04 d5 00 00 00 00    mov    0x0(,%edx,8),%eax
 3ac:   85 c0                   test   %eax,%eax
 3ae:   74 de                   je     38e <__pthread_getspecific+0xe>
 3b0:   65 8b 01                mov    %gs:(%ecx),%eax
 3b3:   83 e2 1f                and    $0x1f,%edx
 3b6:   8b 04 90                mov    (%eax,%edx,4),%eax
 3b9:   5d                      pop    %ebp
 3ba:   c3                      ret
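
For readers who prefer C, here is one reading of the listing above as a small self-contained model. It is an interpretation of the disassembly, not the library source: the type, field, and function names are assumptions, and the per-thread array merely stands in for the data reached through %gs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define KEYS_MAX     1024  /* cmp $0x3ff: keys 0..1023 are accepted */
#define SECOND_LEVEL 32    /* shr $5 and and $0x1f: 32 slots per block */
#define FIRST_LEVEL  (KEYS_MAX / SECOND_LEVEL)

/* 8 bytes per entry on x86-32, matching the scale of 8 at 0x3a5.
   Field names are assumed. */
struct key_entry {
    int in_use;
    void (*destructor)(void *);
};

static struct key_entry key_table[KEYS_MAX];   /* process-wide key table */
static __thread void **specific[FIRST_LEVEL];  /* stands in for %gs:0x11c */

/* Mirrors the listing: range check, first-level lookup, key-in-use
   check, second-level lookup. No kernel call anywhere. */
void *model_getspecific(unsigned key) {
    if (key >= KEYS_MAX) return NULL;
    void **block = specific[key / SECOND_LEVEL];
    if (block == NULL) return NULL;
    if (!key_table[key].in_use) return NULL;
    return block[key % SECOND_LEVEL];
}

/* Helpers below are not in the listing; they exist only so the model
   can be exercised. */
void model_key_enable(unsigned key) {
    assert(key < KEYS_MAX);
    key_table[key].in_use = 1;
}

int model_setspecific(unsigned key, void *value) {
    if (key >= KEYS_MAX || !key_table[key].in_use) return -1;
    void ***slot = &specific[key / SECOND_LEVEL];
    if (*slot == NULL) {
        *slot = calloc(SECOND_LEVEL, sizeof(void *));
        if (*slot == NULL) return -1;
    }
    (*slot)[key % SECOND_LEVEL] = value;
    return 0;
}
```

The point of the model is only that the lookup is two array indexings plus a couple of checks, all in user space.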

> Entering an uncontended pthread mutex should not be expensive

Linux/x86:

00001020 <__pthread_self>:
    1020:       55                      push   %ebp
    1021:       89 e5                   mov    %esp,%ebp
    1023:       65 a1 50 00 00 00       mov    %gs:0x50,%eax
    1029:       5d                      pop    %ebp
    102a:       c3                      ret
    102b:       90                      nop
    102c:       8d 74 26 00             lea    0x0(%esi),%esi

Pretty lame: five instructions where only two are needed.

000004f0 <__pthread_mutex_lock>:

... too much to read through, but I think no kernel call.

 - Jay

From: jay.krell at cornell.edu
To: dragisha at m3w.org; mika at async.async.caltech.edu
CC: m3devel at elegosoft.com
Subject: RE: [M3devel] userthreads vs. pthreads performance?
Date: Sun, 28 Mar 2010 20:46:01 +0000

O(1) scheduling is not a new idea. Just look at NT and probably Solaris and probably all the other non-free systems (AIX, Irix, HP-UX, Tru64, VMS, etc.)

Getting thread locals should not require a kernel call. It doesn't on NT. We can optimize this somewhat on most systems with __thread. I had that in briefly.

Entering an uncontended pthread mutex should not be expensive -- at least no kernel call, but granted a call and atomic op. Two calls because of the C layer.
But user threads pay for a call too of course.
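
To illustrate the "call and atomic op" point, a minimal sketch of a lock whose uncontended path is a single compare-and-swap in user space. This is not how any particular pthread library implements its mutex; the names are hypothetical, and sched_yield is only a stand-in for the kernel wait (for example a futex) that a real implementation would use under contention.

```c
#include <assert.h>
#include <sched.h>
#include <stdatomic.h>

/* Hypothetical lock type: 0 = free, 1 = held. */
typedef struct {
    atomic_int state;
} sketch_mutex;

/* Returns 1 if the lock was taken on the fast path (one atomic op, no
   kernel entry), 0 if the caller had to wait for another holder. */
int sketch_lock(sketch_mutex *m) {
    int expected = 0;
    if (atomic_compare_exchange_strong_explicit(
            &m->state, &expected, 1,
            memory_order_acquire, memory_order_relaxed)) {
        return 1;
    }
    do {
        sched_yield();  /* stand-in for sleeping in the kernel */
        expected = 0;
    } while (!atomic_compare_exchange_weak_explicit(
                 &m->state, &expected, 1,
                 memory_order_acquire, memory_order_relaxed));
    return 0;
}

void sketch_unlock(sketch_mutex *m) {
    assert(atomic_load_explicit(&m->state, memory_order_relaxed) == 1);
    atomic_store_explicit(&m->state, 0, memory_order_release);
}
```

With no other thread holding the lock, sketch_lock returns after the first compare-and-swap, which is the case being argued about here.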

Maybe I should profile some of this..

- Jay

> From: dragisha at m3w.org
> To: mika at async.async.caltech.edu
> Date: Sun, 28 Mar 2010 21:14:57 +0200
> CC: m3devel at elegosoft.com
> Subject: Re: [M3devel] userthreads vs. pthreads performance?
> 
> I remember reading (long time ago) about how these (FUTEXes) are
> efficient in LINUX... Can I have your test code to try?
> 
> On Sun, 2010-03-28 at 12:11 -0700, Mika Nystrom wrote:
> > Well I have run programs on PPC_DARWIN and FreeBSD<X> and seen these sorts of things...
> > 
> > Dragiša Durić writes:
> > >Which platform?
> > >
> > >On Sun, 2010-03-28 at 11:57 -0700, Mika Nystrom wrote:
> > >> Yep, sounds right. 
> > >> 
> > >> I was profiling some other thread-using code that slowed down
> > >> enormously
> > >> because of pthreads and it turned out the program was spending ~95%
> > >> of its time in accessing the thread locals via one of the pthread_
> > >> functions.
> > >> (The overhead of entering the kernel.)
> > >-- 
> > >Dragiša Durić <dragisha at m3w.org>
> -- 
> Dragiša Durić <dragisha at m3w.org>
> 
