<html>
<head>
<style><!--
.hmmessage P
{
margin:0px;
padding:0px
}
body.hmmessage
{
font-size: 10pt;
font-family:Verdana
}
--></style>
</head>
<body class='hmmessage'>
> We can optimize this somewhat on most systems with __thread. I had that in briefly<BR>
<BR>
<BR>
I did a little bit of testing here.<BR>
<BR>
<BR>
__thread takes a few forms, depending on which of -fPIC and -shared you use.<BR>
<BR>
Once you do throw in -fPIC and -shared, I have found __thread to be significantly slower than pthread_getspecific on Solaris/sparc and Linux/powerpc32, slower or a wash on Linux/amd64, and twice as fast as pthread_getspecific on Linux/x86.<BR>
It doesn't appear to be supported at all on Darwin, though pthread_getspecific is very fast there (albeit not inlined).<BR>
I didn't test *BSD.<BR>
My testing was not very scientific.<BR>
However -fPIC and/or -shared imply a function call to access __thread variables.<BR>
That's probably the big factor. Without -fPIC/-shared, there is no function call.<BR>
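<BR>
For reference, here is a minimal sketch of the two access forms being compared. The names (tls_counter, counter_key, bump_tls, bump_key, bump_on_new_thread) are made up for illustration and are not from the Modula-3 runtime. Each thread gets its own counter under both mechanisms. The cost difference discussed above comes from how each access is resolved, not from what it returns.<BR>

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Form 1: compiler-supported thread-local storage (GCC/Clang extension). */
static __thread int tls_counter;

/* Form 2: POSIX thread-specific data, reached through a key. */
static pthread_key_t counter_key;
static pthread_once_t counter_key_once = PTHREAD_ONCE_INIT;

static void make_key(void) {
    /* free() runs on each exiting thread's slot. */
    int rc = pthread_key_create(&counter_key, free);
    assert(rc == 0);
    (void)rc;
}

/* Returns this thread's slot, allocating it on first use. */
static int *key_slot(void) {
    pthread_once(&counter_key_once, make_key);
    int *slot = pthread_getspecific(counter_key);
    if (slot == NULL) {
        slot = calloc(1, sizeof *slot);
        assert(slot != NULL);
        int rc = pthread_setspecific(counter_key, slot);
        assert(rc == 0);
        (void)rc;
    }
    return slot;
}

/* Increment the __thread counter and return the new value. */
int bump_tls(void) { return ++tls_counter; }

/* Increment the pthread-key counter and return the new value. */
int bump_key(void) { return ++*key_slot(); }

static void *worker(void *arg) {
    int *out = arg;
    out[0] = bump_tls();   /* starts from 0 on a fresh thread */
    out[1] = bump_key();   /* likewise */
    return NULL;
}

/* Bumps both counters once on a new thread; returns 0 on success. */
int bump_on_new_thread(int out[2]) {
    pthread_t t;
    if (pthread_create(&t, NULL, worker, out) != 0) return -1;
    return pthread_join(t, NULL);
}
```

Build it with -pthread. Compiling the same file with and without -fPIC -shared and disassembling bump_tls is one way to see whether a call is emitted for the __thread access on a given target.<BR>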
<BR>
<BR>
If you are going to access them multiple times in a function, then there is probably an optimization to be had -- caching their address. If the variables are larger than a pointer, there is probably an optimization to be had there as well.<BR>
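<BR>
A rough sketch of the address-caching idea, using hypothetical names (tls_total, sum_naive, sum_cached). Whether it helps is an assumption to verify per platform, since a compiler may already hoist the address computation out of the loop on its own.<BR>

```c
#include <stddef.h>

static __thread long tls_total;

/* Names the thread-local on every iteration. Under -fPIC/-shared each
   access may involve a runtime address lookup unless the compiler hoists it. */
long sum_naive(const int *v, size_t n) {
    tls_total = 0;
    for (size_t i = 0; i < n; i++)
        tls_total += v[i];
    return tls_total;
}

/* Resolves the thread-local's address once, then goes through a plain pointer. */
long sum_cached(const int *v, size_t n) {
    long *total = &tls_total;   /* the only thread-local lookup */
    *total = 0;
    for (size_t i = 0; i < n; i++)
        *total += v[i];
    return *total;
}
```

The cached pointer is only valid on the thread that took it, so it should not be stored anywhere another thread could pick it up.<BR>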
<BR>
<BR>
We could do that first thing for certain -- PushFrame could return the address and PopFrame would be much faster.<BR>
However, another angle here is to eliminate PushFrame/PopFrame by using libunwind. I think we should look into libunwind for the next release. The thread locals would (mostly) remain, but accesses to them would greatly decline.<BR>
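<BR>
The idea of PushFrame returning the address could look roughly like the C sketch below. This is not the actual runtime code: frame_t, frame_top, push_frame, pop_frame and top_frame_id are hypothetical stand-ins, and the real frames carry more state. The point is only that the push does the one thread-local lookup and the pop reuses the result.<BR>

```c
#include <stddef.h>

/* Hypothetical frame record; the real runtime's layout differs. */
typedef struct frame {
    struct frame *next;
    int id;
} frame_t;

/* Per-thread head of the frame list. */
static __thread frame_t *frame_top;

/* Links f on top of this thread's list and returns the address of the
   thread-local head, so the caller can hand it back to pop_frame. */
frame_t **push_frame(frame_t *f) {
    frame_t **head = &frame_top;   /* the one thread-local lookup */
    f->next = *head;
    *head = f;
    return head;
}

/* Unlinks f through the cached head address; no thread-local lookup here. */
void pop_frame(frame_t **head, frame_t *f) {
    *head = f->next;
}

/* Inspection helper: id of the current top frame, or -1 if the list is empty. */
int top_frame_id(void) {
    return frame_top ? frame_top->id : -1;
}
```

The caller has to keep the returned pointer in a local for the lifetime of the frame, which is the price of making the pop cheap.<BR>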
<BR>
<BR>
We compile everything -fPIC and -shared.<BR>
Some systems (libtool) compile things once that way and once not, providing pairs of libraries, depending on intended use.<BR>
<BR>
<BR>
I should point out also that userthreads have been greatly deoptimized in the current tree (by me, Tony approved), because they used to inline PushFrame/PopFrame, but they don't any longer.<BR>
<BR>
<BR>
(Historically on NT, __declspec(thread) only worked in code in the .exe or in .dlls statically loaded by the .exe -- that is, not in .dlls loaded with LoadLibrary. However, that limitation was removed in Vista. I expect __declspec(thread) is much faster than TlsGetValue, but I assume we'll support pre-Vista for a while longer, so it's not interesting for now.)<BR>
<BR>
<BR>
- Jay<BR><BR> <BR>
<HR id=stopSpelling>
From: jay.krell@cornell.edu<BR>To: dragisha@m3w.org; mika@async.async.caltech.edu<BR>CC: m3devel@elegosoft.com<BR>Subject: RE: [M3devel] userthreads vs. pthreads performance?<BR>Date: Mon, 29 Mar 2010 03:15:41 +0000<BR><BR>
> Getting thread locals should not require a kernel call<BR> <BR>Indeed, on Linux/x86 it does not; it looks pretty OK:<BR> <BR>00000380 <__pthread_getspecific>:<BR> 380: 55 push %ebp<BR> 381: 89 e5 mov %esp,%ebp<BR> 383: 8b 55 08 mov 0x8(%ebp),%edx<BR> 386: 81 fa ff 03 00 00 cmp $0x3ff,%edx<BR> 38c: 76 04 jbe 392 <__pthread_getspecific+0x12><BR> 38e: 5d pop %ebp<BR> 38f: 31 c0 xor %eax,%eax<BR> 391: c3 ret<BR><BR> 392: 89 d0 mov %edx,%eax<BR> 394: c1 e8 05 shr $0x5,%eax<BR> 397: 8d 0c 85 1c 01 00 00 lea 0x11c(,%eax,4),%ecx<BR> 39e: 65 8b 01 mov %gs:(%ecx),%eax<BR> 3a1: 85 c0 test %eax,%eax<BR> 3a3: 74 e9 je 38e <__pthread_getspecific+0xe><BR> 3a5: 8b 04 d5 00 00 00 00 mov 0x0(,%edx,8),%eax<BR> 3ac: 85 c0 test %eax,%eax<BR> 3ae: 74 de je 38e <__pthread_getspecific+0xe><BR> 3b0: 65 8b 01 mov %gs:(%ecx),%eax<BR> 3b3: 83 e2 1f and $0x1f,%edx<BR> 3b6: 8b 04 90 mov (%eax,%edx,4),%eax<BR> 3b9: 5d pop %ebp<BR> 3ba: c3 ret<BR><BR><BR>> Entering an uncontended pthread mutex should not be expensive<BR><BR>Linux/x86:<BR> <BR>00001020 <__pthread_self>:<BR> 1020: 55 push %ebp<BR> 1021: 89 e5 mov %esp,%ebp<BR> 1023: 65 a1 50 00 00 00 mov %gs:0x50,%eax<BR> 1029: 5d pop %ebp<BR> 102a: c3 ret<BR> 102b: 90 nop<BR> 102c: 8d 74 26 00 lea 0x0(%esi),%esi<BR> <BR> <BR>Pretty lame: five instructions where only two are needed.<BR> <BR><BR>000004f0 <__pthread_mutex_lock>:<BR><BR>.. too much to read through..but I think no kernel call..<BR> <BR> - Jay<BR> <BR>
<HR id=ecxstopSpelling>
From: jay.krell@cornell.edu<BR>To: dragisha@m3w.org; mika@async.async.caltech.edu<BR>CC: m3devel@elegosoft.com<BR>Subject: RE: [M3devel] userthreads vs. pthreads performance?<BR>Date: Sun, 28 Mar 2010 20:46:01 +0000<BR><BR>
O(1) scheduling is not a new idea. Just look at NT and probably Solaris and probably all the other non-free systems (AIX, Irix, HP-UX, Tru64, VMS, etc.)<BR><BR>Getting thread locals should not require a kernel call. It doesn't on NT. We can optimize this somewhat on most systems with __thread. I had that in briefly.<BR><BR>Entering an uncontended pthread mutex should not be expensive -- at least no kernel call, but granted a call and atomic op. Two calls because of the C layer.<BR>But user threads pay for a call too of course.<BR><BR>Maybe I should profile some of this..<BR><BR>- Jay<BR><BR>> From: dragisha@m3w.org<BR>> To: mika@async.async.caltech.edu<BR>> Date: Sun, 28 Mar 2010 21:14:57 +0200<BR>> CC: m3devel@elegosoft.com<BR>> Subject: Re: [M3devel] userthreads vs. pthreads performance?<BR>> <BR>> I remember reading (long time ago) about how these (FUTEXes) are<BR>> efficient in LINUX... Can I have your test code to try?<BR>> <BR>> On Sun, 2010-03-28 at 12:11 -0700, Mika Nystrom wrote:<BR>> > Well I have run programs on PPC_DARWIN and FreeBSD<X> and seen these sorts of things...<BR>> > <BR>> > Dragiša Durić writes:<BR>> > >Which platform?<BR>> > ><BR>> > >On Sun, 2010-03-28 at 11:57 -0700, Mika Nystrom wrote:<BR>> > >> Yep, sounds right. <BR>> > >> <BR>> > >> I was profiling some other thread-using code that slowed down<BR>> > >> enormously<BR>> > >> because of pthreads and it turned out the program was spending ~95%<BR>> > >> of its time in accessing the thread locals via one of the pthread_<BR>> > >> functions.<BR>> > >> (The overhead of entering the kernel.)<BR>> > >-- <BR>> > >Dragiša Durić <dragisha@m3w.org><BR>> -- <BR>> Dragiša Durić <dragisha@m3w.org><BR>> <BR> </body>
</html>