<html>

<head>

<style><!--

.hmmessage P

{

margin:0px;

padding:0px

}

body.hmmessage

{

font-size: 10pt;

font-family:Verdana

}

--></style>

</head>

<body class='hmmessage'>

 > Getting thread locals should not require a kernel call<BR>

 <BR>

Indeed, on Linux/x86 it does not; the code looks pretty OK:<BR>

 <BR>

00000380 <__pthread_getspecific>:<BR> 380:   55                      push   %ebp<BR> 381:   89 e5                   mov    %esp,%ebp<BR> 383:   8b 55 08                mov    0x8(%ebp),%edx<BR> 386:   81 fa ff 03 00 00       cmp    $0x3ff,%edx<BR> 38c:   76 04                   jbe    392 <__pthread_getspecific+0x12><BR> 38e:   5d                      pop    %ebp<BR> 38f:   31 c0                   xor    %eax,%eax<BR> 391:   c3                      ret<BR><BR>

 392:   89 d0                   mov    %edx,%eax<BR> 394:   c1 e8 05                shr    $0x5,%eax<BR> 397:   8d 0c 85 1c 01 00 00    lea    0x11c(,%eax,4),%ecx<BR> 39e:   65 8b 01                mov    %gs:(%ecx),%eax<BR> 3a1:   85 c0                   test   %eax,%eax<BR> 3a3:   74 e9                   je     38e <__pthread_getspecific+0xe><BR> 3a5:   8b 04 d5 00 00 00 00    mov    0x0(,%edx,8),%eax<BR> 3ac:   85 c0                   test   %eax,%eax<BR> 3ae:   74 de                   je     38e <__pthread_getspecific+0xe><BR> 3b0:   65 8b 01                mov    %gs:(%ecx),%eax<BR> 3b3:   83 e2 1f                and    $0x1f,%edx<BR> 3b6:   8b 04 90                mov    (%eax,%edx,4),%eax<BR> 3b9:   5d                      pop    %ebp<BR> 3ba:   c3                      ret<BR><BR><BR>> Entering an uncontended pthread mutex should not be expensive<BR><BR>

Linux/x86:<BR>

 <BR>

00001020 <__pthread_self>:<BR>    1020:       55                      push   %ebp<BR>    1021:       89 e5                   mov    %esp,%ebp<BR>    1023:       65 a1 50 00 00 00       mov    %gs:0x50,%eax<BR>    1029:       5d                      pop    %ebp<BR>    102a:       c3                      ret<BR>    102b:       90                      nop<BR>    102c:       8d 74 26 00             lea    0x0(%esi),%esi<BR>

 <BR>

 <BR>

Pretty lame: five instructions where only two are needed.<BR>

 <BR>

<BR>000004f0 <__pthread_mutex_lock>:<BR><BR>

... too much to read through, but I think there is no kernel call.<BR>
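
 <BR>

A rough way to check that guess without reading the disassembly is to time lock/unlock pairs on a mutex that no other thread touches. This is a minimal sketch, assuming POSIX clock_gettime is available; now_ns and time_uncontended_mutex are made-up names for illustration, not part of any library:<BR>

 <BR>

```c
/* Sketch only: average cost of an uncontended pthread mutex lock/unlock pair. */
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <time.h>

/* Hypothetical helper: monotonic clock reading in nanoseconds. */
static double now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec * 1e9 + (double)ts.tv_nsec;
}

/* Hypothetical helper: lock and unlock a private mutex `iterations`
   times and return the average nanoseconds per pair. No other thread
   ever sees the mutex, so every lock is uncontended. */
double time_uncontended_mutex(long iterations)
{
    pthread_mutex_t mu;
    int rc = pthread_mutex_init(&mu, NULL);
    assert(rc == 0);
    (void)rc;

    double start = now_ns();
    for (long i = 0; i < iterations; i++) {
        pthread_mutex_lock(&mu);
        pthread_mutex_unlock(&mu);
    }
    double elapsed = now_ns() - start;

    pthread_mutex_destroy(&mu);
    return elapsed / (double)iterations;
}
```

 <BR>

Call time_uncontended_mutex with a few million iterations from main and print the result; build with something like cc -O2 -pthread. A per-pair cost far below the cost of a system call on the same machine would support the "no kernel call" reading, but it is only a timing hint, not proof.<BR>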

 <BR>
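
On the thread-local side, the __thread route mentioned in the quoted message below can be put next to pthread_getspecific directly. This is a minimal sketch, assuming GCC or Clang (__thread is a compiler extension); slot_key, fast_slot, slots_init, slot_via_pthread, slot_via_tls, fresh_thread_body and check_fresh_thread are hypothetical names:<BR>

 <BR>

```c
/* Sketch only: the same per-thread pointer kept two ways. */
#include <assert.h>
#include <pthread.h>

static pthread_key_t slot_key;   /* hypothetical key for the pthread API route */
static __thread int *fast_slot;  /* GCC/Clang extension: compiler-managed thread local */

/* Create the key and store p in both slots for the calling thread.
   Call this once, before any of the readers below. */
void slots_init(int *p)
{
    int rc = pthread_key_create(&slot_key, NULL);
    assert(rc == 0);
    rc = pthread_setspecific(slot_key, p);
    assert(rc == 0);
    (void)rc;
    fast_slot = p;
}

/* Read through the library: a function call to pthread_getspecific. */
int *slot_via_pthread(void)
{
    return pthread_getspecific(slot_key);
}

/* Read through __thread: no call into the pthread library. */
int *slot_via_tls(void)
{
    return fast_slot;
}

/* Thread body: both slots start out NULL in a new thread, then hold
   this thread's own pointer once set. Returns arg on success. */
static void *fresh_thread_body(void *arg)
{
    int ok = slot_via_pthread() == NULL && slot_via_tls() == NULL;
    ok = ok && pthread_setspecific(slot_key, arg) == 0;
    fast_slot = arg;
    ok = ok && slot_via_pthread() == arg && slot_via_tls() == arg;
    return ok ? arg : NULL;
}

/* Run the check above in a new thread; returns 1 if it passed. */
int check_fresh_thread(int *mine)
{
    pthread_t t;
    void *result = NULL;
    if (pthread_create(&t, NULL, fresh_thread_body, mine) != 0)
        return 0;
    if (pthread_join(t, &result) != 0)
        return 0;
    return result == (void *)mine;
}
```

 <BR>

Both readers return the same pointer within a thread, and a new thread sees its own empty slots. Disassembling slot_via_tls and slot_via_pthread at -O2 is the quickest way to see how much work each one does on a given platform.<BR>

 <BR>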

 - Jay<BR>

 <BR>

<HR id=stopSpelling>

From: jay.krell@cornell.edu<BR>To: dragisha@m3w.org; mika@async.async.caltech.edu<BR>CC: m3devel@elegosoft.com<BR>Subject: RE: [M3devel] userthreads vs. pthreads performance?<BR>Date: Sun, 28 Mar 2010 20:46:01 +0000<BR><BR>

<STYLE>

.ExternalClass .ecxhmmessage P

{padding:0px;}

.ExternalClass body.ecxhmmessage

{font-size:10pt;font-family:Verdana;}

</STYLE>

O(1) scheduling is not a new idea. Just look at NT and probably Solaris and probably all the other non-free systems (AIX, Irix, HP-UX, Tru64, VMS, etc.)<BR><BR>Getting thread locals should not require a kernel call. It doesn't on NT. We can optimize this somewhat on most systems with __thread. I had that in briefly.<BR><BR>Entering an uncontended pthread mutex should not be expensive -- at least no kernel call, but granted a call and atomic op. Two calls because of the C layer.<BR>But user threads pay for a call too of course.<BR><BR>Maybe I should profile some of this..<BR><BR>- Jay<BR><BR>> From: dragisha@m3w.org<BR>> To: mika@async.async.caltech.edu<BR>> Date: Sun, 28 Mar 2010 21:14:57 +0200<BR>> CC: m3devel@elegosoft.com<BR>> Subject: Re: [M3devel] userthreads vs. pthreads performance?<BR>> <BR>> I remember reading (long time ago) about how these (FUTEXes) are<BR>> efficient in LINUX... Can I have your test code to try?<BR>> <BR>> On Sun, 2010-03-28 at 12:11 -0700, Mika Nystrom wrote:<BR>> > Well I have run programs on PPC_DARWIN and FreeBSD<X> and seen these sorts of things...<BR>> > <BR>> > Dragiša Durić writes:<BR>> > >Which platform?<BR>> > ><BR>> > >On Sun, 2010-03-28 at 11:57 -0700, Mika Nystrom wrote:<BR>> > >> Yep, sounds right. <BR>> > >> <BR>> > >> I was profiling some other thread-using code that slowed down<BR>> > >> enormously<BR>> > >> because of pthreads and it turned out the program was spending ~95%<BR>> > >> of its time in accessing the thread locals via one of the pthread_<BR>> > >> functions.<BR>> > >> (The overhead of entering the kernel.)<BR>> > >-- <BR>> > >Dragiša Durić <dragisha@m3w.org><BR>> -- <BR>> Dragiša Durić <dragisha@m3w.org><BR>> <BR>                                        </body>

</html>