[M3devel] M3 programming problem : GC efficiency / per-thread storage areas?

Jay jay.krell at cornell.edu
Sat Oct 18 00:42:35 CEST 2008


Right and wrong.

Right: Tony was referring to Upthread.getspecific, or on Windows, WinBase.TlsGetValue.
Wrong: this does not necessarily incur a switch to the supervisor/kernel, and it is perhaps wrong to call that a "context switch" anyway. It depends on the operating system.

I will explain.

On Windows/x86, the FS register points to a partly documented per-thread data structure.
C and C++ exception handling use FS:0.
Disassemble almost any code and you'll find it used; not by Modula-3, though.
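
For illustration, here is a minimal sketch of my own (not something discussed on the list) that reads the structure's "linear" self pointer at fs:[18h] with the MSVC x86 intrinsic __readfsdword; NtCurrentTeb() from winnt.h returns the same pointer:

#include <windows.h>
#include <intrin.h>   /* __readfsdword: MSVC, x86 only */
#include <stdio.h>

int main(void)
{
    /* fs:[18h] holds a flat ("linear") pointer to the same per-thread
       structure that fs:0 addresses */
    void *self = (void *)__readfsdword(0x18);
    printf("fs:[18h] = %p, NtCurrentTeb() = %p\n", self, (void *)NtCurrentTeb());
    return 0;
}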

Disassemble TlsGetValue.

 cdb /z %windir%\system32\kernel32.dll  

0:000> uf kernel32!TlsGetValue
kernel32!TlsGetValue:

 typical-looking prolog 
7dd813e0 8bff            mov     edi,edi
7dd813e2 55              push    ebp
7dd813e3 8bec            mov     ebp,esp

 fs:[18h] contains a "normal" (linear) pointer to the same structure that fs:0 addresses. 
 Get that pointer. 
7dd813e5 64a118000000    mov     eax,dword ptr fs:[00000018h]

 get the index 
7dd813eb 8b4d08          mov     ecx,dword ptr [ebp+8]

 SetLastError(0) 
7dd813ee 83603400        and     dword ptr [eax+34h],0

  There are 64 preallocated thread-local slots -- compare the index to 64. 
7dd813f2 83f940          cmp     ecx,40h   

  If it is above or equal to 64, go use the non-preallocated slots. 
7dd813f5 0f8353e20200    jae     kernel32!lstrcmpi+0x4b22 (7ddaf64e)

  The preallocated slots are at fs:[e10h]; get the data and we're done.  
kernel32!TlsGetValue+0x1b:
7dd813fb 8b8488100e0000  mov     eax,dword ptr [eax+ecx*4+0E10h]

 epilog 

kernel32!TlsGetValue+0x22:
7dd81402 5d              pop     ebp
7dd81403 c20400          ret     4

 We get here for indices >= 64.
 Compare the index to 1088 == 1024 + 64, as there are another 1024 slots that are available, more slowly.  

kernel32!lstrcmpi+0x4b22:
7ddaf64e 81f940040000    cmp     ecx,440h

 if it is below 1088 (i.e. index - 64 is below 1024), go use those slots 

7ddaf654 7211            jb      kernel32!lstrcmpi+0x4b3b (7ddaf667)

 index is above or equal to 1088, SetLastError(invalid parameter) 

kernel32!lstrcmpi+0x4b2a:
7ddaf656 680d0000c0      push    0C000000Dh
7ddaf65b e80025fdff      call    kernel32!GetProcessHeap+0x12 (7dd81b60)

 and return 0 -- 0 is not unambiguously an error, which is why the last error was cleared at the start 

kernel32!lstrcmpi+0x4b34:
7ddaf660 33c0            xor     eax,eax
7ddaf662 e99b1dfdff      jmp     kernel32!TlsGetValue+0x22 (7dd81402)

 This is where the slots between 64 and 1088 are used. 
 Get the pointer from fs:[f94h] and compare it to null.
  If it is null, that is ok: it means nobody has yet called TlsSetValue for this slot,
  so it just retains its initial 0 value. 
kernel32!lstrcmpi+0x4b3b:
7ddaf667 8b80940f0000    mov     eax,dword ptr [eax+0F94h]
7ddaf66d 85c0            test    eax,eax
7ddaf66f 74ef            je      kernel32!lstrcmpi+0x4b34 (7ddaf660)

 The index is between 64 and 1088, and there is a non-null pointer at fs:[f94h].
 Subtract 64 from the index and index into the array that pointer refers to. 
 Note it does the subtraction after the multiplication, so it subtracts 64*4 = 0x100.

kernel32!lstrcmpi+0x4b45:
7ddaf671 8b848800ffffff  mov     eax,dword ptr [eax+ecx*4-100h]
7ddaf678 e9851dfdff      jmp     kernel32!TlsGetValue+0x22 (7dd81402)


So, it is a few instructions but there is no context switch into the kernel/supervisor.
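
To summarize the walkthrough in C: the following is only a sketch of the logic above, not the real source. The struct and its names are mine; the offsets (34h for the last-error value, e10h for the 64 inline slots, f94h for the expansion-slot pointer) are the ones visible in the disassembly, and the layout assumes 32-bit x86.

/* Sketch of the disassembled x86 TlsGetValue; 32-bit layout assumed. */
typedef struct SketchTeb {
    unsigned char pad1[0x34];
    unsigned long LastErrorValue;                  /* offset 34h  */
    unsigned char pad2[0xE10 - 0x38];
    void         *TlsSlots[64];                    /* offset e10h */
    unsigned char pad3[0xF94 - (0xE10 + 64 * 4)];
    void        **TlsExpansionSlots;               /* offset f94h */
} SketchTeb;

void *SketchTlsGetValue(SketchTeb *teb, unsigned long index)
{
    teb->LastErrorValue = 0;                 /* SetLastError(0) */
    if (index < 64)
        return teb->TlsSlots[index];         /* fast path: preallocated slots */
    if (index >= 64 + 1024) {
        /* the real code calls an internal last-error helper with
           STATUS_INVALID_PARAMETER (C000000Dh) here */
        return 0;
    }
    if (teb->TlsExpansionSlots == 0)
        return 0;                            /* never set; still 0 */
    return teb->TlsExpansionSlots[index - 64];
}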

Also, calls into the kernel aren't necessarily a "context switch".
Some context is saved and a bit is twiddled in the processor to indicate a privilege-level change, but no page tables are altered, I believe no TLB (translation lookaside buffer) entries are invalidated, and no thread scheduling decisions are made -- though upon exit from the kernel, APCs (asynchronous procedure calls) can be run, on the calling thread.

A more expensive context switch is when another thread or another process runs.
Switching threads requires saving more context, and switching processes also requires changing the register that points to the page tables.
One detail there -- calling into the x86 NT kernel does not save floating point state -- that is the additional state a thread switch has to save, at least. NT/x86 kernel drivers aren't allowed to use floating point, with some exceptions, such as video drivers (only certain functions?) or drivers that explicitly save/restore the floating point registers using public functions.
I don't know about the other architectures. I think IA64 only preserves some floating point state, not all.


Now, the question is how Upthread.getspecific is implemented on the other architectures and operating systems.
We should look into that for the various operating systems.
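
For reference, the portable interface Upthread wraps is pthread_getspecific and friends; whether a given implementation stays in user space is exactly what we would have to check per platform. A minimal usage sketch in C (compile with -lpthread):

#include <pthread.h>
#include <stdio.h>

static pthread_key_t key;

static void *worker(void *arg)
{
    /* each thread stores its own value under the same key */
    pthread_setspecific(key, arg);
    printf("my value: %s\n", (char *)pthread_getspecific(key));
    return 0;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_key_create(&key, 0);   /* 0: no per-thread destructor */
    pthread_create(&t1, 0, worker, "one");
    pthread_create(&t2, 0, worker, "two");
    pthread_join(t1, 0);
    pthread_join(t2, 0);
    pthread_key_delete(key);
    return 0;
}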


Oh, also, let's see what __declspec(thread) does.

>type t.c


__declspec(thread) int a;

void F1(int);

void F2() { F1(a); }

cl -c t.c

link -dump -disasm t.obj


Dump of file t.obj

File Type: COFF OBJECT

_F2:
  00000000: 55                 push        ebp
  00000001: 8B EC              mov         ebp,esp
  00000003: A1 00 00 00 00     mov         eax,dword ptr [__tls_index]
  00000008: 64 8B 0D 00 00 00  mov         ecx,dword ptr fs:[__tls_array]
            00
  0000000F: 8B 14 81           mov         edx,dword ptr [ecx+eax*4]
  00000012: 8B 82 00 00 00 00  mov         eax,dword ptr _a[edx]
  00000018: 50                 push        eax
  00000019: E8 00 00 00 00     call        _F1
  0000001E: 83 C4 04           add         esp,4
  00000021: 5D                 pop         ebp
  00000022: C3                 ret

See how the compiler-generated code references FS directly.

The optimized version is:

Dump of file t.obj

File Type: COFF OBJECT

_F2:
  00000000: A1 00 00 00 00     mov         eax,dword ptr [__tls_index]
  00000005: 64 8B 0D 00 00 00  mov         ecx,dword ptr fs:[__tls_array]
            00
  0000000C: 8B 14 81           mov         edx,dword ptr [ecx+eax*4]
  0000000F: 8B 82 00 00 00 00  mov         eax,dword ptr _a[edx]
  00000015: 50                 push        eax
  00000016: E8 00 00 00 00     call        _F1
  0000001B: 59                 pop         ecx
  0000001C: C3                 ret
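
The gcc/clang analogue (my addition; the example above is MSVC-only) is __thread. In a plain executable it typically compiles to a %gs- (x86) or %fs- (x86-64) relative access on Linux, much like the FS access above; position-independent/shared-library code may instead go through __tls_get_addr.

/* gcc/clang version of the t.c example above.
   Build and inspect with:  gcc -O2 -c t.c && objdump -dr t.o */
__thread int a;      /* C11 spells this _Thread_local */

void F1(int);

void F2(void) { F1(a); }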

 - Jay


> To: hosking at cs.purdue.edu
> Date: Fri, 17 Oct 2008 01:32:28 -0700
> From: mika at async.caltech.edu
> CC: m3devel at elegosoft.com; mika at camembert.async.caltech.edu
> Subject: Re: [M3devel] M3 programming problem : GC efficiency / per-thread storage areas?
>
> Ok I am sorry I am slow to pick up on this.
>
> I take it the problem is actually the Upthread.getspecific routine,
> which itself calls something get_curthread somewhere inside pthreads,
> which in turn involves a context switch to the supervisor---the identity
> of the current thread is just not accessible anywhere in user space.
> Also explains why this program runs faster with my old PM3, which uses
> longjmp threads.
>
> The only way to avoid it (really) is to pass a pointer to the
> Thread.T of the currently executing thread in the activation record
> of *every* procedure, so that allocators can find it when necessary....
> but that is very expensive in terms of stack memory.
>
> Or I can just make a structure like that that I pass around where
> I need it in my own program. Thread-specific and user-managed.
>
> I believe I have just answered all my own questions, but I hope
> Tony will correct me if my answers are incorrect.
>
> Mika
>
> Tony Hosking writes:
>>I suspect part of the overhead of allocation in the new code is the
>>need for thread-local allocation buffers, which means we need to
>>access thread-local state. We really need an efficient way to do
>>that, but pthreads thread-local accesses may be what is killing you.
>>
>>On 17 Oct 2008, at 00:30, Mika Nystrom wrote:
>>
>>> Hi Tony,
>>>
>>> I figured you would chime in!
>>>
>>> Yes, @M3noincremental seems to make things consistently a tad bit
>>> slower (but a very small difference), on both FreeBSD and Linux.
>>> @M3nogc makes a bigger difference, of course.
>>>
>>> Unfortunately I seem to have lost the code that did a lot of memory
>>> allocations. My tricks (as described in the email---and others!)
>>> have removed most of the troublesome memory allocations, but now
>>> I'm stuck with the mutex instead...
>>>
>>> Mika
>>>
>>> Tony Hosking writes:
>>>> Have you tried running @M3noincremental?
>>>>
>>>> On 16 Oct 2008, at 23:32, Mika Nystrom wrote:
>>>>
>>>>> Hello Modula-3 people,
>>>>>
>>>>> As I mentioned in an earlier email about printing structures (thanks
>>>>> Darko), I'm in the midst of coding an interpreter embedded in
>>>>> Modula-3. It's a Scheme interpreter, loosely based on Peter
>>>>> Norvig's
>>>>> JScheme for Java (well it was at first strongly based, but more and
>>>>> more loosely, if you know what I mean...)
>>>>>
>>>>> I expected that the performance of the interpreter would be much
>>>>> better in Modula-3 than in Java, and I have been testing on two
>>>>> different systems. One is my ancient FreeBSD-4.11 with an old PM3,
>>>>> and the other is CM3 on a recent Debian system. What I am finding
>>>>> is that it is indeed much faster than JScheme on FreeBSD/PM3
>>>>> (getting
>>>>> close to ten times as fast on some tasks at this point), but on
>>>>> Linux/CM3 it is much closer in speed to JScheme than I would like.
>>>>>
>>>>> When I started, with code that was essentially equivalent to
>>>>> JScheme,
>>>>> I found that it was a bit slower than JScheme on Linux/CM3 and
>>>>> possibly 2x as fast on FreeBSD/PM3. On Linux/CM3, it appears to
>>>>> spend most of its time in (surprise, surprise!) memory allocation
>>>>> and garbage collection. The speedup I have achieved between the
>>>>> first implementation and now was due to the use of Modula-3
>>>>> constructs
>>>>> that are superior to Java's, such as the use of arrays of RECORDs
>>>>> to make small stacks rather than linked lists. (I get readable
>>>>> code with much fewer memory allocations and GC work.)
>>>>>
>>>>> Now, since this is an interpreter, I as the implementer have limited
>>>>> control over how much memory is allocated and freed, and where it is
>>>>> needed. However, I can sometimes fall back on C-style memory
>>>>> management,
>>>>> but I would like to do it in a safe way. For instance, I have
>>>>> special-cased
>>>>> evaluation of Scheme primitives, as follows.
>>>>>
>>>>> Under the "normal" implementation, a list of things to evaluate is
>>>>> built up, passed to an evaluation function, and then the GC is left
>>>>> to sweep up the mess. The problem is that there are various tricky
>>>>> routes by which references can escape the evaluator, so you can't
>>>>> just assume that what you put in is going to be dead right after
>>>>> an eval and free it. Instead, I set a flag in the evaluator, which
>>>>> is TRUE if it is OK to free the list after the eval and FALSE if
>>>>> it's unclear (in which case the problem is left up to the GC).
>>>>>
>>>>> For the vast majority of Scheme primitives, one can indeed free the
>>>>> list right after the eval. Now of course I am not interested
>>>>> in unsafe code, so what I do is this:
>>>>>
>>>>> TYPE Pair = OBJECT first, rest : REFANY; END;
>>>>>
>>>>> VAR
>>>>>   mu := NEW(MUTEX);
>>>>>   free : Pair := NIL;
>>>>>
>>>>> PROCEDURE GetPair() : Pair =
>>>>>   BEGIN
>>>>>     LOCK mu DO
>>>>>       IF free # NIL THEN
>>>>>         TRY
>>>>>           RETURN free
>>>>>         FINALLY
>>>>>           free := free.rest
>>>>>         END
>>>>>       END
>>>>>     END;
>>>>>     RETURN NEW(Pair)
>>>>>   END GetPair;
>>>>>
>>>>> PROCEDURE ReturnPair(cons : Pair) =
>>>>>   BEGIN
>>>>>     cons.first := NIL;
>>>>>     LOCK mu DO
>>>>>       cons.rest := free;
>>>>>       free := cons
>>>>>     END
>>>>>   END ReturnPair;
>>>>>
>>>>> my eval code looks like
>>>>>
>>>>> VAR okToFree : BOOLEAN; BEGIN
>>>>>
>>>>>   args := GetPair(); ...
>>>>>   result := EvalPrimitive(args, (*VAR OUT*) okToFree);
>>>>>
>>>>>   IF okToFree THEN ReturnPair(args) END;
>>>>>   RETURN result
>>>>> END
>>>>>
>>>>> and this does work well. In fact it speeds up the Linux
>>>>> implementation
>>>>> by almost 100% to recycle the lists like this *just* for the
>>>>> evaluation of Scheme primitives.
>>>>>
>>>>> But it's still ugly, isn't it? There's a mutex, and a global
>>>>> variable. And yes, the time spent messing with the mutex is
>>>>> noticeable, and I haven't even made the code multi-threaded yet
>>>>> (and that is coming!)
>>>>>
>>>>> So I'm thinking, what I really want is a structure that is attached
>>>>> to my current Thread.T. I want to be able to access just a single
>>>>> pointer (like the free list) but be sure it is unique to my current
>>>>> thread. No locking would be necessary if I could do this.
>>>>>
>>>>> Does anyone have an elegant solution that does something like this?
>>>>> Thread-specific "static" variables? Just one REFANY would be enough
>>>>> for a lot of uses... seems to me this should be a frequently
>>>>> occurring problem?
>>>>>
>>>>> Best regards,
>>>>> Mika
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>


