[M3devel] per thread data?

Tony Hosking hosking at cs.purdue.edu
Tue Mar 31 00:45:22 CEST 2009


PS  In general, I am loath to make changes that complicate the code  
based on performance assumptions that are only hypothetical.  Better  
to profile and see where the time is going before prematurely  
"optimizing".

On 31 Mar 2009, at 09:42, Tony Hosking wrote:

> Yes, this is a tricky issue.  At some point I seem to recall it  
> being OK to have non-Modula-3 threads start running Modula-3 code,  
> but I don't know for sure.  I've never really liked the idea of  
> having non-M3 threads.
>
> Are you using the existing handler maps and exception stack  
> unwinding support for non-x86 NT?
>
> On 31 Mar 2009, at 09:15, Jay wrote:
>
>>
>> hm, thinking about this more...
>> What about threads not created by Modula-3 Fork() (or the first  
>> thread)?
>>
>> It looks like exception handling had a chance of working on them
>> before. Now they'll crash upon entering functions
>> with try or raise or I presume lock.
>>
>>
>> 1) ok?
>>
>>
>> 2) do the heap alloc on demand?
>> But is that enough? Can it be initialized without further context?
>> Let's see..the circular list can be maintained without further  
>> context.
>> handle := pthread_self, ok. The stack bounds can probably be figured out,
>> though they are probably needed only for the gc and could be left alone
>> for now, continuing to not work (or be fixed)... getcontext, at least on
>> some platforms, can fill this in, or VirtualQuery/msomething (mmap family?)?
>>
>>
>> 3) put back the second thread local?
>>
>>
>> #2 has a chance of working better than before -- letting GC
>> work on threads not created by Modula-3 runtime, something
>> that has long bothered me...but I haven't done a complete analysis.
>> Or at least maybe keep it working as it was.
>> For now there is somewhat of a regression, ie, when calling
>> Modula-3 code on threads not created from Modula-3.
>> Possibly the gc in this case was already dangerous?
>> Failing to find references on other stacks?
>> Or failing all allocations (should be easy to check but I have to  
>> run..)
>>
>>
>> - Jay
>>
>> ----------------------------------------
>>> From: jay.krell at cornell.edu
>>> To: hosking at cs.purdue.edu
>>> CC: m3devel at elegosoft.com
>>> Subject: RE: [M3devel] per thread data?
>>> Date: Mon, 30 Mar 2009 13:23:10 +0000
>>>
>>>
>>> This was surprisingly difficult.
>>>
>>>
>>> InitHandlers is called much earlier than InitActivations.
>>> InitActivations does a heap allocation.
>>> InitHandlers did not.
>>> The types involved are not yet initialized at this point, or somesuch.
>>> You cannot NEW(Activation) in the first call to PushFrame.
>>> So, maybe, use a global for the first one, but then it gets
>>> reinitialized later by the module initializer. That is perhaps another
>>> indictment of initializers, or maybe it calls for a special case in the
>>> depths of the system: this module and anything it uses can be reached
>>> through compiler-generated calls, so they can be called before their
>>> initializers run. It seems to me the initialization could have happened
>>> "statically", like in C.
>>>
>>>
>>> Anyway, I should have this done shortly.
>>> Trick is to use a local value and assign it to a heap block
>>> allocated directly with calloc instead of RTAllocator.
>>>
>>>
>>> The result is maybe faster, maybe slower.
>>> Before, "try" cost pthread_getspecific and setspecific.
>>> Now it will just cost getspecific.
>>> But with another pointer deref and call to GetActivation
>>> with its on-demand initialization.
>>>
>>>
>>> Before, popframe only called setspecific.
>>> Now it will only call getspecific, plus the indirect
>>> and on-demand initialization.
>>> The on-demand seems bogus in pop, given that push already had to  
>>> occur.
>>> So maybe that could be optimized.
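A minimal C sketch of the scheme described above, for illustration only: one pthread key holds a per-thread block that is allocated with calloc on first use, and the handler stack lives inside that block. All names here (`Activation`, `Frame`, `get_activation`, `push_frame`, `pop_frame`) are hypothetical and are not the Modula-3 runtime's actual definitions.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical types; the real runtime's layout is not shown in this thread. */
typedef struct Frame {
    struct Frame *next;      /* enclosing handler frame */
} Frame;

typedef struct Activation {
    Frame *handlers;         /* top of this thread's handler stack */
    pthread_t handle;        /* filled in on demand from pthread_self() */
} Activation;

static pthread_key_t activation_key;
static pthread_once_t activation_once = PTHREAD_ONCE_INIT;

static void make_key(void) {
    /* free() as the destructor releases the block when a thread exits. */
    int rc = pthread_key_create(&activation_key, free);
    assert(rc == 0);
    (void)rc;
}

/* One getspecific per call. The block is allocated with calloc, not the
   garbage-collected allocator, the first time a thread arrives here, so
   threads not created by the runtime are covered as well. */
Activation *get_activation(void) {
    pthread_once(&activation_once, make_key);
    Activation *act = pthread_getspecific(activation_key);
    if (act == NULL) {
        act = calloc(1, sizeof *act);
        if (act == NULL) abort();
        act->handle = pthread_self();
        pthread_setspecific(activation_key, act);
    }
    return act;
}

void push_frame(Frame *f) {
    Activation *act = get_activation();
    f->next = act->handlers;
    act->handlers = f;
}

void pop_frame(void) {
    /* A push has already happened on this thread, so the on-demand path is
       never taken here; a leaner variant could skip the NULL check. */
    Activation *act = get_activation();
    assert(act->handlers != NULL);
    act->handlers = act->handlers->next;
}
```

In this sketch "try" costs one pthread_getspecific plus a pointer dereference, and setspecific is called only once per thread.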
>>>
>>>
>>> This stuff is highly optimized in C and C++ on NT..
>>> NT/x86 has a special thread local just for exception handling,
>>> faster than all other thread locals.
>>> All non-x86 NT platforms have stack walkers -- no cost for "try",
>>> and then "throw" maps the instruction pointer to data about how
>>> to unwind the stack, using a little mini-assembly code.
>>>
>>>
>>> - Jay
>>>
>>>
>>> ________________________________
>>>> From: jay.krell at cornell.edu
>>>> To: hosking at cs.purdue.edu
>>>> Date: Thu, 19 Mar 2009 01:03:57 +0000
>>>> CC: m3devel at elegosoft.com
>>>> Subject: Re: [M3devel] per thread data?
>>>>
>>>>
>>>> Thanks, I should get around to that "soon" then.
>>>>
>>>>
>>>>
>>>> - Jay
>>>>
>>>>
>>>>
>>>> ________________________________
>>>>
>>>> From: hosking at cs.purdue.edu
>>>> To: jay.krell at cornell.edu
>>>> Date: Thu, 19 Mar 2009 10:14:59 +1100
>>>> CC: m3devel at elegosoft.com
>>>> Subject: Re: [M3devel] per thread data?
>>>>
>>>> I have no problem putting the exception handler stack thread  
>>>> local into the activation thread local.
>>>>
>>>>
>>>> On 18 Mar 2009, at 20:11, Jay wrote:
>>>>
>>>>
>>>>
>>>> I'm not looking at it right now, but doesn't it seem rather piggy to  
>>>> have two thread locals and data on the side?
>>>>
>>>>
>>>> I'm guessing the data on the side is needed because we need to be  
>>>> able to enumerate our threads, to suspend them all?
>>>>
>>>>
>>>> I understand that having multiple thread locals optimizes their  
>>>> use, but it seems greedy compared with a small heap allocation
>>>> that combines them.
>>>>
>>>> Or in fact.. presumably there could just be one thread local that  
>>>> is the thread pointer, and the handler link could be put at the  
>>>> start, for architectures where zero offset is smaller/faster than  
>>>> non-zero offset.
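The layout suggested above might look roughly like the following C sketch: a single per-thread record with the handler link as its first member, plus list links so that all threads can be enumerated. The names (`Thread`, `thread_list_insert`, and so on) are hypothetical, and a real runtime would need a lock around the list operations.

```c
#include <assert.h>
#include <stddef.h>

typedef struct Frame Frame;   /* opaque handler frame (hypothetical) */

typedef struct Thread {
    Frame *handlers;          /* first member, so it sits at offset zero */
    struct Thread *next;      /* circular list links for enumeration */
    struct Thread *prev;
} Thread;

/* Compile-time check that the handler link really is at offset zero. */
static_assert(offsetof(Thread, handlers) == 0, "handler link must come first");

/* The head is itself a thread record (for example, the first thread). */
void thread_list_init(Thread *head) {
    head->next = head;
    head->prev = head;
}

/* Insert t right after head; no context beyond the two records is needed. */
void thread_list_insert(Thread *head, Thread *t) {
    t->next = head->next;
    t->prev = head;
    head->next->prev = t;
    head->next = t;
}

void thread_list_remove(Thread *t) {
    t->prev->next = t->next;
    t->next->prev = t->prev;
    t->next = t;
    t->prev = t;
}

/* Walk the ring once, counting every record including the head. */
size_t thread_list_count(const Thread *head) {
    size_t n = 1;
    for (const Thread *t = head->next; t != head; t = t->next) {
        n++;
    }
    return n;
}
```

With this layout the thread pointer and the address of the handler link are the same value, so code that only needs the handler stack pays for no extra offset.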
>>>>
>>>>
>>>> Another idea, of course, is to look into "__thread",  
>>>> "__declspec(thread)".
>>>>
>>>> On Windows and probably all platforms they exist on, they are  
>>>> nicely more efficient than pthread_get/setspecific, except on  
>>>> Windows they don't really work acceptably prior to Vista -- they  
>>>> only work in .exes and their static dependencies, not any .dll  
>>>> you load after the process starts with LoadLibrary (dlopen).
>>>>
>>>>
>>>> Does "__thread" work well on most non-Windows platforms?
>>>> i.e. even if shared object is loaded with dlopen?
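For comparison, a minimal sketch of what the `__thread` form of the handler stack could look like in C. It assumes a compiler that accepts `__thread` (GCC, Clang) or `__declspec(thread)` (MSVC); the `THREAD_LOCAL` macro and the `tl_*` function names are hypothetical. It does not settle the dlopen/LoadLibrary question raised above; it only shows the shape of the code.

```c
#include <assert.h>
#include <stddef.h>

typedef struct Frame {
    struct Frame *next;       /* enclosing handler frame */
} Frame;

/* Pick the compiler-specific spelling of a thread-local variable. */
#if defined(_MSC_VER)
#define THREAD_LOCAL __declspec(thread)
#else
#define THREAD_LOCAL __thread
#endif

/* One copy per thread; zero-initialized, so each thread starts with NULL. */
static THREAD_LOCAL Frame *handler_top;

void tl_push_frame(Frame *f) {
    f->next = handler_top;
    handler_top = f;
}

void tl_pop_frame(void) {
    assert(handler_top != NULL);   /* a push must have come first */
    handler_top = handler_top->next;
}

Frame *tl_top_frame(void) {
    return handler_top;
}
```

Here push and pop are plain loads and stores with no library call, which is the efficiency argument made above for `__thread` over pthread_get/setspecific.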
>>>>
>>>>
>>>> I could have sworn I saw code out there that was "adaptive".
>>>> It easily/efficiently checked if it was loaded with LoadLibrary  
>>>> or not.
>>>> If so, it'd TlsGet/SetValue (pthread_get/setspecific).
>>>> If not, it'd use __declspec(thread) (__thread).
>>>> The check was based on if __tlsindex was not zero or somesuch. I  
>>>> couldn't track it down though.
>>>>
>>>>
>>>> In either case, yes, I know, one of the thread locals at least is  
>>>> gone on platforms that have stack walkers, e.g. Solaris, and  
>>>> potentially NT, and maybe others.
>>>>
>>>>
>>>> - Jay
>>>>
>>>>



