<html>

<head>

<style>

.hmmessage P

{

margin:0px;

padding:0px

}

body.hmmessage

{

font-size: 10pt;

font-family:Verdana

}

</style>

</head>

<body class='hmmessage'>

[truncated again]<BR><BR>

<HR id=stopSpelling>

From: jay.krell@cornell.edu<BR>To: hosking@cs.purdue.edu<BR>CC: m3devel@elegosoft.com<BR>Subject: RE: [M3devel] per thread data?<BR>Date: Tue, 31 Mar 2009 00:12:25 +0000<BR><BR>


> > but I don't know for sure. I've never really liked the idea of <BR>> > having non-M3 threads.<BR><BR>I understand there is no free lunch, but the scenario is that I write a "plugin" in Modula-3.<BR>Or even a static dependency -- the point being to mix languages and have the "primary" language not be Modula-3. The goal is for folks to be able to call "native" pthread_create or Win32 CreateThread, and still be able to use Modula-3.<BR> <BR>On Win32, .dlls have a callback that gets called for every thread created.<BR>Generally they can initialize their per-thread data there.<BR>There is also a callback for thread exit.<BR>It is a slightly thorny issue though, for a few reasons.<BR>For example, if .dlls get dynamically loaded/unloaded, threads can be created before they load -- no callback, or threads can exit after they unload -- again, no callback.<BR>You only get callbacks when you are already loaded at the time of thread create/exit.<BR> <BR>You can also initialize on demand, assuming there is still enough memory.<BR> <BR>If the primary executable is not written in Modula-3, but has a static dependency on a Modula-3 .dll, then it works ok.<BR> <BR> - Jay<BR><BR> <BR>> From: hosking@cs.purdue.edu<BR>> To: hosking@cs.purdue.edu<BR>> Date: Tue, 31 Mar 2009 09:45:22 +1100<BR>> CC: m3devel@elegosoft.com; jay.krell@cornell.edu<BR>> Subject: Re: [M3devel] per thread data?<BR>> <BR>> PS In general, I am loath to make changes that complicate the code <BR>> based on performance assumptions that are only hypothetical. Better <BR>> to profile and see where the time is going before prematurely <BR>> "optimizing".<BR>> <BR>> On 31 Mar 2009, at 09:42, Tony Hosking wrote:<BR>> <BR>> > Yes, this is a tricky issue. At some point I seem to recall it <BR>> > being OK to have non-Modula-3 threads start running Modula-3 code, <BR>> > but I don't know for sure. 
I've never really liked the idea of <BR>> > having non-M3 threads.<BR>> ><BR>> > Are you using the existing handler maps and exception stack <BR>> > unwinding support for non-x86 NT?<BR>> ><BR>> > On 31 Mar 2009, at 09:15, Jay wrote:<BR>> ><BR>> >><BR>> >> hm, thinking about this more...<BR>> >> What about threads not created by Modula-3 Fork() (or the first <BR>> >> thread)?<BR>> >><BR>> >> It looks like exception handling had a chance of working on them<BR>> >> before. Now they'll crash upon entering functions<BR>> >> with try or raise or I presume lock.<BR>> >><BR>> >><BR>> >> 1) ok?<BR>> >><BR>> >><BR>> >> 2) do the heap alloc on demand?<BR>> >> But is that enough? Can it be initialized without further context?<BR>> >> Let's see..the circular list can be maintained without further <BR>> >> context.<BR>> >> handle := pthread_self, ok. stack can probably be figured out, though<BR>> >> that is probably just for gc and could be left alone for now, <BR>> >> continuing<BR>> >> to not work (or fixed)...getcontext at least on some platforms can<BR>> >> fill this in, or VirtualQuery/msomething (mmap family?)?<BR>> >><BR>> >><BR>> >> 3) put back the second thread local?<BR>> >><BR>> >><BR>> >> #2 has a chance of working better than before -- letting GC<BR>> >> work on threads not created by Modula-3 runtime, something<BR>> >> that has long bothered me...but I haven't done a complete analysis.<BR>> >> Or at least maybe keep it working as it was<BR>> >> For now there is somewhat of a regression, ie, when calling<BR>> >> Modula-3 code on threads not created from Modula-3.<BR>> >> Possibly the gc in this case was already dangerous?<BR>> >> Failing to find references on other stacks?<BR>> >> Or failing all allocations (should be easy to check but I have to <BR>> >> run..)<BR>> >><BR>> >><BR>> >> - Jay<BR>> >><BR>> >><BR>> >><BR>> >><BR>> >><BR>> >><BR>> >><BR>> >><BR>> >><BR>> >><BR>> >><BR>> >> ----------------------------------------<BR>> >>> From: 
jay.krell@cornell.edu<BR>> >>> To: hosking@cs.purdue.edu<BR>> >>> CC: m3devel@elegosoft.com<BR>> >>> Subject: RE: [M3devel] per thread data?<BR>> >>> Date: Mon, 30 Mar 2009 13:23:10 +0000<BR>> >>><BR>> >>><BR>> >>> This was surprisingly difficult.<BR>> >>><BR>> >>><BR>> >>> InitHandlers is called much earlier than InitActivations.<BR>> >>> InitActivations does a heap allocation.<BR>> >>> InitHandlers did not.<BR>> >>> The types involved are not yet initialized at this point, or <BR>> >>> somesuch.<BR>> >>> You cannot NEW(Activation) in the first call to PushFrame.<BR>> >>> So, maybe, use a global for the first one,<BR>> >>> but then what happens is it gets reinitialized later by<BR>> >>> the module initializer -- which is perhaps another indictment<BR>> >>> of initializers..or maybe a special case in the depths of the <BR>> >>> system --<BR>> >>> this module and anything it uses are subject to be called by<BR>> >>> compiler-generated calls -- they can be called before their <BR>> >>> initializers<BR>> >>> run.. 
seems to me the initialization could have happened <BR>> >>> "statically"<BR>> >>> like in C.<BR>> >>><BR>> >>><BR>> >>> Anyway, I should have this done shortly.<BR>> >>> Trick is to use a local value and assign it to a heap block<BR>> >>> allocated directly with calloc instead of RTAllocator.<BR>> >>><BR>> >>><BR>> >>> The result is maybe faster, maybe slower.<BR>> >>> Before, "try" cost pthread_getspecific and setspecific.<BR>> >>> Now it will just cost getspecific.<BR>> >>> But with another pointer deref and call to GetActivation<BR>> >>> with its on-demand initialization.<BR>> >>><BR>> >>><BR>> >>> Before, popframe only called setspecific.<BR>> >>> Now it will only call getspecific, plus the indirection<BR>> >>> and on-demand initialization.<BR>> >>> The on-demand seems bogus in pop, given that push already had to <BR>> >>> occur.<BR>> >>> So maybe that could be optimized.<BR>> >>><BR>> >>><BR>> >>> This stuff is highly optimized in C and C++ on NT.<BR>> >>> NT/x86 has a special thread local just for exception handling,<BR>> >>> faster than all other thread locals.<BR>> >>> All non-x86 NT platforms have stack walkers -- no cost for "try",<BR>> >>> and then "throw" maps instruction pointer to data about how<BR>> >>> to unwind the stack, using a little mini-assembly code.<BR>> >>><BR>> >>><BR>> >>> - Jay<BR>> >>><BR>> >>><BR>> >>> ________________________________<BR>> >>>> From: jay.krell@cornell.edu<BR>> >>>> To: hosking@cs.purdue.edu<BR>> >>>> Date: Thu, 19 Mar 2009 01:03:57 +0000<BR>> >>>> CC: m3devel@elegosoft.com<BR>> >>>> Subject: Re: [M3devel] per thread data?<BR>> >>>><BR>> >>>><BR>> >>>><BR>> >>>><BR>> >>>><BR>> >>>><BR>> >>>><BR>> >>>><BR>> >>>> Thanks, I should get around to that "soon" then.<BR>> >>>><BR>> >>>><BR>> >>>><BR>> >>>> - Jay<BR>> >>>><BR>> >>>><BR>> >>>><BR>> >>>> ________________________________<BR>> >>>><BR>> >>>> From: hosking@cs.purdue.edu<BR>> >>>> To: jay.krell@cornell.edu<BR>> >>>> Date: Thu, 19 Mar 2009 10:14:59 +1100<BR>> >>>> 
CC: m3devel@elegosoft.com<BR>> >>>> Subject: Re: [M3devel] per thread data?<BR>> >>>><BR>> >>>> I have no problem putting the exception handler stack thread <BR>> >>>> local into the activation thread local.<BR>> >>>><BR>> >>>><BR>> >>>><BR>> >>>><BR>> >>>><BR>> >>>><BR>> >>>> On 18 Mar 2009, at 20:11, Jay wrote:<BR>> >>>><BR>> >>>><BR>> >>>><BR>> >>>> I'm not looking at it right now, but doesn't it seem rather piggy to <BR>> >>>> have two thread locals and data on the side?<BR>> >>>><BR>> >>>><BR>> >>>> I'm guessing the data on the side is needed because we need to be <BR>> >>>> able to enumerate our threads, to suspend them all?<BR>> >>>><BR>> >>>><BR>> >>>> I understand that having multiple thread locals optimizes their <BR>> >>>> use, but it seems greedy<BR>> >>>> vs. a small heap allocation that combines them.<BR>> >>>><BR>> >>>> Or in fact.. presumably there could just be one thread local that <BR>> >>>> is the thread pointer, and the handler link could be put at the <BR>> >>>> start, for architectures where zero offset is smaller/faster than <BR>> >>>> non-zero offset.<BR>> >>>><BR>> >>>><BR>> >>>> Another idea, of course, is to look into "__thread", <BR>> >>>> "__declspec(thread)".<BR>> >>>><BR>> >>>> On Windows and probably all platforms they exist on, they are <BR>> >>>> nicely more efficient than pthread_get/setspecific, except on <BR>> >>>> Windows they don't really work acceptably prior to Vista -- they <BR>> >>>> only work in .exes and their static dependencies, not any .dll <BR>> >>>> you load after the process starts with LoadLibrary (dlopen).<BR>> >>>><BR>> >>>><BR>> >>>> Does "__thread" work well on most non-Windows platforms?<BR>> >>>> i.e. 
even if shared object is loaded with dlopen?<BR>> >>>><BR>> >>>><BR>> >>>> I could have sworn I saw code out there that was "adaptive".<BR>> >>>> It easily/efficiently checked if it was loaded with LoadLibrary <BR>> >>>> or not.<BR>> >>>> If so, it'd TlsGet/SetValue (pthread_get/setspecific).<BR>> >>>> If not, it'd use __declspec(thread) (__thread).<BR>> >>>> The check was based on if __tlsindex was not zero or somesuch. I <BR>> >>>> couldn't track it down though.<BR>> >>>><BR>> >>>><BR>> >>>> In either case, yes, I know, one of the thread locals at least is <BR>> >>>> gone on platforms that have stack walkers, e.g. Solaris, and <BR>> >>>> potentially NT, and maybe others.<BR>> >>>><BR>> >>>><BR>> >>>> - Jay<BR>> >>>><BR>> >>>><BR>> ]<BR></body>

</html>