[M3devel] per thread data?
Jay
jay.krell at cornell.edu
Tue Mar 31 02:27:01 CEST 2009
I understand, but pushframe/popframe seem so icky...
I like to try to keep things in reasonable shape as I go, but I realize yours is the more often stated opinion: don't optimize prematurely.
- Jay
> From: hosking at cs.purdue.edu
> To: hosking at cs.purdue.edu
> Date: Tue, 31 Mar 2009 09:45:22 +1100
> CC: m3devel at elegosoft.com; jay.krell at cornell.edu
> Subject: Re: [M3devel] per thread data?
>
> PS In general, I am loath to make changes that complicate the code
> based on performance assumptions that are only hypothetical. Better
> to profile and see where the time is going before prematurely
> "optimizing".
>
> On 31 Mar 2009, at 09:42, Tony Hosking wrote:
>
> > Yes, this is a tricky issue. At some point I seem to recall it
> > being OK to have non-Modula-3 threads start running Modula-3 code,
> > but I don't know for sure. I've never really liked the idea of
> > having non-M3 threads.
> >
> > Are you using the existing handler maps and exception stack
> > unwinding support for non-x86 NT?
> >
> > On 31 Mar 2009, at 09:15, Jay wrote:
> >
> >>
> >> hm, thinking about this more...
> >> What about threads not created by Modula-3 Fork() (or the first
> >> thread)?
> >>
> >> It looks like exception handling had a chance of working on them
> >> before. Now they'll crash upon entering functions
> >> with try or raise or, I presume, lock.
> >>
> >>
> >> 1) ok?
> >>
> >>
> >> 2) do the heap alloc on demand?
> >> But is that enough? Can it be initialized without further context?
> >> Let's see..the circular list can be maintained without further
> >> context.
> >> handle := pthread_self, ok. The stack can probably be figured out, though
> >> that is probably just for gc and could be left alone for now, continuing
> >> to not work (or fixed)... getcontext, at least on some platforms, can
> >> fill this in, or VirtualQuery/msomething (mmap family?)?
> >>
> >>
> >> 3) put back the second thread local?
> >>
> >>
> >> #2 has a chance of working better than before -- letting GC
> >> work on threads not created by Modula-3 runtime, something
> >> that has long bothered me...but I haven't done a complete analysis.
> >> Or at least maybe keep it working as it was.
> >> For now there is somewhat of a regression, i.e., when calling
> >> Modula-3 code on threads not created from Modula-3.
> >> Possibly the gc in this case was already dangerous?
> >> Failing to find references on other stacks?
> >> Or failing all allocations (should be easy to check but I have to
> >> run..)
> >>
> >>
> >> - Jay
> >>
> >> ----------------------------------------
> >>> From: jay.krell at cornell.edu
> >>> To: hosking at cs.purdue.edu
> >>> CC: m3devel at elegosoft.com
> >>> Subject: RE: [M3devel] per thread data?
> >>> Date: Mon, 30 Mar 2009 13:23:10 +0000
> >>>
> >>>
> >>> This was surprisingly difficult.
> >>>
> >>>
> >>> InitHandlers is called much earlier than InitActivations.
> >>> InitActivations does a heap allocation.
> >>> InitHandlers did not.
> >>> The types involved are not yet initialized at this point, or
> >>> somesuch.
> >>> You cannot NEW(Activation) in the first call to PushFrame.
> >>> So, maybe, use a global for the first one,
> >>> but then what happens is it gets reinitialized later by
> >>> the module initializer -- which is perhaps another indictment
> >>> of initializers..or maybe a special case in the depths of the
> >>> system --
> >>> this module and anything it uses are subject to be called by
> >>> compiler-generated calls -- they can be called before their
> >>> initializers
> >>> run.. seems to me the initialization could have happened
> >>> "statically"
> >>> like in C.
> >>>
> >>>
> >>> Anyway, I should have this done shortly.
> >>> Trick is to use a local value and assign it to a heap block
> >>> allocated directly with calloc instead of RTAllocator.
> >>>
> >>>
> >>> The result is maybe faster, maybe slower.
> >>> Before, "try" cost pthread_getspecific and setspecific.
> >>> Now it will just cost getspecific.
> >>> But with another pointer deref and call to GetActivation
> >>> with its on-demand initialization.
> >>>
> >>>
> >>> Before, popframe only called setspecific.
> >>> Now it will only call getspecific, plus the indirection
> >>> and on-demand initialization.
> >>> The on-demand initialization seems bogus in pop, given that push
> >>> already had to occur.
> >>> So maybe that could be optimized.
> >>>
> >>>
> >>> This stuff is highly optimized in C and C++ on NT..
> >>> NT/x86 has a special thread local just for exception handling,
> >>> faster than all other thread locals.
> >>> All non-x86 NT platforms have stack walkers -- no cost for "try",
> >>> and then "throw" maps the instruction pointer to data about how to
> >>> unwind the stack, using a little mini-assembly code.
> >>>
> >>>
> >>> - Jay
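A minimal C sketch of the single-thread-local scheme described in the message above, for readers of the archive. It is not the Modula-3 runtime's code: the names Frame, Activation, get_activation, push_frame and pop_frame are hypothetical, and the layout is an assumption. It shows one pthread key holding a per-thread record, allocated on demand with calloc, with the handler link as the first field, so that push and pop each cost one pthread_getspecific plus a pointer dereference.

```c
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical types; the real runtime's records differ. */
typedef struct Frame {
    struct Frame *next;               /* enclosing handler frame */
} Frame;

typedef struct Activation {
    Frame *handlers;                  /* first field, so it sits at offset zero */
    struct Activation *next, *prev;   /* circular list of all known threads */
    pthread_t handle;
} Activation;

static pthread_key_t activation_key;
static pthread_once_t activation_once = PTHREAD_ONCE_INIT;
static pthread_mutex_t list_lock = PTHREAD_MUTEX_INITIALIZER;
static Activation *all_activations;   /* some node of the list, or NULL */

static void make_key(void)
{
    /* No destructor: freeing and unlinking at thread exit is left out. */
    if (pthread_key_create(&activation_key, NULL) != 0)
        abort();
}

/* Return this thread's record, creating it on first use. This also
   covers threads that were not created by the runtime itself. */
Activation *get_activation(void)
{
    Activation *a;

    pthread_once(&activation_once, make_key);
    a = pthread_getspecific(activation_key);
    if (a == NULL) {
        a = calloc(1, sizeof *a);      /* plain C heap, not a traced heap */
        if (a == NULL)
            abort();
        a->handle = pthread_self();

        pthread_mutex_lock(&list_lock);
        if (all_activations == NULL) {
            a->next = a->prev = a;
            all_activations = a;
        } else {
            a->next = all_activations;
            a->prev = all_activations->prev;
            a->prev->next = a;
            all_activations->prev = a;
        }
        pthread_mutex_unlock(&list_lock);

        if (pthread_setspecific(activation_key, a) != 0)
            abort();
    }
    return a;
}

/* "try" entry: link a new handler frame in front of the current one. */
void push_frame(Frame *f)
{
    Activation *a = get_activation();
    f->next = a->handlers;
    a->handlers = f;
}

/* "try" exit: restore the enclosing frame. */
void pop_frame(Frame *f)
{
    get_activation()->handlers = f->next;
}
```

In this sketch pop_frame still goes through the on-demand path. Since a push must already have happened on the same thread, it could read the key directly instead, which is the optimization the message hints at.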
> >>>
> >>>
> >>> ________________________________
> >>>> From: jay.krell at cornell.edu
> >>>> To: hosking at cs.purdue.edu
> >>>> Date: Thu, 19 Mar 2009 01:03:57 +0000
> >>>> CC: m3devel at elegosoft.com
> >>>> Subject: Re: [M3devel] per thread data?
> >>>>
> >>>> Thanks, I should get around to that "soon" then.
> >>>>
> >>>>
> >>>>
> >>>> - Jay
> >>>>
> >>>>
> >>>>
> >>>> ________________________________
> >>>>
> >>>> From: hosking at cs.purdue.edu
> >>>> To: jay.krell at cornell.edu
> >>>> Date: Thu, 19 Mar 2009 10:14:59 +1100
> >>>> CC: m3devel at elegosoft.com
> >>>> Subject: Re: [M3devel] per thread data?
> >>>>
> >>>> I have no problem putting the exception handler stack thread
> >>>> local into the activation thread local.
> >>>>
> >>>> On 18 Mar 2009, at 20:11, Jay wrote:
> >>>>
> >>>>
> >>>>
> >>>> I'm not looking at it right now, but doesn't it seem rather piggy to
> >>>> have two thread locals and data on the side?
> >>>>
> >>>>
> >>>> I'm guessing the data on the side is needed because we need to be
> >>>> able to enumerate our threads, to suspend them all?
> >>>>
> >>>>
> >>>> I understand that having multiple thread locals optimizes their
> >>>> use, but it seems greedy
> >>>> vs. a small heap allocation that combines them.
> >>>>
> >>>> Or in fact.. presumably there could just be one thread local that
> >>>> is the thread pointer, and the handler link could be put at the
> >>>> start, for architectures where zero offset is smaller/faster than
> >>>> non-zero offset.
> >>>>
> >>>>
> >>>> Another idea, of course, is to look into "__thread",
> >>>> "__declspec(thread)".
> >>>>
> >>>> On Windows and probably all platforms they exist on, they are
> >>>> nicely more efficient than pthread_get/setspecific, except on
> >>>> Windows they don't really work acceptably prior to Vista -- they
> >>>> only work in .exes and their static dependencies, not any .dll
> >>>> you load after the process starts with LoadLibrary (dlopen).
> >>>>
> >>>>
> >>>> Does "__thread" work well on most non-Windows platforms?
> >>>> i.e. even if shared object is loaded with dlopen?
> >>>>
> >>>>
> >>>> I could have sworn I saw code out there that was "adaptive".
> >>>> It easily/efficiently checked if it was loaded with LoadLibrary
> >>>> or not.
> >>>> If so, it'd TlsGet/SetValue (pthread_get/setspecific).
> >>>> If not, it'd use __declspec(thread) (__thread).
> >>>> The check was based on if __tlsindex was not zero or somesuch. I
> >>>> couldn't track it down though.
> >>>>
> >>>>
> >>>> In either case, yes, I know, one of the thread locals at least is
> >>>> gone on platforms that have stack walkers, e.g. Solaris, and
> >>>> potentially NT, and maybe others.
> >>>>
> >>>>
> >>>> - Jay
> >>>>
> >>>>
>
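For readers of the archive, a small C sketch contrasting the two kinds of thread-local storage discussed in the oldest message above: a compiler-supported "__thread" variable and a pthread key. The function names are hypothetical, and "__thread" is assumed to be available as the GCC/Clang extension. The sketch only illustrates that both give each thread its own slot. It does not answer the question of how "__thread" behaves in shared objects loaded with dlopen, and it does not measure which is faster.

```c
#include <pthread.h>
#include <stddef.h>

/* Compiler-supported TLS (GCC/Clang extension, assumed available). */
static __thread void *fast_slot;

/* Library-supported TLS through a pthread key. */
static pthread_key_t slow_key;
static pthread_once_t slow_once = PTHREAD_ONCE_INIT;

static void make_slow_key(void)
{
    pthread_key_create(&slow_key, NULL);
}

void set_fast(void *p) { fast_slot = p; }
void *get_fast(void)   { return fast_slot; }

void set_slow(void *p)
{
    pthread_once(&slow_once, make_slow_key);
    pthread_setspecific(slow_key, p);
}

void *get_slow(void)
{
    pthread_once(&slow_once, make_slow_key);
    return pthread_getspecific(slow_key);
}

/* Runs on a fresh thread: records whether either slot is non-NULL there. */
static void *probe(void *out)
{
    int *seen = out;
    *seen = (get_fast() != NULL) || (get_slow() != NULL);
    return NULL;
}

/* Returns 0 if a new thread sees both slots empty, 1 otherwise. */
int other_thread_sees_value(void)
{
    pthread_t t;
    int seen = 1;

    if (pthread_create(&t, NULL, probe, &seen) != 0)
        return 1;
    pthread_join(t, NULL);
    return seen;
}
```

A caller can store a pointer with set_fast and set_slow on one thread and then check with other_thread_sees_value that a second thread starts with both slots empty.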