[M3devel] frame per procedure instead of frame per TRY?

Jay K jay.krell at cornell.edu
Sun Jul 19 12:19:13 CEST 2015


NT/x86 is the slow one. Still much faster than Modula-3.

There is a linked list through fs:0.
fs:0 is thread local.
For code speed, the link/unlink can be inlined.
For code size, it can be a small function call.
Just like two instructions for function enter and exit.
And set a local volatile "as scopes are crossed".

Locals used in the except/finally block are not likely enregistered across calls.
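
To make the "one frame per function plus a volatile scope index" idea concrete, here is a minimal sketch in portable C. It is an illustration under assumptions, not the NT/x86 implementation: a `_Thread_local` pointer stands in for fs:0, a single `setjmp` per function stands in for the runtime's unwinder, and every name (`ProcFrame`, `proc_head`, `raise_per_proc`, `two_tries_one_frame`, `frames_linked`) is hypothetical.

```c
#include <assert.h>
#include <setjmp.h>
#include <stddef.h>

/* Hypothetical per-function frame: linked once, shared by all TRYs in it. */
typedef struct ProcFrame {
    struct ProcFrame *next;  /* enclosing function's frame */
    jmp_buf env;             /* stand-in for what the unwinder would restore */
    volatile int state;      /* which TRY scope is active; 0 = none */
    volatile int code;       /* exception code passed to the handler */
} ProcFrame;

/* Assumption: this thread-local plays the role of fs:0. */
static _Thread_local ProcFrame *proc_head = NULL;
static int links = 0;

static void raise_per_proc(int code) {
    assert(proc_head != NULL);
    proc_head->code = code;
    longjmp(proc_head->env, 1);
}

/* Two TRY blocks, but only one link, one setjmp, and one unlink. */
int two_tries_one_frame(int fail_first, int fail_second) {
    volatile int result = 0;
    ProcFrame frame;
    frame.state = 0;
    frame.code = 0;
    frame.next = proc_head;          /* link: once per function */
    proc_head = &frame;
    links++;

    if (setjmp(frame.env) != 0) {
        /* An exception arrived; the scope index says which TRY was active. */
        switch (frame.state) {
        case 1: frame.state = 0; result += 1;  goto after_first;
        case 2: frame.state = 0; result += 10; goto after_second;
        default:                     /* no TRY active: pass it outward */
            proc_head = frame.next;
            raise_per_proc(frame.code);
        }
    }

    frame.state = 1;                 /* entering first TRY: one store */
    if (fail_first) raise_per_proc(1);
    frame.state = 0;
after_first:
    frame.state = 2;                 /* entering second TRY: one store */
    if (fail_second) raise_per_proc(2);
    frame.state = 0;
after_second:
    assert(proc_head == &frame);
    proc_head = frame.next;          /* unlink: once per function */
    return result;
}

int frames_linked(void) { return links; }
```

Entering or leaving a TRY costs a single store to `frame.state`; the link and unlink happen once no matter how many TRY blocks the function contains.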

Compare this with current Modula-3:

  pthread_getspecific (TlsGetValue) to get the current head
  link it in
  setjmp

And this happens for every TRY, instead of just at most once per function.
The fs:0 link/unlink is at most once per function.
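
For comparison, the three steps listed above can be sketched in C roughly as follows. This is only an approximation of the shape of the work, not the actual Modula-3 runtime: a `_Thread_local` pointer stands in for the value fetched with pthread_getspecific (TlsGetValue), and the names `TryFrame`, `try_head`, `push_frame`, `pop_frame`, `raise_per_try`, `two_tries` and `frames_pushed` are hypothetical.

```c
#include <assert.h>
#include <setjmp.h>
#include <stddef.h>

/* Hypothetical frame layout: one of these per TRY. */
typedef struct TryFrame {
    struct TryFrame *next;
    jmp_buf env;
} TryFrame;

/* Assumption: stands in for the slot read via pthread_getspecific. */
static _Thread_local TryFrame *try_head = NULL;
static int pushes = 0;

static void push_frame(TryFrame *f) {
    f->next = try_head;   /* get the current head and link the frame in */
    try_head = f;
    pushes++;
}

static void pop_frame(TryFrame *f) {
    assert(try_head == f);
    try_head = f->next;
}

static void raise_per_try(int code) {
    TryFrame *f = try_head;
    assert(f != NULL);
    try_head = f->next;   /* unlink before transferring control */
    longjmp(f->env, code);
}

/* Two TRY blocks in one function: two pushes and two setjmp calls. */
int two_tries(int fail_first, int fail_second) {
    volatile int result = 0;

    TryFrame f1;
    push_frame(&f1);
    if (setjmp(f1.env) == 0) {
        if (fail_first) raise_per_try(1);
        pop_frame(&f1);
    } else {
        result += 1;      /* handler of the first TRY */
    }

    TryFrame f2;
    push_frame(&f2);
    if (setjmp(f2.env) == 0) {
        if (fail_second) raise_per_try(2);
        pop_frame(&f2);
    } else {
        result += 10;     /* handler of the second TRY */
    }
    return result;
}

int frames_pushed(void) { return pushes; }
```

Each call to `two_tries` pays for two pushes and two `setjmp` calls even when nothing is raised, which is the per-TRY cost being compared against the once-per-function link.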

And all the other NT platforms are faster.

They don't link/unlink anything.
They have metadata describing prologs.
The runtime can use that to restore nonvolatile registers (including the stack) at any point.
The codegen is somewhat constrained -- to be describable, but I suspect what you can describe encompasses anything a compiler would want to do.
Leaf functions have no data, and can't change nonvolatile registers, including rsp, and they can't make any calls (which would change rsp).

The tables are found from the return address.
The only dynamic data the runtime has to leave around is the actual return address.
No linked list, no volatile local indicating position in the function.
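
Finding the metadata from a return address amounts to a search over a sorted table of code ranges. The sketch below shows one way that lookup could look in C; the `FunctionEntry` layout, the `demo_table` contents and the `lookup_function` name are all made up for illustration and do not reproduce any real ABI's table format.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical table entry: one per non-leaf function, sorted by address. */
typedef struct FunctionEntry {
    uintptr_t begin;      /* first address of the function */
    uintptr_t end;        /* one past its last address */
    int unwind_info;      /* stand-in for the prolog description */
} FunctionEntry;

/* Made-up static table, as a compiler/linker might emit it. */
static const FunctionEntry demo_table[] = {
    { 0x1000, 0x1040, 1 },
    { 0x1040, 0x10a0, 2 },
    { 0x2000, 0x2010, 3 },
};
static const size_t demo_count = sizeof demo_table / sizeof demo_table[0];

/* Binary search for the entry whose [begin, end) range contains pc.
   Returns NULL when no entry covers pc, which this sketch treats as a
   leaf function with no unwind data. */
const FunctionEntry *lookup_function(const FunctionEntry *table,
                                     size_t count, uintptr_t pc) {
    assert(table != NULL || count == 0);
    size_t lo = 0, hi = count;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (pc < table[mid].begin) {
            hi = mid;
        } else if (pc >= table[mid].end) {
            lo = mid + 1;
        } else {
            return &table[mid];
        }
    }
    return NULL;
}
```

Nothing here runs on the non-exceptional path: the table is static data, and the search only happens once an exception is already being dispatched.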

fs:0 is the NT/x86 location.
This is a highly optimized thread local (fiber local actually).
I don't know what other ABIs use, if anything -- again, all the other NT platforms have no linked list, just return addresses and metadata.


Notes:
The non-x86 approach is sometimes referred to as "no overhead", as "TRY" doesn't do anything (except leave cold data around).
X86 exception dispatch is faster than non-x86. The stack is faster to walk, through the fs:0 linked list.
The premise is that exception dispatch can be slow.
The non-exceptional paths are what should be optimized.
And again, even NT/x86 is much more optimized than what Modula-3 does.


 - Jay



Date: Sun, 19 Jul 2015 12:06:20 +0200
From: estellnb at elstel.org
To: jay.krell at cornell.edu; m3devel at elegosoft.com
Subject: Re: [M3devel] frame per procedure instead of frame per TRY?

    On 2015-07-19 at 11:38, Elmar Stellnberger wrote:

      On 2015-07-19 at 11:10, Jay K wrote:

        I'm pretty sure it can work, but you also need a
        local "dense" volatile integer that describes where in the
        function you are. That isn't free, but it is much cheaper
        than calling setjmp/PushFrame for each try.

      Is it really that much faster? I remember implementing
      my own setjmp/longjmp in assembly some time ago, and it should
      only save you one procedure call but generate some additional
      jumps. However, I do not know how costly the new-fashioned
      register value obfuscation is (registers are no longer stored as
      they are but obfuscated for security reasons by glibc). XOR-ing with
      a simple value: does it really cost the world? I am not the one
      who can tell you whether a venture like this would pay off
      ...

    You are right. It would be somewhat faster, especially on AMD64, where
    we have a lot of registers to save ...

        Try writing similar C++ for NT/x86 and look at what you
        get.
        "PushFrame" is highly optimized to build a linked list
        through fs:0.
        And even that is only done at most once per function.

      Through fs:0? It used to be on the ss:[e/r]b in former times.
      Since pthreading it may also be fs:0 under Linux because of
      get/setspecific.
      I am not sure what these functions do in detail (something with fs
      at least).

      Nonetheless, I believe that avoiding the calls to get/setspecific
      could speed things up noticeably. First, there is the function call
      overhead; second, we need to touch an independent memory area; and
      last but not least, the stack is always thread local. However, I am
      not sure how we could otherwise place the top anchor for the linked
      list of exception frames. Push an exception frame pointer into every
      local variable area?

    However, I believe this is also worth considering as soon
    as we talk about m3cg support and speed.