[M3devel] changing m3middle/m3back to have multiple passes?

Jay K jay.krell at cornell.edu
Sat Apr 3 03:34:22 CEST 2010


Good points.

 

 

I forgot to clarify that, to some extent, I just want to make
multiple passes over functions, not files.
  => even less memory use and slowdown


 

In particular, I want to know if a frame size is < 256, < 4K, or >= 4K.
The >= 4K case isn't handled correctly today, I believe, though it will generally work.
< 256 affords one byte of savings.
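
For a rough sketch of the encoding and correctness differences involved (the exact
byte counts depend on which instruction forms the backend picks; the immediates below
are just illustrative x86):

  sub esp, 124     ; small frame: the short imm8 form of sub
  sub esp, 2048    ; larger frame: the imm32 form
  ; a frame of 4K or more can't just move esp past the guard page in one step;
  ; each 4K page has to be touched in order (a probe loop or a __chkstk-style
  ; helper), otherwise stack growth can be skipped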


 

I'd also like to know which registers are used, for 1, 2, or 3 bytes saved.


 

While these are small, they are very broadly applicable.


 Consider that even the smallest of functions:
   PROCEDURE F(): INTEGER = BEGIN RETURN 1 END F;

compiles/compiled to something like:
  push ebp
  mov ebp, esp
  sub esp, 0 ; yes, we are willing to put zero here, and it takes 2 bytes
  push esi ; I nop this out now.
  push edi ; I nop this out now.
  push ebx ; I nop this out now.
  mov eax, 1
  pop ebx ; I optimize this out now.
  pop edi ; I optimize this out now.
  pop esi ; I optimize this out now.
  leave ; I think
  ret

 

 

when, if you punt frame pointer omission, it could be:
  push ebp
  mov ebp, esp
  mov eax, 1
  leave ; I think.
  ret


 

 or if you allow for frame pointer omission: 
   mov eax, 1 
   ret

 


(Let's assume its address is taken, so it can't always easily be inlined.)


"imagine if all code was multiplied out like this"

 


Omitting the frame pointer hurts stack walking and debugging; however,
I think there is a middle ground that people don't discuss:
  if a function takes no parameters and has no locals, then omit the frame pointer.

 


Possibly also if it has no locals and/or makes no function calls.
 Like, if it never uses esp or ebp.


 

Actually this is encoded in the PowerPC and AMD64 calling conventions: leaf
functions can optimize their prologue/epilogue. But it doesn't seem to come up
much regarding x86.
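
For example (a sketch in the System V AMD64 convention, not something m3back
emits today), a leaf routine can keep a local in the 128-byte red zone below
rsp with no prologue at all:

  ; first integer argument arrives in edi
  mov dword ptr [rsp-4], edi   ; spill into the red zone without adjusting rsp
  mov eax, dword ptr [rsp-4]
  add eax, 1
  ret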

 

 

There again, multiple passes help.

 


Multiple passes over a function would also offer things like
 eliminating dead stores and cascading elimination of dead code, e.g.:

  a := b + 1; (* eliminate this whole line *)
  a := b + 2;
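
A slightly fuller sketch of the cascade (illustrative Modula-3 only): once the
dead store to a is removed, the computation of c that fed it becomes dead as well.

  PROCEDURE P (b: INTEGER): INTEGER =
    VAR a, c: INTEGER;
    BEGIN
      c := b * 3;    (* becomes dead once the store below is eliminated *)
      a := c + 1;    (* dead store: a is overwritten before it is read *)
      a := b + 2;
      RETURN a;
    END P;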


 

Granted, I am interested in multiple passes over a file too, for inlining.

 


What I'm really going to do, though, is try to provide ObjFile.Move so that,
at the end of the function, I can "move" the entire function up or down
a few bytes, or maybe only up. You can imagine "down" to fix the >= 4K case,
but you might as well model that as reserving room for the worst case
in the first pass, and then usually moving down.
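
Roughly the kind of interface I have in mind -- a hypothetical sketch; the name,
the parameters, and the type T are illustrative, not existing m3back code:

  PROCEDURE Move (t: T; start, length, delta: INTEGER);
  (* Shift the bytes of the current text section in [start, start + length)
     by delta bytes, and fix up every relocation and label that points into,
     or lives inside, the moved range. *)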

 


I think it just requires that I understand relocs much better,
which is a good goal anyway for working with the code.
You know -- perform a not especially well-motivated change to some code,
as an exercise in understanding the code and how to change it,
so that when a better-motivated change comes along, we aren't stuck.

 


Also, I only recently realized/learned the high value of "sequential access"
as it relates to "working set".
What if you really have more data than memory?
 In a compiler we don't, but some applications do (e.g. plain-text indexing of large data sets).

 


Then what you do is:
   To whatever extent you can, break the data down into smaller pieces.
     At some level, random access is unavoidable, say, to sort the small pieces.
   When you can't break the data down:
    Access it linearly, not randomly: mergesort instead of quicksort
    (see the sketch below).
    Minimize the number of passes, but there is a tradeoff between the number
    of passes and the working-set size.

  Linear access keeps the working set "constant", even if you are slowed down
  by having to reread data from disk.
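
The merge step is where the linear access shows; here is a minimal array-based
sketch in Modula-3 (illustrative only -- a real external sort would read the
runs from disk):

  (* Merge two sorted runs; each element is read exactly once, in order,
     so the working set is just the current position in each run. *)
  PROCEDURE Merge (READONLY a, b: ARRAY OF INTEGER;
                   VAR out: ARRAY OF INTEGER) =
    VAR i, j, k := 0;
    BEGIN
      WHILE i < NUMBER (a) AND j < NUMBER (b) DO
        IF a [i] <= b [j]
          THEN out [k] := a [i]; INC (i);
          ELSE out [k] := b [j]; INC (j);
        END;
        INC (k);
      END;
      WHILE i < NUMBER (a) DO out [k] := a [i]; INC (i); INC (k); END;
      WHILE j < NUMBER (b) DO out [k] := b [j]; INC (j); INC (k); END;
    END Merge;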

 

 

Yes, I do see that message from the front end.
I haven't looked into what it means, but presumably it's something
like a pessimistic first pass, followed by discovery of layout information,
followed by a more optimistic second pass.
You might imagine, for example, knowing the sizes of opaque or object types.
The front end makes multiple passes, I suspect for significant simplicity
or "possibility" -- it might be darn difficult to implement in fewer passes
without contorting the code a lot.
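
For instance (an illustrative sketch only; the names are made up), a later pass
can see layout that an earlier pass of a client cannot, via the revelation of an
opaque type:

  INTERFACE Foo;
  TYPE T <: REFANY;    (* opaque: the referent's layout is hidden from clients *)
  END Foo.

  MODULE Foo;
  REVEAL T = BRANDED REF RECORD a, b: INTEGER END;  (* concrete layout known here *)
  BEGIN
  END Foo.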


 

If you look at how "patching" and "labels" work in the backend, that's probably
an example of where multiple passes would be more general and simpler,
but where the "contortions" are well known and not too bad.


 

It uses "patching" to fill in the frame size for example, and I now use
that to nop out unnecessary prolog instructions (on the theory that
nop it is a small optimization).


 

Though, again, patching isn't sufficient for >= 4K frames.
There is a very long-standing bug there; I have to confirm and fix it soon.


 

Good points about whole program optimization.

 


I think there are ways to address this, though not any time soon.
Doing optimization and codegen at "link time" is, I think, somewhat of a hack
in these systems -- one we can't easily play into, as you point out.
However, where we use the gcc backend and linker, we probably can, easily.
 That is many platforms.
 gcc does have this option.

 


Also, we could optimize "among" all the *.m3 files, and just lose out on any M3<=>C optimizations.

 


In reality, let me first think about NTObjFile.Move as a way to do some small optimizations,
then NTObjFile CodeView generation, maybe inlining, maybe LLVM,
and testing/optimizing the atomics (add, sub, and xor can improve; also store and probably load).
Everything else is too much for the time being.
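
For the atomics, this is the sort of x86 improvement I mean (a sketch only;
[mem] is a placeholder, and this isn't necessarily what m3back emits today):

  lock add dword ptr [mem], eax   ; fetch-and-add when the old value is unused,
                                  ; instead of a lock cmpxchg retry loop
  xchg eax, dword ptr [mem]       ; sequentially consistent store (implicitly locked)
  mov eax, dword ptr [mem]        ; an acquire load on x86 is an ordinary mov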

 


Maybe we can do some target-independent optimizations in m3front, m3front/cg, or m3middle as well.
Inlining could be done there, but the metrics for deciding when to do it aren't really available.
Only for *extremely* cheap-seeming functions could the front end make a reliable target-independent choice.

Oh wait -- that is interesting.

 


If you consider just the case of inlining functions whose definitions precede their use,
the backend could "measure" a function after generating it and report back to the frontend
whether it should strongly consider inlining it. Then, on subsequent calls, the frontend could
do the inlining.

Such measurement is always going to be heuristic, but you can imagine criteria like:
 number of cg calls
 number of parameters
 does it return a value
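
A minimal sketch of such a heuristic -- hypothetical; the name, inputs, and
thresholds are illustrative, not existing m3middle/m3back code:

  PROCEDURE LooksWorthInlining (cgCalls, params: INTEGER;
                                returnsValue: BOOLEAN): BOOLEAN =
    (* Very rough: a tiny body with few parameters that returns a value
       looks like a cheap accessor, worth inlining at later call sites. *)
    BEGIN
      RETURN cgCalls <= 20 AND params <= 2 AND returnsValue;
    END LooksWorthInlining;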

 

 


As well, if we had a <* FORCEINLINE *> pragma, the frontend could implement that.
I'm pretty sure.
It's not clear whether it's better to leave the decision to the compiler or not.
After all, C and C++ have "inline" and "__inline", which are just hints
that modern compilers ignore, and Visual C++ has "__forceinline", which
generates a warning when not honored. It seems to be a bit of an arms race
in terms of who thinks they know better, programmer or compiler.
It used to always be the programmer, then it shifted to the compiler, and now it's not clear.
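
Usage of such a pragma might look like this (purely hypothetical -- no such
pragma exists in the compiler today):

  <* FORCEINLINE *>
  PROCEDURE Square (x: INTEGER): INTEGER =
    BEGIN
      RETURN x * x;
    END Square;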


 

 

 - Jay






 
> Date: Sat, 3 Apr 2010 21:20:57 -0400
> From: hendrik at topoi.pooq.com
> To: m3devel at elegosoft.com
> Subject: Re: [M3devel] changing m3middle/m3back to have multiple passes?
> 
> On Thu, Apr 01, 2010 at 03:10:04PM +0000, Jay K wrote:
> > 
> > However memory capacity has increased tremendously.
> > 
> > 
> > 
> > 
> > 
> > Whereas I'm only thinking about multiple passes, while still compiling 
> > one file at a time,
> > 
> > other compilation systems have capitalized on larger memories and now 
> > compile entire programs at once, including multiple passes.
> > 
> > This goes by multiple names -- "whole program optimization", "link 
> > time codegen (LTCG)".
> 
> Link-time code generation was already involved in Mark Rain's Mary 2 
> compiler in the early 80's. It didn't actually require today's huge 
> memories.
> 
> Every now and then the Modula 3 compiler seems to decide to recompile 
> something because new source had been discovered, or something like that. 
> I suspect this is because a new .m3 file changes the memory layout that 
> was assumed in another source file, so that other one needs to be 
> reconsidered. Is the complexity of tracking that simpler or more 
> complex than doing whole-program compilation?
> 
> It's possible that file-by-file compilation is a simpler match for 
> interacting with foreign-language compilations, though. If two 
> languages insist on only compiling whole programs, it makes 
> mixed-language programming difficult.
> 
> -- hendrik
 		 	   		  