[M3commit] CVS Update: cm3

Jay Krell jkrell at elego.de
Sun Mar 14 18:23:42 CET 2010


CVSROOT:	/usr/cvs
Changes by:	jkrell at birch.	10/03/14 18:23:42

Modified files:
	cm3/m3-sys/m3back/src/: Codex86.i3 Codex86.m3 M3x86.m3 
	                        Stackx86.m3 

Log message:
	flesh out all the atomic operations,
	including the 8-, 16-, and 64-bit operations
	
	There's some improvement to be had here still.
	
	Mainly, we overuse interlocked compare exchange loops.
	Some operations don't need them, e.g. add, sub, and xor.
	Sometimes even "or" and "and" don't need them, but we aren't
	likely to have sufficient context to discover that.
	In particular, if the caller throws away the returned old value,
	then a direct "lock or" or "lock and" instruction can be used
	instead of an interlocked compare exchange loop.
	
	Compare the code generated for these two functions:
	
	char __fastcall or8(char* a, char b) { return _InterlockedOr8(a, b); }
	void __fastcall or8v(char* a, char b) { _InterlockedOr8(a, b); }
	
	The second is just two instructions including the ret.
	The first contains a loop.
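	To see why, here is a sketch of both shapes in portable C. The
	function names are made up, and GCC/Clang __sync builtins stand in
	for the MSVC _InterlockedOr8 intrinsic, so this is illustrative
	rather than the actual backend output:

```c
/* When the caller needs the old value, the compiler must emit a
 * compare-exchange loop: guess the current value, try to install
 * old|b, and retry if another thread changed it in between. */
static char or8_with_old(char *a, char b)
{
    char old, prev;
    do {
        old = *a;                                         /* guess */
        prev = __sync_val_compare_and_swap(a, old, (char)(old | b));
    } while (prev != old);                                /* retry on contention */
    return old;
}

/* When the old value is discarded, a single "lock or" suffices;
 * compilers can reduce __sync_fetch_and_or to that encoding when
 * the result is unused. */
static void or8_no_old(char *a, char b)
{
    (void)__sync_fetch_and_or(a, b);
}
```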
	
	Also, we might be able to get the zero flag more efficiently in
	some cases, e.g. the non-64-bit cases. We could xor a register up
	front, sete into it, and mov it into eax, instead of going through
	memory (probably the same number of instructions, but with
	increased register pressure).
	
	Also, the register allocator (procedure find) should get an explicit
	flag to force the ecx:ebx allocation (and edx:eax).
	
	We should also try inlined tests, as opposed to going through the
	Atomic interface, and verify we don't unnecessarily enregister the
	atomic variable's address. More testing is needed in general; so far
	I've only been inspecting the code generated for m3core.
	
	We should also perhaps implement looser load/store where we only
	have a barrier on one side instead of both.
	Presently every atomic load/store is both preceded and followed by a serializing xchg.
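	A sketch of what the looser semantics could look like, expressed
	with the GCC/Clang __atomic builtins (hypothetical helper names,
	not current backend code): an atomic load only needs ordering
	after it (acquire), and an atomic store only needs ordering
	before it (release), so on x86 both can compile to a plain mov
	rather than being bracketed by serializing xchg instructions:

```c
/* Acquire load: no later access may be reordered before it.
 * On x86, an ordinary mov already satisfies this. */
static long atomic_load_acquire(long *p)
{
    return __atomic_load_n(p, __ATOMIC_ACQUIRE);
}

/* Release store: no earlier access may be reordered after it.
 * Again a plain mov suffices on x86; only sequentially
 * consistent stores need an xchg or mfence. */
static void atomic_store_release(long *p, long v)
{
    __atomic_store_n(p, v, __ATOMIC_RELEASE);
}
```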
	
	We should also consider if the barrier variable should be one widely
	shared global instead of everyone having to waste 4 bytes of stack for it.
	
	Note that AtomicLongint__Swap doesn't return the right value currently.
	
	Note that AtomicLongint__Load/Store should maybe use an interlocked
	compare exchange loop, like the funny thing AtomicLongint__Swap does
	(AtomicLongint__Swap uses interlocked compare exchange, but picks an
	arbitrary value, 0, for the first try).
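	For reference, a C sketch of that technique (made-up function name,
	GCC __sync builtin standing in for cmpxchg8b): a 64-bit swap built
	from a compare exchange loop that guesses 0 on the first try, and
	that does return the correct old value:

```c
#include <stdint.h>

/* Exchange *p for newval and return the previous contents.
 * The first compare-exchange guesses 0; if the guess is wrong,
 * the CAS reports the actual old value, which seeds the retry. */
static int64_t swap64(int64_t *p, int64_t newval)
{
    int64_t guess = 0;                 /* arbitrary first guess */
    for (;;) {
        int64_t old = __sync_val_compare_and_swap(p, guess, newval);
        if (old == guess)
            return old;                /* this is the value to return */
        guess = old;                   /* retry with the observed value */
    }
}
```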
	
	That is to say, AtomicLongint__Load/Store are not presently atomic!
	
	No other correctness problems known.



