[M3devel] SIGCHLD diff

Thu Feb 17 23:56:05 CET 2011

Hi Olaf,

I have committed the SIGCHLD diff and run quite heavy jobs with it 
in the code, and so far I have not seen any problems.  I am fairly sure
there won't be any.

Can you show me how to integrate the test program?  Note that when it fails
it quite often (usually) hangs rather than producing any specific error
output.

I have also definitely run into pthreads problems on I386_DARWIN now,
not running the threadtest program.  They show up when I enable the
parallel back-end builds (I suspect they will do that on all architectures
with pthreads---I've rebuilt all my CM3 installations with user threads).

    Mika

Olaf Wagner writes:
>Yes, please commit changes that improve matters, even if they are not
>perfect. Perhaps Tony will review your latest suggestion again?
>
>Your thread test program should be integrated into the Hudson regression
>test framework for the cm3 current line; I don't think it is already.
>That way we would get an overview of the thread state for all platforms
>at least once a day. It would be great if the tests would show up
>in the cm3-current-test-m3tests-TARGET jobs; it shouldn't be too
>difficult. Basically, only one script should need to be extended, and
>some XML result generation added. IIRC it's in cm3/scripts/regression/defs.s=
>h
>in test_m3tests(), or (in quake) in the m3tests m3makefile.
>
>Can you do that? If not, I may have a look myself at the weekend...
>
>Olaf
>
>PS: We really should get all our code base into a state that you can
>run all your applications without any problems on current. After all
>there aren't that many active M3 users...
>
>PPS: I need to return to my offline testing of LDAP protocol handlers
>for online charging servers now :-/
>
>Quoting Mika Nystrom <mika at async.caltech.edu>:
>
>> Hi again m3devel,
>>
>> Does anyone understand the code in ThreadPosix.m3?
>>
>> After looking through it again I think my attached solution (sent out
>> yesterday afternoon) is right, but I'm still just a little bit fuzzy on
>> how the code works (is supposed to work) but I have some hypotheses.
>>
>> I realized it's been a long, long, loooong time since I coded anything
>> with signals and signal handlers and I've forgotten much of the semantics.
>> After all since Concurrent Pascal came out in '76 signals has always been
>> the "old" way of doing things, right?  (And I quit programming C in favor
>> of M3 in 1999...)
>>
>> In ThreadPosixC.c, there's a function called setup_sigvtalrm that
>> registers the Modula-3 procedure switch_thread as a signal handler.
>>
>> My understanding is that in C, signals are masked atomically on the
>> call into the signal handler and unmasked atomically when the handler
>> returns (unless the signal mask is changed, etc., while in the handler).
>>
>> However, switch_thread... doesn't return?  Or?  Or it does a longjmp
>> somewhere?  Which means signals may not be unmasked(?)  Which is the reaso=
>n
>> for the calling "allow_sigvtalrm" as the first statement of the signal
>> handler?
>>
>> Now I can't just turn on ALL signals in the signal handler because then
>> I may get very deeply nested signal handlers, so I only call allow_allsign=
>als
>> (or whatever we call it) from the first-level signal handler.  Hmm.
>> Maybe that is not quite right?  Do we eventually get back there?  Yes
>> I think we must because certain things depend on inCritical being zero...
>> and when inCritical is zero SIGCHLD is again enabled.
>>
>> No I still don't fully understand how the code is supposed to work,
>> especially in the presence of nested signal handling.  Does anyone
>> understand better?
>>
>> (I am still going to try to commit my patch later today because I don't
>> think it could possibly make things worse and it does make a significant
>> real-world improvement to the compiler performance under user threads.)
>>
>>     Mika
>>
>> Mika Nystrom writes:
>>> Hi Tony,
>>>
>>> I found a problem with the change.  I would appreciate it if you and
>>> Jay could think about what I found and my solution (and anybody else
>>> is welcome to as well, of course).
>>>
>>> When fiddling with cm3 to get it to run in parallel, I pushed it up to
>>> fork up to 100 processes at once, just to see what happens.  When I did
>>> that (with my SIGCHLD patch) I got a few errors such as illegal instructi=
>on
>>> and a bus error or two.  I believe what was happening was that I was
>>> taking too many signals at once, because of the following code in
>>> ThreadPosix.m3:
>>>
>>> PROCEDURE switch_thread (sig: int) RAISES {Alerted} =3D
>>>  BEGIN
>>>    allow_sigvtalrm ();  (* here *)
>>>
>>>    INC(inCritical);
>>>    (* mark signal as being delivered *)
>>>    IF sig >=3D 0 AND sig < MaxSigs THEN
>>>      gotSigs :=3D gotSigs + SET OF Sig { sig }
>>>    END;
>>>    DEC(inCritical);
>>>
>>>    IF inCritical =3D 0 AND heapState.inCritical =3D 0 THEN
>>>      InternalYield ()
>>>    END;
>>>  END switch_thread;
>>>
>>> I found it a bit odd that the first action of the signal handler is to
>>> enable signals.  With VTALRM this is innocuous, but in my case, imagine
>>> what happens if you have a lot of exited child processes queued up.
>>> What happens is that your signal stack gets wound deeper and deeper, and
>>> each time you catch one SIGCHLD you ask for another.  Eventually you
>>> run out of stack space.
>>>
>>> My change, which appears to fix this problem, is to split up the "allow"
>>> function as follows:
>>>
>>> PROCEDURE switch_thread (sig: int) RAISES {Alerted} =3D
>>>  BEGIN
>>>    allow_sigvtalrm ();                               (* VTALRM only *)
>>>
>>>    INC(inCritical);
>>>    IF inCritical =3D 1 THEN allow_othersigs ();  END;  (* SIGCHLD *)
>>>
>>>    (* mark signal as being delivered *)
>>>    IF sig >=3D 0 AND sig < MaxSigs THEN
>>>      gotSigs :=3D gotSigs + SET OF Sig { sig }
>>>    END;
>>>    DEC(inCritical);
>>>
>>>    IF inCritical =3D 0 AND heapState.inCritical =3D 0 THEN
>>>      InternalYield ()
>>>    END;
>>>  END switch_thread;
>>>
>>> This way, SIGCHLD is not taken in nested signal handlers.  But is there a
>>> problem here?  Well, the fallback is that it takes 0.1 seconds to wake
>>> up on Process.Wait .
>>>
>>> Is there a chance that SIGCHLD signals are left pending?  My argument is
>>> no, since in the "base" environment (i.e., not executing in a  =20
>>> signal handler)
>>> SIGCHLD is enabled as a postcondition of allow_othersigs.
>>>
>>> All *other* occurrences of allow_sigvtalrm are of course replaced with
>>>
>>> allow_sigvtalrm(); allow_othersigs();
>>>
>>> The goal is that SIGCHLD be enabled precisely when SIGVTALRM is  =20
>>> except inside
>>> the signal handler itself.
>>>
>>>     Mika
>>>
>>>
>>> Tony Hosking writes:
>>>> Sorry, I missed you e-mail describing this change.  Looks good.
>>>>
>>>> Antony Hosking | Associate Professor | Computer Science | Purdue =3D
>>>> University
>>>> 305 N. University Street | West Lafayette | IN 47907 | USA
>>>> Office +1 765 494 6001 | Mobile +1 765 427 5484
>>>>
>>>>
>>>>
>>>>
>>>> On Feb 13, 2011, at 3:37 PM, Mika Nystrom wrote:
>>>>
>>>>> Tony,
>>>>> =3D20
>>>>> The old user threads has an old problem with Process.Wait.  It has a=3D=
>20=3D
>>>>
>>>>> Thread.Pause(0.1d0) in it.  This causes the compiler to run three =3D
>>>> times
>>>>> slower than it should.  The m3build I've been using for years has
>>>>> that Thread.Pause removed (so it uses 100% CPU instead, =3D
>>>> busy-waiting...)
>>>>> =3D20
>>>>> But yeah user threads works the same as it used to (except for=3D20
>>>>> now calling pthread_atfork, which I'll return to in another email...)
>>>>> =3D20
>>>>>     Mika
>> ...
>>
>
>
>
>--=20
>Olaf Wagner -- elego Software Solutions GmbH
>                Gustav-Meyer-Allee 25 / Geb=C3=A4ude 12, 13355 Berlin, Germa=
>ny
>phone: +49 30 23 45 86 96  mobile: +49 177 2345 869  fax: +49 30 23 45 86 95
>    http://www.elegosoft.com | Gesch=C3=A4ftsf=C3=BChrer: Olaf Wagner | Sitz=
>: Berlin
>Handelregister: Amtsgericht Charlottenburg HRB 77719 | USt-IdNr: DE163214194