[M3devel] purported condition variable problems on Win32?

Rodney M. Bates rodney_bates at lcwb.coop
Wed Jul 20 18:44:24 CEST 2016


In looking at ThreadWin32.m3 more, I think I see a significant bug.

This gets confusing, since there are Modula-3 waits, signals,
etc. and similar concepts in Win32, but which are not the
same.  I am going the try to put "M3-" and "Win-" prefixes on
things to make this clearer.  I cartainly need to do this
for my own benefit, if for nobody else.

Suppose two threads are both M3-waiting in Wait(m,c,...), further
Win-waiting on c.waitEvent in WaitForMultipleObjects at :301.
c.tickets = 0, and c.waiters = 2.

Now, (M3-)Signal (c) happens.  This does (Win-)SetEvent(c.waitEvent),
which will allow both the waiters to Win-wakeup sometime soon.
Before releasing c.lock, Signal increments c.tickets to 1 (supposedly
to eventually allow only one waiter to M3-wakeup) and increments
c.counter, setting things up so that the waitDone expression at :308
will be TRUE for the waiting threads, the next time they evaluate it.

Each waiter in turn Win-wakes, gets c.lock, sets waitDone TRUE,
releases c.lock, and tries to acquire m.  The problem arises
because the second waiter can get to these steps before the first
can acquire m, acquire c.lock, and DEC(c.tickets) at :322.  The
second to reach :308 will find c.tickets # 0, as did the first,
and proceed.

Each will eventually, do a M3-wake.  If this were Posix condition
variables, this would be wrong, because Posix says only one
waiter proceeds.  For M3, it's OK, if unnecessary, because M3
says one or more waiters proceed.  However, each will, in the
process, get to :322, and DEC(c.tickets).  Signal only provided
one ticket, but the waiters have taken two.  c.tickets goes
negative.

After this, when some thread can do another (M3-)Wait, and some
other does another (M3-)Signal, c.tickets increments only back to
zero, which means the new waiter will not M3-wake, even though it
should.


I am going to try to construct a test case that will force
this scenario.

See more below:


On 07/18/2016 08:53 PM, Jay wrote:
> I wrote this version. Right, based on Schmidt. I also found similar code in Java. There are comments as to both of these in the code.
>
> Otherwise previous version I think had a very large lock.
>
> Changing the < to !=, is that really ok? I get that 32bit overflow is possible.
>
> We can change to a 64bit counter if < is required. That really can't rollover. I.e. Even a 64bit cycle counter can't rollover, and this would advance much slower.
>
> I guess you are saying it would be unfair, which is ok.
>
> There is pattern called "directed notification" or such that might be preferable? It has a certain significant inefficiency though, like creating an event per notify. I think it is what boost's condition variable are using.
>
> I'll check on the unused event.
>
>   - Jay
>
>> On Jul 18, 2016, at 7:25 AM, "Rodney M. Bates" <rodney_bates at lcwb.coop> wrote:
>>
>>
>>
>>
>>
>> I spent some time looking over the signal part of ThreadWin32.m3.  A few
>> comments:
>>
>> This looks like a very direct implementation of Schmidts' generation count
>> algorithm, with a number of important little details additionally taken care
>> of.
>>
>> On small difference is at line 308 (B, below) where Schmidt does
>> c.counter>count instead of #.  Usually, this would make no difference,
>> since counter only increases, but on a long-running program, it could
>> overflow.  If this happened,as it often does, as silent wrap-around to
>> FIRST(INTEGER), some threads could be trapped for an extremely long
>> time, waiting for counter come back around greater than their count.  As
>> coded in ThreadWin32, they would be allowed to proceed normally.
>>
>> The only glitch would be that, if there were simultaneously waiting threads
>> separated by NUMBER(INTEGER) truly different generations (but the same value
>> of c.counter), the recent ones could unfairly proceed along with the
>> extremely old ones.  That seems extremely less likely than the
>> mere existence of an overflow.
>>
>> Both type Condition and type Activation have a field waitEvent: HANDLE.
>> Condition.waitEvent is extensively used, but Activation.waitEvent is
>> only created at :537 and deleted at :554, but never used in any real
>> way.  It could be deleted.
>>
>> As for this code:
>>
>>     EnterCriticalSection(conditionLock);
>>
>>       (* Capture the value of the counter before we start waiting.
>>        * We will not stop waiting until the counter changes.
>>        * That is, we will not stop waiting until a signal
>>        * comes in after we start waiting.
>>        *)
>>
>>       count := c.counter;
>>       INC(c.waiters);
>>
>>     LeaveCriticalSection(conditionLock);
>>     m.release(); (* can this be moved to before taking conditionLock? *)
>>
>> No, it is important to sample and save c.counter in the order threads wait.
>> Retaining m until this is done ensures this.  Otherwise, some other thread
>> that waited later could have altered c.counter before this one gets conditionLock
>> and saves its copy of c.counter.
>>
>> I have made several comment changes to ThreadWin32.m3 that would have made
>> it easier to vet.  I will commit these, but only comments.  It is either
>> very difficult or impossible for me to test or even recompile this module,
>> and, especially as fragile as this kind of code is, I don't want to commit
>> any substantive changes untested and uncompiled.
>>
>> More below:
>>
>>> On 07/07/2016 04:24 AM, Jay K wrote:
>>> So...I do NOT understand all of this.
>>>
>>>
>>> However, comparing the three implementations
>>> and attempting to understand ours,
>>> I am struck by the following
>>>
>>>   - Our broadcast seems ok, but
>>>   - our signal seems to wake everyone,
>>>     and then...they race a bit,
>>>     one will decide it is last, and
>>>     reset the event, but every prior waiter
>>>     is still woken.
>>> Why not merge the following chunks:
>>
>> I don't think this will work.  It is important that, once a thread has decided,
>> at A, that it can fairly proceed, it is guaranteed to be the next thread to proceed
>> (i.e, to return from [Alert]Wait), before it decrements c.tickets at C, and, thus,
>> if it is also the last of its generation to proceed, resets c.waitevent.  This
>> is ensured by its having already reacquired m first, preventing any other thread
>> from returning from its Wait yet.

I was wrong about this.  There is already no fairness being enforced here.  Even with
only one ticked, all waiters Win-wake at :301, the order they get c.lock at :307 is
arbitrary anyway.  There is no  benefit in trying to control the order they reacquire m.
c.counter has nothing to do with fairness.  What it does is prevent Win-awakened waiters
that waited after the most recent Signal/Broadcast from M3-waking, a correctness issue.

Merging, as you proposed, is probably close to the fix for this bug.


>>
>> And it can't attempt to acquire m while already holding conditionLock, because this would
>> invite deadlock, since, elsewhere, a thread can try to acquire conditionLock while
>> already holding m, at line 265.
>>
>>>
>>>
>>>      WHILE (NOT alerted) AND (NOT waitDone) DO
>>> .
>>> .
>>> .
>>>   1
>>>        EnterCriticalSection(conditionLock);
>>>          waitDone := (c.tickets # 0 AND c.counter # count);
>> A:---------^
>> B:--------------------------------------------------^
>>>        LeaveCriticalSection(conditionLock);
>>>      END; (* WHILE *)
>>>      IF waitDone THEN
>>>        alerted := FALSE;
>>>      END;
>>>      m.acquire();
>>> 2
>>>      EnterCriticalSection(conditionLock);
>>>        DEC(c.waiters);
>>>        IF waitDone THEN   esp. here.
>>>          DEC(c.tickets);
>>
>> C:---------------^
>>>          lastWaiter := (c.tickets = 0);
>>>        END;
>>>      LeaveCriticalSection(conditionLock);
>>>
>>>
>>>   That is, if we decrement tickets earlier
>>>   within the first critical section,
>>>   while we will still wake everyone,
>>>   only one will decide waitDone and the rest will keep looping.
>>>   A downside of this, perhaps, is that waking all for Broadcast
>>>   might be a little slower.
>>>   but more so, in both the "ptw32" (pthreads for win32) and Boost threads
>>>   implementations, they seem to deal with this differently than us,
>>>   and they each do about the same thing -- they use a counted semaphore.
>>>   In the boost case, it appears they duplicate the semaphore for every
>>>   notification generation, which seems expensive.
>>>
>>>   Perhaps if they can guarantee some lifetimes, they don't need to duplicate it.
>>>   Or they can do their own reference counting?
>>>   jdk7 seems to looke like jdk6.
>>>   The code is gone in jdk8 and I can't easily find the delete in history.
>>>    The lock merging jdk does probably helps here too.
>>>    In fact they merge #1 and #2 above as a result.
>>>    But they still initially wake all threads for signal, not just broadcast.
>>>    There is also the problem that all the event waits
>>>    are followed by EnterCriticalSection (or jdk facsimile).
>>>    - Jay
>>>
>>>
>>>
>>>
>>> ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------!
>> ---
>>> From: jay.krell at cornell.edu
>>> To: m3devel at elegosoft.com
>>> Subject: RE: [M3devel] purported condition variable problems on Win32?
>>> Date: Tue, 5 Jul 2016 08:26:36 +0000
>>>
>>> Here is another implementation to consider:
>>>
>>> https://github.com/boostorg/thread/blob/develop/include/boost/thread/win32/condition_variable.hpp
>>>
>>>   - Jay
>>>
>>>
>>> ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------!
>> ---
>>> From: jay.krell at cornell.edu
>>> To: m3devel at elegosoft.com
>>> Date: Tue, 5 Jul 2016 08:19:23 +0000
>>> Subject: [M3devel] purported condition variable problems on Win32?
>>>
>>>   https://sourceforge.net/p/pthreads4w/code/ci/master/tree/README.CV
>>>
>>>   vs.
>>>
>>> http://www.cs.wustl.edu/~schmidt/win32-cv-1.html
>>>   vs.
>>>   https://modula3.elegosoft.com/cgi-bin/cvsweb.cgi/cm3/m3-libs/m3core/src/thread/WIN32/ThreadWin32.m3?rev=1.210.2.1;content-type=text%2Fplain
>>>   I wrote this:
>>>   PROCEDURE XWait(m: Mutex; c: Condition; act: Activation;
>>>                  alertable: BOOLEAN) RAISES {Alerted} =
>>>    (* LL = m on entry and exit, but not for the duration
>>>     * see C:\src\jdk-6u14-ea-src-b05-jrl-23_apr_2009\hotspot\agent\src\os\win32\Monitor.cpp
>>>     * NOTE that they merge the user lock and the condition lock.
>>>     * http://www.cs.wustl.edu/~schmidt/win32-cv-1.html
>>>     *   "3.3. The Generation Count Solution"
>>>     *)
>>>
>>>   Do we have the problems described in README.CV?
>>>
>>>   I haven't looked through the ACE code to see
>>>   to what extent they resemble solution 3.3, or if they
>>>   changed as a result of this discussion -- which I admit I don't understand
>>> and haven't read closely.
>>>
>>>
>>> Spurious wakeups are ok, though should be minimized.
>>>
>>> I'd still rather not drop pre-Vista support but I realize it becomes more interesting as time advances.
>>>
>>>   Thank you,
>>>    - Jay
>>>
>>> _______________________________________________ M3devel mailing list M3devel at elegosoft.com https://m3lists.elegosoft.com/mailman/listinfo/m3devel
>>>
>>>
>>> _______________________________________________
>>> M3devel mailing list
>>> M3devel at elegosoft.com
>>> https://m3lists.elegosoft.com/mailman/listinfo/m3devel
>>
>> --
>> Rodney Bates
>> rodney.m.bates at acm.org
>>
>>
>

-- 
Rodney Bates
rodney.m.bates at acm.org



More information about the M3devel mailing list