Blocking, waiting, and await loops.
In the previous posts of this series, we built a tiny model checker capable of systematically exploring all possible interleavings of concurrent tasks. In Part 1, tasks had no shared state. In Part 2, we introduced shared memory and safety checking, which already allowed us to find real concurrency bugs.
So far, however, all tasks were always runnable!
In real concurrent programs, this assumption does not hold. Threads routinely wait for conditions to become true, spin until shared state changes, or block on locks and queues. In this post, we take the next step and introduce blocking and waiting.
Up to now, the scheduler had a very simple job: at each scheduling point, it could choose any task that had not yet terminated. There were no blocked tasks, no disabled transitions, and no notion of progress dependencies between tasks.
Once we introduce waiting, this changes fundamentally.
A task that is waiting cannot make progress on its own. If such a task were scheduled repeatedly, it would simply re-execute the same waiting step forever, making no progress and contributing nothing new to the exploration. In other words, the scheduler would be exploring executions that are trivially equivalent and uninformative.
To avoid this, we must distinguish between runnable tasks and blocked tasks, and only schedule tasks when they are able to make progress.
This raises several questions: how does a task indicate that it is waiting, how does it become runnable again, and what happens when no runnable task remains?
To keep the model executable and minimal, we start with the most common form of waiting in low-level concurrent code: busy-waiting on shared state.
A very common pattern in concurrent algorithms is the await loop:
while (x == 0)
    ;
In a real system, this loop may involve yielding, sleeping, or blocking
primitives. Typically x will be volatile or (ideally) have an atomic type
such as atomic_int. In our tiny model checker, however, we want to capture
the logical effect:
“This task cannot proceed until some shared state changes.”
To express this directly, we introduce a new DSL primitive:
(define (AWAIT loc)
  (await-location loc))
Intuitively, (AWAIT loc) suspends the current task and removes it from the
runnable set. Execution resumes only after another task writes to
location loc.
Unlike READ or WRITE, AWAIT is not merely an interleaving point: it blocks
the task entirely until it is explicitly re-enabled.
To support this, the WRITE operation is reformulated so that
it signals tasks waiting on the written location:
(define (WRITE loc val)
  (YIELD)
  (write-location loc val)
  (signal-location loc))
The signal-location operation moves all tasks blocked on loc back into the
runnable task queue.
With await-location and signal-location, the scheduler can now distinguish
between runnable and blocked tasks using a small monitor structure:
;; *loc-waiters* maps a location (as a string) to the list of
;; continuations of tasks currently blocked on that location.
(define *loc-waiters* (make-hash-table equal? string-hash 1024))

(define (await-location loc)
  (let* ((loc (location->string loc))
         (waiters (hash-table-ref/default *loc-waiters* loc '())))
    (call/cc
     (lambda (cc)
       ;; record the current continuation as a waiter on loc ...
       (hash-table-set! *loc-waiters* loc (cons cc waiters))
       ;; ... then hand control back to the scheduler; if no runnable
       ;; task is left, the whole system is blocked
       (if (no-tasks?)
           (error "AWAIT: termination violation")
           (resume))))))

(define (signal-location loc)
  (let* ((loc (location->string loc))
         (waiters (hash-table-ref/default *loc-waiters* loc '())))
    ;; move every waiter back into the runnable task queue
    (for-each push-task waiters)
    (hash-table-delete! *loc-waiters* loc)))
The monitor maps locations to waiting tasks. When a task executes (AWAIT loc), its current continuation is captured and stored in the waiter list for loc, and control passes back to the scheduler, which resumes some other runnable task. Every WRITE operation signals the written location, pushing any tasks waiting on it back into the runnable task queue.
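As a small usage sketch (hypothetical, not part of the ticket-lock example below), one task can block on a flag until another task sets it; the structure mirrors the wait-grant loop we will see shortly:
(INIT 'flag 0)

(define (waiter)
  (if (= 1 (READ 'flag))
      'done
      (begin
        (AWAIT 'flag)          ; block until someone writes 'flag
        (waiter))))            ; then re-check the value

(define (setter)
  (WRITE 'flag 1))             ; this WRITE signals 'flag and wakes the waiter

(SPAWN waiter)
(SPAWN setter)
(EXPLORE (lambda () (= (READ 'flag) 1)))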
This introduces a new kind of terminal execution. If the current task attempts
to block and there are no runnable tasks left (that is, (no-tasks?) returns
true), then the system has reached a state of global blocking.
In this checker, we treat global blocking as an error and report it as a counterexample.
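For contrast, here is a deliberately broken sketch (again hypothetical) in which every task blocks on a location that nobody ever writes; exploring it should immediately hit the termination violation:
(INIT 'never 0)

(define (stuck)
  (AWAIT 'never))              ; nobody ever writes 'never

(SPAWN stuck)
(SPAWN stuck)
(EXPLORE (lambda () #t))       ; exploration aborts with "AWAIT: termination violation"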
We now consider a classic synchronization primitive: a ticket lock. The lock uses two shared counters:
ticket: the next ticket to hand out,
grant: the ticket currently being served.

;; ticket lock implementation ----
(INIT 'grant 0)
(INIT 'ticket 0)
(define (take-ticket)
  (ATOMIC
   (let* ((tck (+ 1 (READ 'ticket))))
     (WRITE 'ticket tck)
     tck)))

(define (wait-grant t)
  (if (= t (READ 'grant))
      t
      (begin
        (AWAIT 'grant)
        (wait-grant t))))

(define (release t)
  (WRITE 'grant (+ t 1)))

(define (acquire)
  (wait-grant (take-ticket)))
We left a bug in this code on purpose. Can you spot it? We will see shortly how the checker detects it.
The client code that exercises this lock is straightforward. A task acquires the
lock, saving the ticket t, then reads location x, writes back the
incremented value, and finally releases the ticket t. Any modification inside
the critical section would work equally well.
(INIT 'x 0)
(define (task)
  (let ((t (acquire)))
    (let ((a (READ 'x)))
      (WRITE 'x (+ a 1))
      (release t))))
Update: Previously, I had t and a defined in the same let block without realizing that Scheme implementations are free to decide the order in which the bindings of a let are evaluated (and their side effects take place). To ensure that the acquire of the lock occurs before the READ of x, I have now split their definitions into two let blocks.
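For reference, the problematic shape was roughly the following (a reconstruction; the exact earlier code is not reproduced here):
;; Problematic: a single let may evaluate its bindings in any order,
;; so (READ 'x) could happen before (acquire), i.e. outside the lock.
(define (task-broken)
  (let ((t (acquire))
        (a (READ 'x)))
    (WRITE 'x (+ a 1))
    (release t)))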
We spawn three tasks and check that the critical section executes three times by
inspecting the final value of x.
(SPAWN task)
(SPAWN task)
(SPAWN task)
(EXPLORE
 (lambda () (= (READ 'x) 3)))
Running this example (see the files tmc-part3.scm and tmc-part3-ticketlock.scm), the checker quickly finds an execution where all tasks become blocked:
Error: AWAIT: termination violation
The issue is that take-ticket mistakenly returns the new ticket value rather
than the old one. As a result, the lock can skip a waiter, leaving all remaining
tasks blocked on a grant value that will never be produced.
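Concretely, with grant initialized to 0 and the buggy take-ticket handing out tickets starting at 1, no task can ever pass wait-grant. One schedule, sketched as comments, looks like this:
;; Illustrative schedule with the buggy take-ticket (three tasks):
;;   T1: take-ticket -> reads ticket=0, writes ticket=1, returns 1
;;   T2: take-ticket -> reads ticket=1, writes ticket=2, returns 2
;;   T3: take-ticket -> reads ticket=2, writes ticket=3, returns 3
;;   T1: wait-grant 1 -> grant=0, so T1 AWAITs 'grant
;;   T2: wait-grant 2 -> grant=0, so T2 AWAITs 'grant
;;   T3: wait-grant 3 -> grant=0, so T3 AWAITs 'grant
;; Ticket 0 is never handed out, grant never advances past 0,
;; and the checker reports the termination violation.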
This is our first example where an execution does not end because tasks finish,
but because the system reaches a state with no runnable tasks. In Parts 1 and 2,
this could not happen: every task was always runnable. With AWAIT, global
blocking becomes a meaningful terminal outcome.
The bug is subtle but fundamental: returning the wrong ticket value breaks the hand-off discipline of the lock and can cause the system to deadlock.
Here is a simple fix:
(define (take-ticket)
  (ATOMIC
   (let* ((tck (READ 'ticket)))
     (WRITE 'ticket (+ 1 tck))
     tck)))
With this change, each task receives a unique ticket, and grant advances
monotonically to serve tickets in order.
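One schedule with the corrected code, again sketched as comments, shows the intended hand-off:
;; Illustrative schedule with the corrected take-ticket:
;;   T1: take-ticket -> 0; grant=0 matches, so T1 acquires immediately
;;   T2: take-ticket -> 1; grant=0 /= 1, so T2 AWAITs 'grant
;;   T1: increments x, then (release 0) -> WRITE 'grant 1, waking T2
;;   T2: re-reads grant=1, which matches its ticket, and enters the
;;       critical section; releasing then writes grant=2 for the next ticket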
With the corrected ticket lock, the checker no longer finds deadlocks, and the postcondition holds. However, even for this tiny example, the number of explored executions grows rapidly.
This explosion is not specific to locks. Re-running the fixed SPSC ring buffer example from Part 2 with the same checker yields 385,253 executions.
Systematic exploration is powerful: it turns rare bugs into concrete counterexamples. At the same time, it explores many executions that are equivalent in observable behavior.
In this post, we extended the tiny model checker with blocking and waiting
via AWAIT and simple deadlock detection as a first-class outcome.
This allows us to model real synchronization algorithms, such as ticket locks, and to detect both safety violations and global blocking.
However, the execution counts already hint at the next challenge: most explored schedules are redundant. In the next post, we will tackle this problem directly by introducing partial-order reduction, using a technique called sleep sets to drastically reduce the number of executions without losing bug-finding power.
If you want to try out the code yourself, you can download tmc-part3.scm and tmc-part3-ticketlock.scm.
See also other parts of this series. Feedback is welcome – please send a message to my public inbox: ~db7/public-inbox@lists.sr.ht (Archive)
© 2024-2026 db7 — licensed under CC BY 4.0.