Blocking, waiting, and await loops.
In the previous posts of this series, we built a tiny model checker capable of systematically exploring all possible interleavings of concurrent tasks. In Part 1, tasks had no shared state. In Part 2, we introduced shared memory and safety checking, which already allowed us to find real concurrency bugs.
So far, however, all tasks were always runnable!
In real concurrent programs, this assumption does not hold. Threads routinely wait for conditions to become true, spin until shared state changes, or block on locks and queues. In this post, we take the next step and introduce blocking and waiting.
Up to now, the scheduler had a very simple job: at each scheduling point, it could choose any task that had not yet terminated. There were no blocked tasks, no disabled transitions, and no notion of progress dependencies between tasks.
Once we introduce waiting, this changes fundamentally.
A task that is waiting cannot make progress on its own. If such a task were scheduled repeatedly, it would simply re-execute the same waiting step forever, making no progress and contributing nothing new to the exploration. In other words, the scheduler would be exploring executions that are trivially equivalent and uninformative.
To avoid this, we must distinguish between runnable tasks and blocked tasks, and only schedule tasks when they are able to make progress.
This raises several questions: how does a task indicate that it is waiting, how does it become runnable again, and what happens when no runnable task remains?
To keep the model executable and minimal, we start with the most common form of waiting in low-level concurrent code: busy-waiting on shared state.
A very common pattern in concurrent algorithms is the await loop:
while (x == 0)
    ;
In a real system, this loop may involve yielding, sleeping, or blocking
primitives. Typically x will be volatile or (ideally) have an atomic type
such as atomic_int. In our tiny model checker, however, we want to capture
the logical effect:
“This task cannot proceed until some shared state changes.”
To express this directly, we introduce a new DSL primitive:
(define (AWAIT loc)
  (await-location loc))
Intuitively, (AWAIT loc) suspends the current task and removes it from the
runnable set. Execution resumes only after another task writes to
location loc.
Unlike READ or WRITE, AWAIT is not merely an interleaving point: it blocks
the task entirely until it is explicitly re-enabled.
To support this, the WRITE operation is reformulated so that
it signals tasks waiting on the written location:
(define (WRITE loc val)
  (YIELD)
  (write-location loc val)
  (signal-location loc))
The signal-location operation moves all tasks blocked on loc back into the
runnable task queue.
With await-location and signal-location, the scheduler can now distinguish
between runnable and blocked tasks using a small monitor structure:
;; *loc-waiters* maps a location (as a string) to the list of
;; continuations of tasks currently blocked on that location.
(define *loc-waiters* (make-hash-table equal? string-hash 1024))

(define (await-location loc)
  (let* ((loc (location->string loc))
         (waiters (hash-table-ref/default *loc-waiters* loc '())))
    (call/cc
     (lambda (cc)
       ;; record the current continuation as a waiter on loc ...
       (hash-table-set! *loc-waiters* loc (cons cc waiters))
       ;; ... then hand control back to the scheduler; if no runnable
       ;; task is left, the whole system is blocked
       (if (no-tasks?)
           (error "AWAIT: termination violation")
           (resume))))))

(define (signal-location loc)
  (let* ((loc (location->string loc))
         (waiters (hash-table-ref/default *loc-waiters* loc '())))
    ;; move every waiter back into the runnable task queue
    (for-each push-task waiters)
    (hash-table-delete! *loc-waiters* loc)))
The monitor maps locations to waiting tasks. When a task executes (AWAIT loc), its current continuation is captured and stored in the waiter list for loc, and control passes back to the scheduler, which resumes some other runnable task. Every WRITE operation signals the written location, pushing any tasks waiting on it back into the runnable task queue.
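As a small usage sketch (hypothetical, not part of the ticket-lock example below), one task can block on a flag until another task sets it; the structure mirrors the wait-grant loop we will see shortly:
(INIT 'flag 0)

(define (waiter)
  (if (= 1 (READ 'flag))
      'done
      (begin
        (AWAIT 'flag)          ; block until someone writes 'flag
        (waiter))))            ; then re-check the value

(define (setter)
  (WRITE 'flag 1))             ; this WRITE signals 'flag and wakes the waiter

(SPAWN waiter)
(SPAWN setter)
(EXPLORE (lambda () (= (READ 'flag) 1)))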
This introduces a new kind of terminal execution. If the current task attempts
to block and there are no runnable tasks left (that is, (no-tasks?) returns
true), then the system has reached a state of global blocking.
In this checker, we treat global blocking as an error and report it as a counterexample.
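For contrast, here is a deliberately broken sketch (again hypothetical) in which every task blocks on a location that nobody ever writes; exploring it should immediately hit the termination violation:
(INIT 'never 0)

(define (stuck)
  (AWAIT 'never))              ; nobody ever writes 'never

(SPAWN stuck)
(SPAWN stuck)
(EXPLORE (lambda () #t))       ; exploration aborts with "AWAIT: termination violation"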
We now consider a classic synchronization primitive: a ticket lock. The lock uses two shared counters:
ticket: the next ticket to hand out,
grant: the ticket currently being served.

;; ticket lock implementation ----
(INIT 'grant 0)
(INIT 'ticket 0)
(define (take-ticket)
  (ATOMIC
   (let* ((tck (+ 1 (READ 'ticket))))
     (WRITE 'ticket tck)
     tck)))

(define (wait-grant t)
  (if (= t (READ 'grant))
      t
      (begin
        (AWAIT 'grant)
        (wait-grant t))))

(define (release t)
  (WRITE 'grant (+ t 1)))

(define (acquire)
  (wait-grant (take-ticket)))
We left a bug in this code on purpose. Can you spot it? We will see shortly how the checker detects it.
The client code that exercises this lock is straightforward. A task acquires the
lock, saving the ticket t, then reads location x, writes back the
incremented value, and finally releases the ticket t. Any modification inside
the critical section would work equally well.
(INIT 'x 0)
(define (task)
  (let ((t (acquire)))
    (let ((a (READ 'x)))
      (WRITE 'x (+ a 1))
      (release t))))
Update: Previously, I had t and a defined in the same let block without realizing that Scheme implementations are free to decide the order in which the bindings of a let are evaluated (and their side effects take place). To ensure that the acquire of the lock occurs before the READ of x, I have now split their definitions into two let blocks.
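For reference, the problematic shape was roughly the following (a reconstruction; the exact earlier code is not reproduced here):
;; Problematic: a single let may evaluate its bindings in any order,
;; so (READ 'x) could happen before (acquire), i.e. outside the lock.
(define (task-broken)
  (let ((t (acquire))
        (a (READ 'x)))
    (WRITE 'x (+ a 1))
    (release t)))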
We spawn three tasks and check that the critical section executes three times by
inspecting the final value of x.
(SPAWN task)
(SPAWN task)
(SPAWN task)
(EXPLORE
 (lambda () (= (READ 'x) 3)))
Running this example (see the files tmc-part3.scm and tmc-part3-ticketlock.scm), the checker quickly finds an execution where all tasks become blocked:
Error: AWAIT: termination violation
The issue is that take-ticket mistakenly returns the new ticket value rather
than the old one. As a result, the lock can skip a waiter, leaving all remaining
tasks blocked on a grant value that will never be produced.
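Concretely, with grant initialized to 0 and the buggy take-ticket handing out tickets starting at 1, no task can ever pass wait-grant. One schedule, sketched as comments, looks like this:
;; Illustrative schedule with the buggy take-ticket (three tasks):
;;   T1: take-ticket -> reads ticket=0, writes ticket=1, returns 1
;;   T2: take-ticket -> reads ticket=1, writes ticket=2, returns 2
;;   T3: take-ticket -> reads ticket=2, writes ticket=3, returns 3
;;   T1: wait-grant 1 -> grant=0, so T1 AWAITs 'grant
;;   T2: wait-grant 2 -> grant=0, so T2 AWAITs 'grant
;;   T3: wait-grant 3 -> grant=0, so T3 AWAITs 'grant
;; Ticket 0 is never handed out, grant never advances past 0,
;; and the checker reports the termination violation.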
This is our first example where an execution does not end because tasks finish,
but because the system reaches a state with no runnable tasks. In Parts 1 and 2,
this could not happen: every task was always runnable. With AWAIT, global
blocking becomes a meaningful terminal outcome.
The bug is subtle but fundamental: returning the wrong ticket value breaks the hand-off discipline of the lock and can cause the system to deadlock.
Here is a simple fix:
(define (take-ticket)
  (ATOMIC
   (let* ((tck (READ 'ticket)))
     (WRITE 'ticket (+ 1 tck))
     tck)))
With this change, each task receives a unique ticket, and grant advances
monotonically to serve tickets in order.
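One schedule with the corrected code, again sketched as comments, shows the intended hand-off:
;; Illustrative schedule with the corrected take-ticket:
;;   T1: take-ticket -> 0; grant=0 matches, so T1 acquires immediately
;;   T2: take-ticket -> 1; grant=0 /= 1, so T2 AWAITs 'grant
;;   T1: increments x, then (release 0) -> WRITE 'grant 1, waking T2
;;   T2: re-reads grant=1, which matches its ticket, and enters the
;;       critical section; releasing then writes grant=2 for the next ticket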
With the corrected ticket lock, the checker no longer finds deadlocks, and the postcondition holds. However, even for this tiny example, the number of explored executions grows rapidly.
This explosion is not specific to locks. Re-running the fixed SPSC ring buffer example from Part 2 with the same checker yields 385,253 executions.
Systematic exploration is powerful: it turns rare bugs into concrete counterexamples. At the same time, it explores many executions that are equivalent in observable behavior.
In this post, we extended the tiny model checker with blocking and waiting
via AWAIT and simple deadlock detection as a first-class outcome.
This allows us to model real synchronization algorithms, such as ticket locks, and to detect both safety violations and global blocking.
However, the execution counts already hint at the next challenge: most explored schedules are redundant. In the next post, we will tackle this problem directly by introducing partial-order reduction, using a technique called sleep sets to drastically reduce the number of executions without losing bug-finding power.
If you want to try out the code yourself, you can download tmc-part3.scm and tmc-part3-ticketlock.scm.
See also other parts of this series. Feedback is welcome – please send a message to my public inbox: ~db7/public-inbox@lists.sr.ht (Archive)
© 2024-2026 db7 — licensed under CC BY 4.0.