The first version of Kairos was a loop and a prayer. Every minute a worker woke up, read the "scheduled" table, picked the jobs whose run_at had passed, and tried to dispatch them. If anything in that dispatch threw — a network blip, a contract change, a Redis hiccup — the row stayed status='scheduled' and got retried next tick. For a long time that worked.
Then I wrote a second feature. Then a third. And the loop started to lie.
A scheduler that "mostly runs" but silently drops events is worse than one that halts loudly. Silent failure is a liability; loud failure is just a bug.
The problem wasn't the cron. It was the absence of a contract between "I scheduled this" and "this ran." Once I added a second dispatch path — a webhook retry, a manual trigger from the UI — the implicit contract broke. Two code paths, one shared status column, no coordination.
What a state machine bought
Replacing the loop with an explicit FSM forced me to enumerate every transition: SCHEDULED → RUNNING → DONE, RUNNING → FAILED, FAILED → SCHEDULED (with backoff), SCHEDULED → EXPIRED (if TTL exceeded). Suddenly the "silently dropped job" had a name. It was an unhandled transition from SCHEDULED when the TTL window closed without a runner picking it up.
The FSM didn't fix the bug immediately. But it made the bug findable. When I added the off-by-one check to the retry window, the failing test pointed at a specific transition rather than a generic "job didn't run" failure. That's the real value: not correctness, but legibility.
The implementation
The FSM lives in a single Python class. States are a StrEnum. Transitions are validated in a can_transition() method before any database write. The HTTP layer never touches status directly — it calls job.transition(target_state) and catches InvalidTransition.
Postgres advisory locks handle the concurrent runner problem. Before picking up a job, a runner acquires an advisory lock keyed to the job ID. If another runner holds it, the query returns nothing. No double-dispatch, no Redis, no distributed lock service — just a feature Postgres has shipped in core since 8.2.
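A sketch of what that claim query can look like, assuming a jobs table shaped like the original "scheduled" table; every name here is illustrative:

```sql
-- Claim one due job. pg_try_advisory_lock() returns false when another
-- runner already holds the lock for that ID, so the row is filtered out.
SELECT id
FROM jobs
WHERE status = 'scheduled'
  AND run_at <= now()
  AND pg_try_advisory_lock(id)
LIMIT 1;
```

Advisory locks are session-scoped, so the runner releases the lock with `pg_advisory_unlock(id)` once the job reaches a terminal state. One caveat: the planner may evaluate `pg_try_advisory_lock()` on candidate rows that LIMIT later discards, which can leave stray locks, so it is worth checking the query plan or forcing the lock call to run last.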
What I'd do differently
Write the FSM before the HTTP layer. I let endpoint shapes drive the state design early on, which meant the FSM had to accommodate decisions already baked into the API. The rewrite inverts this — the state machine is the specification, and the HTTP layer just exposes it.