Samridh Limbu · Software Engineer

One in 120 jobs silently expired. A forensic walkthrough of the bug that made me rewrite my test pyramid.

For three weeks, roughly one in 120 scheduled jobs expired quietly. No error. No alert. The job would enter the retry window, get checked, and somehow never qualify for re-dispatch. The finite-state machine (FSM) moved it to EXPIRED on schedule. The logs were clean.

I only caught it because a user noticed a missing webhook delivery and filed a bug. By then the job had already been marked expired, and the window to re-run it had closed.

Silent failure is a liability; loud failure is just a bug. The FSM made this findable — without it, the symptom would have read as random noise.

The root cause

The retry window check used a j < high - 1 guard instead of j < high, so the last slot of the window (j = high - 1) was always skipped. For jobs with exactly one retry slot, this meant no retry was ever dispatched: the only valid slot was the one being excluded.
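A minimal sketch of the two guards side by side may make the failure easier to see. Only the j < high - 1 and j < high comparisons come from the description above. The function names, the low bound, and the half-open window [low, high) are assumptions made for illustration, not the production code.

```python
def qualifies_buggy(j: int, low: int, high: int) -> bool:
    """Guard with the off-by-one: the upper bound stops one slot early."""
    # Hypothetical shape of the check; `low` is an assumed lower bound.
    return low <= j < high - 1


def qualifies_fixed(j: int, low: int, high: int) -> bool:
    """Corrected guard: every slot in the half-open window [low, high) qualifies."""
    return low <= j < high


if __name__ == "__main__":
    # A job with exactly one retry slot: the window is [3, 4), so j = 3
    # is the only slot that can ever qualify.
    low, high = 3, 4
    print(qualifies_buggy(3, low, high))  # False: the only valid slot is excluded
    print(qualifies_fixed(3, low, high))  # True
```

With a wider window the buggy guard still accepts most slots, which fits a failure that only surfaced in a small fraction of jobs.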

The condition survived code review because the variable names were short, the logic was dense, and the test suite only covered the happy path and a clear-failure case. Nobody tested the boundary.

The fix

A one-line change: j < high - 1 → j < high. The fix took thirty seconds. The investigation took four hours. The test I wrote afterward (boundary cases at high - 2, high - 1, and high) took twenty minutes.
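The boundary test could look something like the sketch below. It uses plain assert statements so it runs without a test framework. The name in_retry_window, the low parameter, and the concrete window values are hypothetical; only the three probe points (high - 2, high - 1, high) come from the text.

```python
def in_retry_window(j: int, low: int, high: int) -> bool:
    """Assumed form of the corrected check: half-open window [low, high)."""
    return low <= j < high


def test_upper_boundary() -> None:
    low, high = 0, 5
    assert in_retry_window(high - 2, low, high)   # interior slot
    assert in_retry_window(high - 1, low, high)   # last valid slot, the one the bug skipped
    assert not in_retry_window(high, low, high)   # first slot past the window


def test_single_slot_window() -> None:
    # The case that never retried: a window holding exactly one slot.
    low, high = 7, 8
    assert in_retry_window(low, low, high)
    assert not in_retry_window(high, low, high)


if __name__ == "__main__":
    test_upper_boundary()
    test_single_slot_window()
    print("boundary tests passed")
```

Run against the original j < high - 1 guard, the high - 1 assertion and the single-slot assertion would both fail, which is the signal the old suite never produced.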

What I changed afterward

I rewrote my test pyramid. Before this bug, my tests were mostly happy-path integration tests with a handful of unit tests for obviously complex logic. After it, I added a rule: every function that deals with numeric boundaries gets explicit tests at n-1, n, and n+1. Every transition function gets tests for invalid inputs.
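For the transition rule, a sketch under stated assumptions: EXPIRED is the only state named in this write-up, so the other states, the ALLOWED table, and the transition function below are hypothetical stand-ins for whatever the real FSM defines. The point is the shape of the tests, which probe inputs the function should reject rather than only the ones it should accept.

```python
from enum import Enum


class State(Enum):
    # Only EXPIRED appears in the write-up; the rest are assumed for illustration.
    PENDING = "pending"
    RETRYING = "retrying"
    DISPATCHED = "dispatched"
    EXPIRED = "expired"


# Hypothetical transition table: which targets each state may move to.
ALLOWED = {
    State.PENDING: {State.DISPATCHED, State.RETRYING},
    State.RETRYING: {State.DISPATCHED, State.EXPIRED},
    State.DISPATCHED: set(),
    State.EXPIRED: set(),
}


def transition(current: State, target: State) -> State:
    """Return the new state, or fail loudly on anything invalid."""
    if not isinstance(current, State) or not isinstance(target, State):
        raise TypeError("current and target must be State members")
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current.name} -> {target.name}")
    return target


def test_invalid_inputs() -> None:
    # A terminal state must not be revived.
    try:
        transition(State.EXPIRED, State.RETRYING)
    except ValueError:
        pass
    else:
        raise AssertionError("EXPIRED -> RETRYING should be rejected")

    # A raw string is not a State, even if it matches a member's value.
    try:
        transition("expired", State.RETRYING)
    except TypeError:
        pass
    else:
        raise AssertionError("non-State input should be rejected")


if __name__ == "__main__":
    assert transition(State.RETRYING, State.EXPIRED) is State.EXPIRED
    test_invalid_inputs()
    print("transition tests passed")
```

Raising on bad input instead of returning the current state unchanged keeps the failure loud, in line with the point above about silent failures.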

The coverage number went from 71% to 84%. More importantly, the kind of coverage changed. I stopped measuring coverage as "lines hit" and started measuring it as "boundaries tested."
