❯ cd projects/kairos
Kairos
● in rewriteScheduling that reads intent. FastAPI backend with a state-machine core, now rewriting the frontend from scratch in Next.js 16.
fig. 01 — 20s demo
38ms
p95 latency
12.4k/day
scheduled ops
84%
test coverage
99.96%
uptime (90d)
Context
Solo, open source, 2025–ongoing (9 months active). I own design, backend, and API surface. The core problem: writing a scheduler that fails loudly instead of dropping events.Timeline
Jul 2025
First commit
Sketch of a cron alternative. Just cron with retries, honestly.
Aug 2025
State machine
Replaced the "run and hope" loop with an explicit FSM. Everything downstream cleaner.
Oct 2025
Postgres over Redis
Migrated off Redis queues. Advisory locks, single source of truth.
Jan 2026
The off-by-one
1 in ~120 jobs expired quietly. Wrote the boundary tests I should have written first.
Mar 2026
Frontend rewrite
Started Next.js 16 frontend from scratch. Backend stays.
Apr 2026
Current
p95 38ms, 84% coverage, 99.96% uptime. Quiet.
Screenshots
[ dashboard · main view ]
[ job detail ]
[ fsm diagram ]
fig. 02 — ui + state machine
Key technical decisions
01
state machine › cron
A scheduler that "mostly runs" but silently drops events is worse than one that halts loudly. The FSM made failure modes enumerable.
02
postgres + advisory locks › redis queues
Single source of truth beats 2× throughput at my scale. Transactions were worth keeping.
03
fastapi › django
~40 endpoints, Pydantic for free contract tests, no admin needed.
04
rewrite frontend › rewrite everything
The bugs are in the UI contract, not the scheduling logic. Rewriting both would have been a different project.
The off-by-one
kairos/scheduler.pypy
1def within_retry_window(jobs, high):2 # bug: j < high - 1 drops the last slot3 for j in jobs:4- if j < high - 1:5+ if j < high:6 yield j
One in ~120 scheduled jobs quietly expired instead of running. Lesson: write boundary tests before happy-path tests. The FSM made this findable; without it the bug would've read as noise.
What I'd do differently
Write the state machine before the HTTP layer, not alongside it. I pinned myself into endpoint shapes early that the FSM had to work around — the rewrite undoes most of that.