Lesson 06 · Engineering case study

The vitest-orphan-leak

A real incident from building this fusion. Two harmless-looking facts combined into 16 immortal processes pinning the machine. The fix is a small classic of defensive systems engineering — and a lesson in fixing the interaction, not just one cause.

The incident — 2026-06-23

16 orphaned node (vitest) workers. PPID = 1 (reparented to init), each at ~90–96% CPU, alive for ~11 hours, cwd in this repo. They had been spawned about one every 2 minutes over ~30 minutes by repeated host pnpm test runs.

16 orphaned workers (PPID=1) · ~1550% total CPU (load avg 34) · ~11h alive, spinning
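
To make the symptom concrete, here is a minimal sketch of how such orphans could be spotted from `ps` output. It assumes orphans reparent to PID 1, as they did in this incident. The helper names `findOrphans` and `listOrphans` and the sample paths in the usage note are hypothetical, not taken from the repo.

```javascript
const { execFileSync } = require('node:child_process');

// Hypothetical helper: parse the output of `ps -eo pid,ppid,pcpu,etime,command`
// and keep only processes that are parented to PID 1 and whose command line
// contains `needle`.
function findOrphans(psOutput, needle = 'vitest') {
  return psOutput
    .split('\n')
    // The header row and blank lines do not start with digits, so they fail to match.
    .map((line) => line.trim().match(/^(\d+)\s+(\d+)\s+([\d.]+)\s+(\S+)\s+(.*)$/))
    .filter(Boolean)
    .map(([, pid, ppid, cpu, etime, command]) => ({
      pid: Number(pid),
      ppid: Number(ppid),
      cpu: Number(cpu),
      etime,
      command,
    }))
    .filter((p) => p.ppid === 1 && p.command.includes(needle));
}

// Hypothetical convenience wrapper that reads the live process table (POSIX only).
function listOrphans(needle = 'vitest') {
  const out = execFileSync('ps', ['-eo', 'pid,ppid,pcpu,etime,command'], {
    encoding: 'utf8',
  });
  return findOrphans(out, needle);
}
```

Calling `listOrphans()` on a healthy machine should return an empty array; in the incident it would have listed all 16 workers.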

The two causes — neither fatal alone

This is the heart of the lesson. The leak needed both of these to be true at once:

Cause	What it is
1 · A hanging test	A test that hangs with no teardown — a server/socket/MCP left open, a `setInterval` never cleared, an unresolved promise. On its own: annoying, but you'd notice and Ctrl-C it.
2 · Parent killed without killing the tree	The parent process is killed (for example by a Bash/session timeout) without its worker tree being killed with it. Vitest's `tinypool` forks then reparent to PID 1 and keep spinning. On its own (with tests that exit): harmless.

Why "either alone is survivable" matters. A hung test that you kill cleanly takes its workers with it. An orphaned worker that was going to exit anyway exits. It's the combination — a worker that will never exit on its own, detached from any parent that would reap it — that produces an immortal CPU-pinning process. Robust fixes target the interaction.
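
Cause 1 is easiest to see with a resource that owns a timer. The sketch below is illustrative only: the `Heartbeat` class is hypothetical and stands in for any server, socket, or interval a test might open. While the interval is live, Node's event loop has pending work and the process will not exit on its own; `stop()` is the teardown the hanging test was missing.

```javascript
// Hypothetical resource that keeps the event loop alive until it is stopped.
class Heartbeat {
  constructor(intervalMs) {
    this.intervalMs = intervalMs;
    this.timer = null;
    this.ticks = 0;
  }

  get running() {
    return this.timer !== null;
  }

  start() {
    if (this.timer === null) {
      // An uncleared interval is enough to keep a worker process from exiting.
      this.timer = setInterval(() => { this.ticks += 1; }, this.intervalMs);
    }
  }

  // Teardown: idempotent, so it is safe to call from a cleanup hook
  // even if the test never reached start().
  stop() {
    if (this.timer !== null) {
      clearInterval(this.timer);
      this.timer = null;
    }
  }
}
```

In a Vitest file the matching teardown would be an `afterEach(() => heartbeat.stop())` hook, so the resource is released even when an assertion fails.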

The fix — two layers, one per cause

A single patch wouldn't be robust: harden the config and a different hang still orphans; add the wrapper and a hang inside the group still pins a core until the wall-clock timeout. Both layers ship.

Layer 1 — make a hang fail, not hang (cause 1)

Bounded timeouts in vitest.config.ts (applied to both root projects via extends) turn an infinite hang into a bounded failure, and the forks pool isolates a stuck file in a child process Vitest force-kills on teardown — so a stuck file can't pin a core indefinitely.

// vitest.config.ts:20-27
// Anti-orphan hardening: a hung test (server/socket/MCP/interval/unresolved
// promise with no teardown) must FAIL on a bounded timeout, never hang a
// worker forever. The `forks` pool isolates hangs in child processes that
// Vitest force-kills on teardown, so a stuck file cannot pin a CPU core.
testTimeout: 15_000,
hookTimeout: 15_000,
teardownTimeout: 10_000,
pool: 'forks',
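
The idea behind a bounded timeout can be sketched in a few lines. This is not how Vitest implements `testTimeout`; it is a hypothetical `withTimeout` helper that shows the principle: race the work against a timer so that a promise that never settles becomes a rejection.

```javascript
// Hypothetical helper: reject if `promise` has not settled within `ms` milliseconds.
function withTimeout(promise, ms, label = 'operation') {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`${label} timed out after ${ms} ms`)),
      ms,
    );
  });
  // Whichever settles first wins; always clear the timer so it cannot
  // keep the process alive after the race is decided.
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}
```

Note what the sketch does not do: it bounds the wait, but the underlying work keeps running. That gap is why the config pairs timeouts with the forks pool, where the stuck work lives in a child process that can be killed.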

Layer 2 — make orphaning impossible (cause 2)

scripts/safe-test.mjs (exposed as pnpm test:safe) runs the suite in its own process group, under a hard wall-clock timeout, then kills the whole group — not just the parent PID — and sweeps any stray vitest as a last-resort net. Three techniques, each load-bearing:

// scripts/safe-test.mjs:34-44 — detached group + negative-PID kill
// detached:true => POSIX setsid => the child leads a NEW process group, so
// `kill(-pid)` reaches every descendant (vitest main + all tinypool forks).
const child = spawn(bin, args, { stdio: 'inherit', detached: true });

const killGroup = (signal) => {
  try { process.kill(-child.pid, signal); }   // NEGATIVE pid = the whole group
  catch { /* group already gone */ }
};

// scripts/safe-test.mjs:26-32 — the last-resort sweep
const sweep = () => {
  try { execFileSync('pkill', ['-9', '-f', 'vitest'], { stdio: 'ignore' }); }
  catch { /* nothing left to kill — pkill exits non-zero when no match */ }
};

detached:true → the child leads a new process group (POSIX setsid), so a single signal can reach every descendant.
process.kill(-child.pid, …) → the negative PID signals the whole group, including tinypool forks — never a bare PID.
pkill -9 -f vitest sweep → catches anything that already escaped to PID 1 (the orphaning case), so leakage can't accumulate across runs.
A hard SAFE_TEST_TIMEOUT_MS (default 600000 = 10 min) bounds the whole run; a timeout exits 124. The same kill+sweep runs on normal exit, on SIGINT/SIGTERM, and on spawn error.
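
Put together, the group-kill and wall-clock timeout might look like the sketch below. It is an approximation under stated assumptions, not the contents of scripts/safe-test.mjs: the function name `runBounded` is hypothetical, it is POSIX-only (negative PIDs are not supported on Windows), and it deliberately omits the `pkill` sweep so the example cannot kill unrelated processes.

```javascript
const { spawn } = require('node:child_process');

// Hypothetical helper: run a command in its own process group under a hard
// timeout. Resolves with the exit code, or 124 if the timeout fired.
function runBounded(bin, args, timeoutMs) {
  return new Promise((resolve) => {
    // detached: true makes the child the leader of a new process group.
    const child = spawn(bin, args, { stdio: 'inherit', detached: true });
    let timedOut = false;

    const killGroup = (signal) => {
      // Negative PID = signal the whole group, not just the leader.
      try { process.kill(-child.pid, signal); }
      catch { /* group already gone, or spawn failed and there is no pid */ }
    };

    // If the wrapper itself is interrupted, take the whole tree down with it.
    const onSignal = () => killGroup('SIGKILL');
    process.on('SIGINT', onSignal);
    process.on('SIGTERM', onSignal);

    const timer = setTimeout(() => {
      timedOut = true;
      killGroup('SIGKILL');
    }, timeoutMs);

    const finish = (code) => {
      clearTimeout(timer);
      process.off('SIGINT', onSignal);
      process.off('SIGTERM', onSignal);
      killGroup('SIGKILL'); // descendants may outlive the group leader
      resolve(code);
    };

    child.on('error', () => finish(1));
    child.on('exit', (code) => finish(timedOut ? 124 : (code ?? 1)));
  });
}
```

A caller would then do something like `runBounded('pnpm', ['-w', 'test'], 600000).then((code) => process.exit(code))`. The group kill in `finish` runs even on a clean exit, because a fork that hung after the main process finished is exactly the case a bare-PID kill misses.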

A third belt-and-braces measure: .factory/run.ts sweeps stray host vitest at the start and end of each iteration — so an automated loop never inherits a previous run's leak.

How it was verified — the Proof Gate

"It should be fixed" is not proof. The fix was declared done only against an observable boundary: run the full suite via pnpm test:safe → 565 tests green, and immediately after, pgrep -f vitest returns empty. Green alone isn't enough; the empty process table is the part that proves no worker leaked.

# the verification, conceptually
pnpm test:safe            # bounded run in its own group → 565 passed
pgrep -f vitest           # → (no output): nothing leaked. THIS is the proof.
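
The same gate can be scripted so that it fails loudly rather than relying on someone reading terminal output. A minimal sketch, assuming `pgrep` is available on the host; the names `proofGate` and `runProofGate` are hypothetical.

```javascript
const { spawnSync } = require('node:child_process');

// Hypothetical pure check: the gate passes only if the suite was green
// AND the process table shows no surviving vitest process.
function proofGate(testExitCode, pgrepOutput) {
  const leaked = pgrepOutput
    .split('\n')
    .map((line) => line.trim())
    .filter(Boolean)
    .map(Number);
  return { ok: testExitCode === 0 && leaked.length === 0, leaked };
}

// Hypothetical driver: run the bounded suite, then inspect the process table.
function runProofGate() {
  const test = spawnSync('pnpm', ['test:safe'], { stdio: 'inherit' });
  const pgrep = spawnSync('pgrep', ['-f', 'vitest'], { encoding: 'utf8' });
  // A missing pgrep binary must not be mistaken for an empty process table.
  if (pgrep.error) throw pgrep.error;
  // pgrep exits non-zero when nothing matches; only its stdout matters here.
  return proofGate(test.status ?? 1, pgrep.stdout ?? '');
}
```

Keeping `proofGate` pure makes both failure modes easy to exercise: a red suite with a clean process table fails, and so does a green suite with a leaked PID.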

The operating rules that fell out of it

Rule	Why
Never run raw `pnpm -w test` in a loop or hand it to an automated builder — use `pnpm test:safe`.	The raw command has no group-kill; a single hang in a loop reproduces the leak.
After each loop iteration, sweep: `pgrep -f vitest | xargs -r kill -9`.	A standing net even though the wrapper already does it.
Always `vitest run`, never `vitest` / watch mode.	Watch mode is a long-lived process — the opposite of what you want in CI/loops.
Any subprocess an orchestrator spawns: `{detached:true}` + kill the negative PGID, never a bare PID; always a hard timeout.	Generalizes the fix: the leak class is "a child tree outliving the killer."

1. Why did neither cause produce the leak on its own?

Answer: The immortal process is the intersection: a worker that will never exit on its own (the hang), detached from any parent that would reap it (the orphaning). That's why the fix has to address the interaction: two layers, one per cause.

2. What does process.kill(-child.pid, 'SIGKILL') do that process.kill(child.pid, …) would not?

Answer: Because the child was spawned with detached:true (its own group via setsid), a negative PID reaches every descendant. A bare PID would kill only the vitest main process and leave the forks to reparent, which is the exact bug.

3. Why isn't "the 565 tests pass" considered sufficient proof that the leak is fixed?

Answer: Green proves correctness; the empty process table proves no worker survived the run. The leak is a process-lifecycle bug, so the proof has to be observed at the process boundary; that's the Proof-Gate discipline.

Common confusions

"Just add a longer timeout." A longer timeout makes the hang take longer to fail — it does nothing about orphaning. The two-layer fix is deliberate: timeouts handle the hang, the group-kill+sweep handle the orphaning.

"forks vs threads is just performance." Here it's also safety: the forks pool isolates a hang in a child process Vitest force-kills on teardown, so a stuck file can't pin a core the way an un-killable thread might.


Sources (all in the repo):
· Project memory alembic-vitest-orphan-leak — the incident record (16 workers, PPID=1, ~1550% CPU, ~11h), the two combined causes, the shipped fixes, and the mandatory loop rules.
· scripts/safe-test.mjs — the wrapper: spawn(...,{detached:true}), process.kill(-child.pid,…), pkill -9 -f vitest sweep, SAFE_TEST_TIMEOUT_MS, exit 124 on timeout (lines 18–83).
· vitest.config.ts — testTimeout/hookTimeout/teardownTimeout + pool:'forks' hardening (lines 13–28).
· .factory/run.ts — start-and-end-of-iteration sweep.
The "565 green + pgrep empty" verification is recorded in the project memory as the done-condition.