Lesson 06 · Engineering case study

The vitest-orphan-leak

A real incident from building this fusion. Two harmless-looking facts combined into 16 immortal processes pinning the machine. The fix is a small classic of defensive systems engineering — and a lesson in fixing the interaction, not just one cause.

The incident — 2026-06-23

16 orphaned node (vitest) workers. PPID = 1 (reparented to init), each at ~90–96% CPU, alive for ~11 hours, cwd in this repo. They had been spawned about one every 2 minutes over ~30 minutes by repeated host pnpm test runs.

16 orphaned workers (PPID=1)
~1550% total CPU (load avg 34)
~11h alive, spinning

The two causes — neither fatal alone

This is the heart of the lesson. The leak needed both of these to be true at once:

Cause | What it is
1 · A hanging test | A test that hangs with no teardown: a server/socket/MCP left open, a setInterval never cleared, an unresolved promise. On its own: annoying, but you'd notice and Ctrl-C it.
2 · Parent killed without killing the tree | The parent process is killed (by a Bash/session timeout) without killing the worker tree. Vitest's tinypool forks then reparent to PID 1 and keep spinning. On its own (with tests that exit): harmless.

Why "either alone is survivable" matters. A hung test that you kill cleanly takes its workers with it. An orphaned worker that was going to exit anyway exits. It's the combination — a worker that will never exit on its own, detached from any parent that would reap it — that produces an immortal CPU-pinning process. Robust fixes target the interaction.

Diagram, BEFORE (parent alive): pnpm test (parent) → vitest main → fork, fork, fork. One tree: killing the parent reaps all.
Diagram, AFTER (parent killed, tree not): parent ✗ killed; forks at 96%, 94%, 90% CPU are reparented to PID 1 (init). No one reaps them, and because they hang, they spin forever.
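
A minimal sketch of cause 1 and its usual remedy, assuming a test that opens an HTTP server and a polling interval. The helper name `openResources` is hypothetical and does not come from this repo. The point is only that every handle a test opens needs a matching close in teardown; otherwise the event loop stays alive and the worker never exits on its own.

```javascript
const http = require('node:http');

// Hypothetical helper: opens the kinds of handles that keep a worker alive.
function openResources() {
  const server = http.createServer((req, res) => res.end('ok'));
  const timer = setInterval(() => {}, 1000); // keeps the event loop alive until cleared

  // Port 0 asks the OS for any free port.
  const ready = new Promise((resolve) => server.listen(0, '127.0.0.1', resolve));

  // Teardown: release every handle so the process can exit by itself.
  const close = async () => {
    clearInterval(timer);
    await ready; // make sure listen() finished before closing
    await new Promise((resolve, reject) =>
      server.close((err) => (err ? reject(err) : resolve())));
  };

  return { server, ready, close };
}
```

In a Vitest file, `close` would typically be called from an `afterEach` or `afterAll` hook. Skipping that call is exactly the "no teardown" half of cause 1.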

The fix — two layers, one per cause

A single patch wouldn't be robust. Harden only the config, and a hang the timeouts do not catch can still be orphaned; add only the wrapper, and a hang inside the group still pins a core until the wall-clock timeout fires. Both layers ship.

Layer 1 — make a hang fail, not hang (cause 1)

Bounded timeouts in vitest.config.ts (applied to both root projects via extends) turn an infinite hang into a bounded failure, and the forks pool isolates a stuck file in a child process Vitest force-kills on teardown — so a stuck file can't pin a core indefinitely.

// vitest.config.ts:20-27
// Anti-orphan hardening: a hung test (server/socket/MCP/interval/unresolved
// promise with no teardown) must FAIL on a bounded timeout, never hang a
// worker forever. The `forks` pool isolates hangs in child processes that
// Vitest force-kills on teardown, so a stuck file cannot pin a CPU core.
testTimeout: 15_000,
hookTimeout: 15_000,
teardownTimeout: 10_000,
pool: 'forks',
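
To illustrate what a bounded timeout buys, here is a minimal sketch of the idea in plain Node. It is not Vitest's actual implementation: a promise that never settles is raced against a timer, so the caller gets a rejection after a fixed budget instead of waiting forever. `withTimeout` is a hypothetical helper name.

```javascript
// Hypothetical helper: reject if `promise` has not settled within `ms`.
function withTimeout(promise, ms, label = 'operation') {
  let timer;
  const deadline = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`${label} timed out after ${ms}ms`)), ms);
  });
  // Whichever settles first wins; always clear the timer so it cannot
  // keep the process alive after the race is over.
  return Promise.race([promise, deadline]).finally(() => clearTimeout(timer));
}

// A "hung test": a promise that never resolves.
const hung = new Promise(() => {});

withTimeout(hung, 50, 'hung test').catch((err) => {
  console.log(err.message); // hung test timed out after 50ms
});
```

The race only stops the waiting. It cannot cancel whatever the hung code is still doing, which is why the config pairs the timeouts with a pool whose workers can be killed from outside.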

Layer 2 — make orphaning impossible (cause 2)

scripts/safe-test.mjs (exposed as pnpm test:safe) runs the suite in its own process group, under a hard wall-clock timeout, then kills the whole group — not just the parent PID — and sweeps any stray vitest as a last-resort net. Three techniques, each load-bearing:

// scripts/safe-test.mjs:34-44 — detached group + negative-PID kill
// detached:true => POSIX setsid => the child leads a NEW process group, so
// `kill(-pid)` reaches every descendant (vitest main + all tinypool forks).
const child = spawn(bin, args, { stdio: 'inherit', detached: true });

const killGroup = (signal) => {
  try { process.kill(-child.pid, signal); }   // NEGATIVE pid = the whole group
  catch { /* group already gone */ }
};
// scripts/safe-test.mjs:26-32 — the last-resort sweep
const sweep = () => {
  try { execFileSync('pkill', ['-9', '-f', 'vitest'], { stdio: 'ignore' }); }
  catch { /* nothing left to kill — pkill exits non-zero when no match */ }
};
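
The excerpts above show the group kill and the sweep, but not the hard wall-clock timeout. A minimal sketch of how a deadline and the group kill could fit together, assuming a POSIX host. `runWithDeadline`, the `stdio: 'ignore'` setting, and the 500 ms budget are illustrative choices and are not taken from scripts/safe-test.mjs.

```javascript
const { spawn } = require('node:child_process');

// Hypothetical wrapper: run a command in its own process group and
// SIGKILL the whole group if it outlives `timeoutMs`.
function runWithDeadline(bin, args, timeoutMs) {
  return new Promise((resolve, reject) => {
    // detached: true makes the child the leader of a new process group.
    const child = spawn(bin, args, { stdio: 'ignore', detached: true });
    let timedOut = false;

    const killGroup = (signal) => {
      try { process.kill(-child.pid, signal); } // negative pid = whole group (POSIX only)
      catch { /* group already gone */ }
    };

    const timer = setTimeout(() => {
      timedOut = true;
      killGroup('SIGKILL');
    }, timeoutMs);

    child.on('error', (err) => { clearTimeout(timer); reject(err); });
    child.on('exit', (code, signal) => {
      clearTimeout(timer);
      killGroup('SIGKILL'); // take down any descendant the child left behind
      resolve({ code, signal, timedOut });
    });
  });
}

// A child that would otherwise run forever is cut off at the deadline.
runWithDeadline(process.execPath, ['-e', 'setInterval(() => {}, 1000)'], 500)
  .then((result) => console.log(result));
```

The group kill also runs on a normal exit, so a fork that outlives the main process is still taken down rather than reparented.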

A third belt-and-braces measure: .factory/run.ts sweeps stray host vitest at the start and end of each iteration — so an automated loop never inherits a previous run's leak.
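
.factory/run.ts is not shown in this lesson, so the following is only a sketch of the pattern it describes: sweep before and after every iteration, including when the iteration throws. `withSweep` is a hypothetical name, and `sweep` stands in for a function like the one in the excerpt above.

```javascript
// Hypothetical loop helper: bracket one iteration with sweeps.
async function withSweep(sweep, iteration) {
  sweep(); // start clean: never inherit a previous run's leak
  try {
    return await iteration();
  } finally {
    sweep(); // end clean, even if the iteration failed
  }
}
```

Putting the second sweep in `finally` matters because a failed or interrupted iteration is the case most likely to leave workers behind.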

How it was verified: the Proof Gate

"It should be fixed" is not proof. The fix was declared done only against an observable boundary: run the full suite via pnpm test:safe and see 565 tests green, then immediately confirm that pgrep -f vitest returns empty. Green alone isn't enough; the empty process table is the part that proves no worker leaked.
# the verification, conceptually
pnpm test:safe            # bounded run in its own group → 565 passed
pgrep -f vitest           # → (no output): nothing leaked. THIS is the proof.
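
The same check can be scripted so that a loop fails loudly instead of relying on someone reading the terminal. This is a sketch under assumptions: `parsePids`, `findLeaked`, and `assertNoLeak` are hypothetical names, and it assumes `pgrep` is installed and exits with status 1 when nothing matches.

```javascript
const { execFileSync } = require('node:child_process');

// Turn pgrep's stdout (one PID per line) into a list of numbers.
function parsePids(stdout) {
  return stdout.split('\n').map((line) => line.trim()).filter(Boolean).map(Number);
}

// Hypothetical lookup: PIDs whose full command line matches `pattern`.
function findLeaked(pattern) {
  try {
    const out = execFileSync('pgrep', ['-f', pattern], { encoding: 'utf8' });
    // Exclude this script in case its own command line matches the pattern.
    return parsePids(out).filter((pid) => pid !== process.pid);
  } catch (err) {
    if (err.status === 1) return []; // no match: the process table is clean
    throw err; // pgrep missing or failed: do not mistake that for "clean"
  }
}

// Hypothetical gate: throw (and so fail the run) if anything survived.
function assertNoLeak(pattern = 'vitest') {
  const leaked = findLeaked(pattern);
  if (leaked.length > 0) {
    throw new Error(`leaked ${pattern} processes: ${leaked.join(', ')}`);
  }
}
```

Rethrowing on any status other than 1 keeps a missing or broken `pgrep` from being read as an empty process table.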

The operating rules that fell out of it

Rule | Why
Never run raw pnpm -w test in a loop or hand it to an automated builder; use pnpm test:safe. | The raw command has no group-kill; a single hang in a loop reproduces the leak.
After each loop iteration, sweep: pgrep -f vitest | xargs -r kill -9. | A standing net even though the wrapper already does it.
Always vitest run, never vitest / watch mode. | Watch mode is a long-lived process, the opposite of what you want in CI/loops.
Any subprocess an orchestrator spawns: {detached:true} + kill the negative PGID, never a bare PID; always a hard timeout. | Generalizes the fix: the leak class is "a child tree outliving the killer."

1. Why did neither cause produce the leak on its own?
Answer: The immortal process is the intersection: a worker that will never exit on its own (the hang) detached from any parent that would reap it (the orphaning). That's why the fix has to address the interaction: two layers, one per cause.

2. What does process.kill(-child.pid, 'SIGKILL') do that process.kill(child.pid, …) would not?
Answer: Because the child was spawned detached:true (its own group via setsid), a negative PID reaches every descendant. A bare PID would kill only the vitest main and leave the forks to reparent, which is the exact bug.

3. Why isn't "the 565 tests pass" considered sufficient proof that the leak is fixed?
Answer: Green proves correctness; the empty process table proves no worker survived the run. The leak is a process-lifecycle bug, so the proof has to be observed at the process boundary. That is the Proof-Gate discipline.

Common confusions

"Just add a longer timeout." A longer timeout only makes the hang take longer to fail; it does nothing about orphaning. The two-layer fix is deliberate: timeouts handle the hang, and the group kill plus sweep handle the orphaning.
"forks vs threads is just performance." Here it's also safety: the forks pool isolates a hang in a child process Vitest force-kills on teardown, so a stuck file can't pin a core the way an un-killable thread might.