A real incident from building this fusion. Two harmless-looking facts combined into 16 immortal processes pinning the machine. The fix is a small classic of defensive systems engineering — and a lesson in fixing the interaction, not just one cause.
16 orphaned node (vitest) workers. PPID = 1 (reparented to init), each at ~90–96% CPU, alive for ~11 hours, cwd in this repo. They had been spawned about one every 2 minutes over ~30 minutes by repeated host pnpm test runs.
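
The signature above (PPID 1, command line mentioning vitest) can also be checked from a script. The following is a minimal sketch rather than anything in the repo: `parsePsLine`, `findOrphans`, and `listOrphanedVitest` are hypothetical names, and it assumes a POSIX `ps` that accepts `-eo pid=,ppid=,pcpu=,args=` and a file run as an ES module (`.mjs`).

```javascript
import { execFileSync } from 'node:child_process';

// Parse one line of `ps -eo pid=,ppid=,pcpu=,args=` into a record,
// or return null when the line does not have that shape.
function parsePsLine(line) {
  const match = line.trim().match(/^(\d+)\s+(\d+)\s+(\S+)\s+(.*)$/);
  if (!match) return null;
  const [, pid, ppid, cpu, args] = match;
  return { pid: Number(pid), ppid: Number(ppid), cpu: Number(cpu), args };
}

// Keep only processes reparented to init (PPID 1) whose command line matches.
function findOrphans(psOutput, pattern = /vitest/) {
  return psOutput
    .split('\n')
    .map(parsePsLine)
    .filter((p) => p !== null && p.ppid === 1 && pattern.test(p.args));
}

// Assumption: POSIX ps; the trailing "=" on each column suppresses the header.
function listOrphanedVitest() {
  const out = execFileSync('ps', ['-eo', 'pid=,ppid=,pcpu=,args='], {
    encoding: 'utf8',
  });
  return findOrphans(out);
}
```

Calling `listOrphanedVitest()` returns an array of `{ pid, ppid, cpu, args }` records; an empty array means nothing matching has been reparented to PID 1.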
This is the heart of the lesson. The leak needed both of these to be true at once:
| Cause | What it is |
|---|---|
| 1 · A hanging test | A test that hangs with no teardown — a server/socket/MCP left open, a setInterval never cleared, an unresolved promise. On its own: annoying, but you'd notice and Ctrl-C it. |
| 2 · Parent killed without killing the tree | The parent process (a Bash/session timeout) is killed without killing the worker tree. Vitest's tinypool forks then reparent to PID 1 and keep spinning. On its own (with tests that exit): harmless. |
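
Cause 1 in miniature: a timer that is started and never cleared keeps the Node event loop alive, so the process never exits on its own. The sketch below uses hypothetical names (`Heartbeat`, `withHeartbeat`) to show the teardown shape that prevents it; in a Vitest suite the cleanup would typically live in an `afterEach` or `afterAll` hook instead of a `finally`.

```javascript
// Hypothetical resource with the same shape as the leak: an interval that
// keeps the event loop alive until someone clears it.
class Heartbeat {
  #timer = null;
  ticks = 0;

  start(intervalMs = 1000) {
    if (this.#timer !== null) return; // already running
    this.#timer = setInterval(() => {
      this.ticks += 1;
    }, intervalMs);
  }

  stop() {
    if (this.#timer !== null) {
      clearInterval(this.#timer);
      this.#timer = null;
    }
  }

  get running() {
    return this.#timer !== null;
  }
}

// Run `body` with a started heartbeat and always stop it afterwards,
// even when `body` throws. Without the `finally`, a failing body would
// leave the interval alive and the process would never exit.
async function withHeartbeat(body) {
  const heartbeat = new Heartbeat();
  heartbeat.start();
  try {
    return await body(heartbeat);
  } finally {
    heartbeat.stop();
  }
}
```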
A single patch wouldn't be robust: harden the config and a different hang still orphans; add the wrapper and a hang inside the group still pins a core until the wall-clock timeout. Both layers ship.
Bounded timeouts in vitest.config.ts (applied to both root projects via extends) turn an infinite hang into a bounded failure, and the forks pool isolates a stuck file in a child process Vitest force-kills on teardown — so a stuck file can't pin a core indefinitely.

```typescript
// vitest.config.ts:20-27
// Anti-orphan hardening: a hung test (server/socket/MCP/interval/unresolved
// promise with no teardown) must FAIL on a bounded timeout, never hang a
// worker forever. The `forks` pool isolates hangs in child processes that
// Vitest force-kills on teardown, so a stuck file cannot pin a CPU core.
testTimeout: 15_000,
hookTimeout: 15_000,
teardownTimeout: 10_000,
pool: 'forks',
```

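
The idea of turning an infinite hang into a bounded failure can be shown without Vitest. This is a minimal sketch, not Vitest's implementation; `withTimeout` is a hypothetical helper. Note what it does not do: the timeout only makes the caller fail, it does not cancel the hung work, which is why the config above also relies on the forks pool being killable.

```javascript
// Reject if `promise` has not settled within `ms` milliseconds.
// The timer is always cleared so it cannot keep the process alive.
function withTimeout(promise, ms, label = 'operation') {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => {
      reject(new Error(`${label} timed out after ${ms} ms`));
    }, ms);
  });
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}
```

For example, `withTimeout(new Promise(() => {}), 15_000, 'hung test')` rejects after 15 seconds instead of waiting forever.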
scripts/safe-test.mjs (exposed as pnpm test:safe) runs the suite in its own process group, under a hard wall-clock timeout, then kills the whole group — not just the parent PID — and sweeps any stray vitest as a last-resort net. Three techniques, each load-bearing:

```javascript
// scripts/safe-test.mjs:34-44 (detached group + negative-PID kill)
// detached:true => POSIX setsid => the child leads a NEW process group, so
// `kill(-pid)` reaches every descendant (vitest main + all tinypool forks).
const child = spawn(bin, args, { stdio: 'inherit', detached: true });
const killGroup = (signal) => {
  try { process.kill(-child.pid, signal); } // NEGATIVE pid = the whole group
  catch { /* group already gone */ }
};
```

```javascript
// scripts/safe-test.mjs:26-32 (the last-resort sweep)
const sweep = () => {
  try { execFileSync('pkill', ['-9', '-f', 'vitest'], { stdio: 'ignore' }); }
  catch { /* nothing left to kill: pkill exits non-zero when no match */ }
};
```

- `detached:true` → the child leads a new process group (POSIX setsid), so a single signal can reach every descendant.
- `process.kill(-child.pid, …)` → the negative PID signals the whole group, including tinypool forks; never a bare PID.
- `pkill -9 -f vitest` sweep → catches anything that already escaped to PID 1 (the orphaning case), so leakage can't accumulate across runs.

`SAFE_TEST_TIMEOUT_MS` (default 600000 = 10 min) bounds the whole run; a timeout exits 124. The same kill+sweep runs on normal exit, on SIGINT/SIGTERM, and on spawn error.

A third belt-and-braces measure: .factory/run.ts sweeps stray host vitest at the start and end of each iteration, so an automated loop never inherits a previous run's leak.
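
How the detached group, the negative-PID kill, and the wall-clock timeout fit together can be sketched in one function. This is a condensed illustration under stated assumptions, not the contents of scripts/safe-test.mjs: `runBounded` is a hypothetical name, it is POSIX-only, it must run as an ES module, and it leaves out the `pkill` sweep and the SIGINT/SIGTERM handlers described above.

```javascript
import { spawn } from 'node:child_process';

const TIMEOUT_EXIT_CODE = 124; // the timeout exit code the text describes

// Run `bin args` as the leader of its own process group and resolve with an
// exit code. On timeout the whole group is killed, not just the leader.
function runBounded(bin, args, timeoutMs) {
  return new Promise((resolve) => {
    const child = spawn(bin, args, { stdio: 'inherit', detached: true });
    let timedOut = false;

    const killGroup = (signal) => {
      if (child.pid === undefined) return; // spawn failed, nothing to signal
      try {
        process.kill(-child.pid, signal); // negative PID = the whole group
      } catch {
        /* group already gone */
      }
    };

    const timer = setTimeout(() => {
      timedOut = true;
      killGroup('SIGKILL');
    }, timeoutMs);

    child.on('error', () => {
      clearTimeout(timer);
      killGroup('SIGKILL');
      resolve(1);
    });

    child.on('exit', (code) => {
      clearTimeout(timer);
      killGroup('SIGKILL'); // reach descendants that outlived the leader
      resolve(timedOut ? TIMEOUT_EXIT_CODE : code ?? 1);
    });
  });
}
```

A caller would typically forward the result, for example `runBounded(bin, args, 600_000).then((code) => process.exit(code))`, where `bin` and `args` are whatever test command the wrapper is meant to bound.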
`pnpm test:safe` → 565 tests green, and immediately after, `pgrep -f vitest` returns empty. Green alone isn't enough; the empty process table is the part that proves no worker leaked.

```shell
# the verification, conceptually
pnpm test:safe      # bounded run in its own group → 565 passed
pgrep -f vitest     # → (no output): nothing leaked. THIS is the proof.
```

| Rule | Why |
|---|---|
| Never run raw `pnpm -w test` in a loop or hand it to an automated builder; use `pnpm test:safe`. | The raw command has no group-kill; a single hang in a loop reproduces the leak. |
| After each loop iteration, sweep: `pgrep -f vitest \| xargs -r kill -9`. | A standing net even though the wrapper already does it. |
| Always `vitest run`, never `vitest` / watch mode. | Watch mode is a long-lived process, the opposite of what you want in CI/loops. |
| Any subprocess an orchestrator spawns: `{detached:true}` + kill the negative PGID, never a bare PID; always a hard timeout. | Generalizes the fix: the leak class is "a child tree outliving the killer." |

What does `process.kill(-child.pid, 'SIGKILL')` do that `process.kill(child.pid, …)` would not?

With `detached:true` (its own group via setsid), a negative PID reaches every descendant. A bare PID would kill only the vitest main and leave the forks to reparent, which is the exact bug.

"forks vs threads is just performance." Here it's also safety: the forks pool isolates a hang in a child process Vitest force-kills on teardown, so a stuck file can't pin a core the way an un-killable thread might.