Lesson 25 · Advanced · deepens Lesson 06

Test-safety engineering: the kill that can't miss

Lesson 6 told the story of the orphaned-worker leak: 16 stray vitest processes pinning ~1550% CPU. This lesson is the engineering underneath the fix: how a UNIX process group works, why detached:true creates one, why kill(-pgid) reaches every descendant where kill(pid) can't, and how the three-layer defense (a hardened vitest.config.ts, a process-group-killing safe-test.mjs wrapper, and a post-run sweep) makes the leak structurally impossible. The wrapper is a small file, but every line is load-bearing, and the lesson generalizes to any tool that spawns a worker tree.

The root problem: orphans escape the parent

Vitest's tinypool runs tests in worker child processes. If a test hangs with no teardown (an open socket, an unresolved promise, a live interval) and the parent is killed by PID alone, the workers don't die — they get reparented to PID 1 (the init process) and keep running, each pinning a core. Killing the parent process is not enough: you have to kill the whole tree, and a tree that has already reparented is no longer reachable from the parent at all.

[Figure: kill(pid) vs. kill(-pgid)]
kill(pid), parent only: the parent is killed; the workers keep running, reparent to PID 1, and keep pinning cores.
kill(-pgid), whole group: the parent and every worker sit in one process group (one setsid leader); the negative PID addresses every member at once, so nothing survives.
Defense in depth = config (fail on timeout) + group-kill + sweep (catch the escapee). pkill -9 -f vitest is the last resort for anything that already escaped to PID 1.
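To see whether orphans exist before reaching for a kill, the process table can be filtered for workers whose parent is PID 1. The sketch below is not part of safe-test.mjs. It assumes the input is the output of `ps -A -o pid=,ppid=,pgid=,args=`, and that orphans show up with ppid 1 as described above (on systems where a subreaper adopts orphans, the ppid can be a different process). The function name `findOrphans` and the sample rows are invented for illustration.

```javascript
// Sketch: spot workers that have been reparented to PID 1.
// Input is the text printed by:  ps -A -o pid=,ppid=,pgid=,args=
// (the "=" suffix on each column suppresses the header row).
function findOrphans(psOutput, pattern = /vitest/) {
  const orphans = [];
  for (const line of psOutput.split('\n')) {
    // Columns: pid, ppid, pgid, then the full command line.
    const match = line.trim().match(/^(\d+)\s+(\d+)\s+(\d+)\s+(.*)$/);
    if (!match) continue; // blank or unparseable line
    const [, pid, ppid, pgid, command] = match;
    // ppid 1 means the original parent is gone and init adopted the process.
    if (Number(ppid) === 1 && pattern.test(command)) {
      orphans.push({ pid: Number(pid), pgid: Number(pgid), command });
    }
  }
  return orphans;
}

// Hypothetical sample rows, invented for illustration.
const sample = `
  4100     1  4100 node /repo/node_modules/.bin/vitest run
  4312  4300  4300 node /repo/node_modules/.bin/vitest run
  4200     1  4200 /usr/sbin/cron -f
`;

// Only pid 4100 qualifies: it matches the pattern and its parent is PID 1.
console.log(findOrphans(sample));
```

The second row matches the name but still has a live parent, and the third has ppid 1 but is not a vitest process, so neither is reported.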

Layer 1 — make a hang FAIL instead of hang (vitest.config.ts)

The first defense keeps a hang from lasting. The shared config sets bounded timeouts and the forks pool, so a stuck file fails fast and is force-killed on teardown rather than spinning forever:

// vitest.config.ts:13-28 — the anti-orphan hardening
export default defineConfig({
  test: {
    environment: 'node',
    // A hung test (server/socket/MCP/interval/unresolved promise with no
    // teardown) must FAIL on a bounded timeout, never hang a worker forever.
    testTimeout: 15_000,
    hookTimeout: 15_000,
    teardownTimeout: 10_000,
    pool: 'forks',   // isolates hangs in child processes Vitest force-kills on teardown
  },
});

Why pool:'forks' and not worker threads (the default pool in Vitest releases before 2.0)? A hang inside a thread can wedge the host process; a hang inside a fork (a separate OS process) is isolated and "Vitest force-kills on teardown, so a stuck file cannot pin a CPU core." The timeouts turn an infinite wait into a test failure: visible, bounded, and CI-red.
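The effect of a bounded timeout can be sketched in plain JavaScript. This is not how Vitest implements testTimeout; it is a minimal illustration, with a hypothetical helper named `withTimeout`, of how racing the work against a deadline turns "never settles" into "rejects after N ms".

```javascript
// Sketch: turn an unbounded wait into a bounded failure.
// Not Vitest's implementation; an illustration of what a timeout buys you.
function withTimeout(promise, ms, label) {
  let timer;
  const deadline = new Promise((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`${label} timed out after ${ms}ms`)),
      ms,
    );
  });
  // Whichever settles first wins. Always clear the timer so it cannot
  // keep the event loop alive after the work finishes.
  return Promise.race([promise, deadline]).finally(() => clearTimeout(timer));
}

// A "test body" that never settles, like a promise with no teardown.
const hung = new Promise(() => {});

withTimeout(hung, 50, 'hung test').catch((err) => {
  console.error(`FAIL: ${err.message}`);
});
```

Note what the timeout does not do: it reports the failure, but whatever the hung code was holding (a socket, an interval) is still there. That gap is why the config pairs the timeouts with pool:'forks', where the whole worker process can be killed.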

Layer 2 — the wrapper that owns a process group

Config alone can't cover every escape (a native handle, a SIGKILL'd parent mid-run). So scripts/safe-test.mjs runs the whole suite in its own process group and kills the group, not the PID. The key is detached:true:

// scripts/safe-test.mjs:34-44
// detached:true => POSIX setsid => the child leads a NEW process group, so
// kill(-pid) reaches every descendant (vitest main + all tinypool forks).
const child = spawn(bin, args, { stdio: 'inherit', detached: true });

const killGroup = (signal) => {
  try {
    process.kill(-child.pid, signal);   // NEGATIVE pid = the whole group
  } catch {
    /* group already gone */
  }
};

The one idea to internalize: a negative PID is a process group

On POSIX, kill(pid, sig) signals one process; kill(-pgid, sig) signals every process in that group. detached:true calls setsid() so the child becomes a group leader — its PID is the group id. So process.kill(-child.pid, …) reaches the vitest main process and every tinypool fork in one syscall. That's the difference between "killed the parent, orphaned the kids" and "killed the family."
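Because the negated PID is what selects the group, the value being negated deserves a guard. The sketch below is a variation on the wrapper's killGroup, not the author's code: `signalGroup` is a hypothetical name, and the `kill` parameter is injectable only so the logic can be exercised without real processes. Two facts motivate the guards: child.pid is undefined when a spawn fails, and on POSIX kill(0) targets the caller's own group while kill(-1) targets every process the caller is allowed to signal.

```javascript
// Sketch: a group-kill helper with two guards the excerpt does not show.
function signalGroup(
  leaderPid,
  signal,
  kill = (pid, sig) => process.kill(pid, sig),
) {
  // Guard 1: never negate a pid that is missing, 0, or 1.
  // A bad value here could signal far more than the test tree.
  if (!Number.isInteger(leaderPid) || leaderPid <= 1) {
    throw new RangeError(`refusing to signal group of pid ${leaderPid}`);
  }
  try {
    kill(-leaderPid, signal); // negative pid = every member of the group
    return true;
  } catch (err) {
    // Guard 2: only "no such process" means the group is already gone.
    if (err.code === 'ESRCH') return false;
    throw err; // EPERM and anything else should surface
  }
}

// Exercise it with a recording stub instead of a real process group.
const calls = [];
signalGroup(4321, 'SIGTERM', (pid, sig) => calls.push([pid, sig]));
console.log(calls); // [ [ -4321, 'SIGTERM' ] ]
```

The wrapper's version swallows every error, which is simpler and acceptable for a best-effort kill; distinguishing ESRCH is a stricter choice that surfaces permission problems instead of hiding them.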

The hard timeout: SIGTERM, then SIGKILL, then sweep

A bounded wall-clock timer escalates politely-then-forcibly. SIGTERM first (let it clean up), SIGKILL 5 seconds later (force it), then a sweep, and exit 124 (the conventional timeout code):

// scripts/safe-test.mjs:46-58
const timer = setTimeout(() => {
  timedOut = true;
  process.stderr.write(`\n[safe-test] HARD TIMEOUT after ${TIMEOUT_MS}ms — killing process group -${child.pid}\n`);
  killGroup('SIGTERM');                 // ask nicely
  setTimeout(() => {
    killGroup('SIGKILL');               // then force, 5s later
    sweep();
    process.exit(124);                 // conventional "timed out" code
  }, 5_000);
}, TIMEOUT_MS);
timer.unref();                          // don't keep the event loop alive for the timer alone
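The excerpt always waits the full 5 seconds before SIGKILL. A variation, sketched below under stated assumptions, polls during the grace period and skips the SIGKILL when the group has already exited. The function name `terminateWithGrace` and the `send` and `isAlive` hooks are hypothetical; they stand in for the wrapper's killGroup and for a liveness check.

```javascript
// Sketch: SIGTERM, a grace period, then SIGKILL only if still needed.
// `send` delivers a signal to the group; `isAlive` reports whether any
// member remains. Both are hypothetical hooks supplied by the caller.
function terminateWithGrace({ send, isAlive, graceMs = 5_000, pollMs = 100 }) {
  return new Promise((resolve) => {
    send('SIGTERM'); // ask nicely
    const startedAt = Date.now();
    const poll = setInterval(() => {
      if (!isAlive()) {
        clearInterval(poll);
        resolve('SIGTERM'); // exited during the grace period
      } else if (Date.now() - startedAt >= graceMs) {
        clearInterval(poll);
        send('SIGKILL'); // grace period over: force it
        resolve('SIGKILL');
      }
    }, pollMs);
  });
}

// Simulate a group that ignores SIGTERM, using stubs and a short grace.
const sent = [];
terminateWithGrace({
  send: (signal) => sent.push(signal),
  isAlive: () => true,
  graceMs: 200,
  pollMs: 20,
}).then((last) => console.log(last, sent));
// prints: SIGKILL [ 'SIGTERM', 'SIGKILL' ]
```

In a real wrapper, one possible `isAlive` is a try/catch around process.kill(-child.pid, 0): signal 0 performs the existence and permission checks without delivering anything. Be aware that an unreaped zombie still counts as existing, so this check can report "alive" slightly longer than the process is actually running.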

Layer 3 — the sweep: catch the one that already escaped

If a worker reparented to PID 1 before the group-kill, it's no longer in the group, so the group-kill can't reach it. The last-resort sweep does, mirroring the operator's manual net, pgrep -f vitest | xargs kill -9:

// scripts/safe-test.mjs:24-32
const sweep = () => {
  try {
    execFileSync('pkill', ['-9', '-f', 'vitest'], { stdio: 'ignore' });
  } catch {
    /* nothing left to kill — pkill exits non-zero when no match */
  }
};

The sweep runs on every exit path — normal exit, signal, and timeout — so leakage "must not accumulate" (safe-test.mjs:80). On a clean exit the wrapper still calls killGroup('SIGKILL') to "reap any fork still lingering in the group," then sweeps. Belt and suspenders, because the cost of a leak is hours of pinned CPU.
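"Every exit path" is easiest to guarantee when the cleanup is registered once and made idempotent. The sketch below is one way to wire that up, not a copy of safe-test.mjs: `installCleanup` is a hypothetical name, the `proc` parameter exists so a stub can replace the real process object, and exiting with 128 plus the signal number is the shell convention for "killed by signal", assumed here rather than taken from the wrapper.

```javascript
// Sketch: run one cleanup exactly once, whichever exit path fires first.
function installCleanup(cleanup, proc = process) {
  let done = false;
  const runOnce = () => {
    if (done) return;
    done = true;
    cleanup();
  };
  // Termination signals: clean up, then exit with 128 + signal number.
  const signals = { SIGHUP: 1, SIGINT: 2, SIGTERM: 15 };
  for (const [name, number] of Object.entries(signals)) {
    proc.on(name, () => {
      runOnce();
      proc.exit(128 + number);
    });
  }
  // Normal completion and explicit exit() calls both emit 'exit'.
  proc.on('exit', runOnce);
  return runOnce;
}

// Exercise it with a stub standing in for the real process object.
const handlers = {};
const stub = {
  on: (event, fn) => { handlers[event] = fn; },
  exit: (code) => { console.log(`exit(${code})`); handlers.exit(); },
};
let sweeps = 0;
installCleanup(() => { sweeps += 1; }, stub);

handlers.SIGINT();                // simulate Ctrl-C
console.log(`sweeps: ${sweeps}`); // sweeps: 1
```

Only synchronous work is reliable inside an 'exit' handler, which fits a sweep built on execFileSync. One path stays uncovered by design: SIGKILL cannot be caught, so a wrapper that is itself SIGKILL'd runs no cleanup at all.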

Why three layers and not one

| Layer | Catches | Misses (handed to next) |
| --- | --- | --- |
| config timeouts + forks | most hangs: they fail fast and Vitest force-kills the fork | a parent killed externally mid-run; a native handle Vitest can't reap |
| group-kill (detached) | the whole live tree in one syscall | a worker that already reparented to PID 1 before the kill |
| pkill sweep | any stray vitest by name, including PID-1 orphans | none (the floor) |

Each layer's miss is the next layer's job. That is defense in depth: no single mechanism is trusted to be perfect, and the failure mode (a pinned core for hours) is severe enough to justify the redundancy.

1. Why does safe-test.mjs spawn the suite with detached:true?
Correct: b. detached:true ⇒ setsid ⇒ the child is a group leader whose PID is the group id. A negative PID in kill addresses the whole group, so one syscall reaches the entire worker tree, which is exactly what a plain kill(pid) cannot do.
2. A worker reparents to PID 1 before the group-kill fires. Which layer catches it?
Correct: d. Once reparented to PID 1, the worker is outside the original process group, so kill(-pgid) can't reach it. The name-based sweep is precisely the last-resort net for that case, and it runs on every exit path.
3. Why set bounded testTimeout/teardownTimeout and use pool:'forks' instead of relying only on the kill-the-group wrapper?
Correct: c. Defense in depth starts upstream: prevent the hang from pinning anything by failing fast in an isolated fork. The wrapper + sweep are the backstops for the cases the config can't cover (external kill mid-run, native handles).

Common confusions

"kill(pid) kills the children too." It doesn't — it signals one process. Children survive and reparent to PID 1. You need the process-group form (kill(-pgid)) to reach the tree, which is exactly why detached:true exists in the wrapper.
"The sweep is overkill if the group-kill works." The group-kill can't reach a process that already left the group (reparented to PID 1). The sweep is not redundant — it covers a case the group-kill structurally cannot. The verification (Lesson 6) confirms both: 565 green and pgrep -f vitest empty.