Lesson 19 · Engine & method · 6 of 8

The swarm: orchestrator → lead → worker

The L3 swarm is how Alembic runs many units of work concurrently and survives crashes. A depth-bounded 3-tier orchestrator over a dependency-gated task queue, with filesystem-as-truth state, git-worktree isolation, a hard T4 park, and crash-safe resume. This is the package the fusion matrix says made Hermes' delegate_tool.py an IGNORE — Alembic already delegates, natively and better. Source: packages/swarm/.

Three tiers, one level of nesting

The role hierarchy is orchestrator → lead → worker, and it is bounded: MAX_DEPTH = 2. A node at max depth is a leaf and may not spawn; subtasks are leaf tasks, so nesting is structurally one level deep. This is a deliberate guard against runaway recursive fan-out:

The queue — dependency-gated, ready means ready

Only ready tasks run. A task becomes ready when all of its dependsOn reach terminal-success (done). The orchestrator never runs a task whose prerequisites haven't actually succeeded — the dependency graph is honored, not hoped for. Worker kinds:

Worker kind	What runs
model worker	an adapter call (the `run()` from lesson 14)
command worker	`taskSpec.command` → a real subprocess; exit 0 → `done`
background worker	`taskSpec.background` → a detached child that survives parent death and re-attaches to its report on resume (requires `command`, forbids `isolate`)

Crash-safe resume — filesystem is the truth

State is durable on disk: an append-only events.jsonl journal plus a checkpoint.json. Resume replays the last checkpoint and every task-state event (idempotent, latest-per-id wins). The clever bit is how it handles a worker that died mid-flight:

// packages/swarm/src/orchestrator.ts:706-755 (condensed) — replayInto
// replay checkpoint + every task-state event (latest-per-id wins, idempotent)
// an orphaned 'running' task (worker died) is demoted to 'ready' and re-attempted
//   ⇒ at-least-once execution; enqueue is idempotent so the monitor's denominator
//     covers every task even after a crash mid-declaration

Every value crossing a durability boundary is Zod-parsed on read, so a corrupt or hand-edited journal line is rejected at the boundary rather than silently trusted (filesystem-as-truth, invariant ③ from lesson 16). Global id uniqueness across the whole tree is validated before anything is journaled, so a duplicate id can't poison a resume.

Isolation and the T4 park — fail-closed, both

Worktree isolation fail-closed

isolate: true without worktree config is an error, never a silent un-isolated run. If you asked for isolation and the environment can't provide it, the run stops — it does not quietly execute your task against the shared working tree. A command worker that must not touch the main checkout gets a git worktree via withWorktree, or it doesn't run at all.

Hard T4 park

Irreversible / legal / security / T4 tasks are routed to the park ledger and never auto-executed. The park is idempotent across resumes and structurally guarded — a parked task is immovable. Park reasons are a closed set: tier-t4 / irreversible / legal / security / manual. This is the swarm-level expression of the human-gate principle: the engine refuses to take the dangerous step on its own, and you reopen it with alembic approve/reject/propose.

The reward signal (computeReward) is a heuristic shaped scalar — PARL-style, not reinforcement learning — and it is HITL-gated by requiresApproval. The name evokes RL; the implementation is a deterministic heuristic with a human in the loop.

1. A lead at depth 1 spawns workers. Can one of those workers spawn its own sub-workers?

Correct: c. Depth is bounded at 2. A worker is a leaf; subtasks are leaf tasks, so nesting is structurally one level. This prevents runaway recursive fan-out — the depth bound is an invariant, not a setting.

2. A worker process is killed mid-task and the run is later resumed. What happens to that task?

Correct: b. replayInto replays the checkpoint plus task-state events; a task left running by a dead worker is demoted to ready and re-run. Combined with idempotent enqueue, resume covers every task even after a mid-declaration crash.

3. A task spec sets isolate: true but no worktree config is provided. What does the orchestrator do?

Correct: d. Worktree isolation is fail-closed: asking for isolation you can't get stops the run. The engine never quietly executes a task against the shared checkout when you explicitly requested separation.

Common confusions

"PARL reward means it learns by reinforcement." No — computeReward is a heuristic shaped scalar, explicitly not RL, and it's HITL-gated by requiresApproval. The reward shapes prioritization with a human in the loop; there is no gradient, no training.

"Background workers run truly async, freeing the slot." Not yet — a known gap: background workers still block the task slot while polling for the report; the async-drain enhancement is a future slice (the dispatcher seam already accommodates it). The course states gaps honestly rather than overclaiming.

← Lesson 18 Lesson 20 →

Sources (all in the repo, read verbatim):
· packages/swarm/src/index.ts (6–21) — the package's own summary. packages/swarm/src/types.ts — MAX_DEPTH(2)/ROLE_DEPTH (96+), dependency-gating (66–72), worker kinds (84–108), reward heuristic (168–184), content-addressed run-id reactivity (247–257).
· packages/swarm/src/orchestrator.ts — never-throws (158–242), canSpawn at depth (107), hard T4 park (244–271), crash-safe replayInto (706–755), global id uniqueness (141–150), worktree fail-closed (503–522), runIdFor (168).
· docs/alembic-complete-map.md §7.6; docs/alembic-hermes-fusion-matrix.md §2.4 (delegate = IGNORE because swarm exists). ← Course hub · Português