Course / Lesson 2  ·  PT-BR
Lesson 02 · The system we learned from

Reverse-Engineering Hermes

The Hermes Agent is Nous Research's "self-improving AI agent" — a million-line Python codebase. You cannot fuse what you do not understand, so the fusion began with a map. This lesson is that map's core: what Hermes is, the one structural idea that makes it extensible, and the closed learning loop that became the whole reason for the fusion.

~2,497
Python files
~1.18M
lines of code
87
tool files in tools/
30
toolset keys

Scale reality: the map is not a line-by-line read of all 1.18M LOC — it deep-dives the priorities and catalogs the long tail, with the coverage frontier stated explicitly. Source: docs/hermes-complete-map.md §1, §6.

What Hermes is

Hermes is a personal AI agent that runs the same agent core across many surfaces — a CLI, a messaging gateway (Telegram, Discord, Slack, and ~20 other platforms), an Ink/React TUI, and an Electron desktop app. Built by Nous Research, MIT-licensed. Its headline differentiators, in its own words:

The two design constraints everything is downstream of

1 · Per-conversation prompt caching is sacred. A long conversation reuses a cached prefix every turn; anything that mutates past context or rebuilds the system prompt mid-conversation invalidates the cache and multiplies cost. The only sanctioned exception is context compression. 2 · The core is a narrow waist; capability lives at the edges. Every model tool is sent on every API call, so the bar for a new core tool is high — new capability arrives as a CLI command + skill or a plugin, not as core surface.

Notice the rhyme with Lesson 1: Hermes and Alembic independently arrived at "a narrow waist + cache-stable prompt." That shared instinct is exactly why the fusion composes instead of fighting — and it is the thread we pull on in Lesson 3.

The linchpin: the tool-registration architecture

This is the single most important and most portable structural idea in Hermes. Tools self-register at import time; nothing maintains a central import list. Each tool file calls registry.register(...) at module level; a discovery pass globs the directory and imports every file, and the import side-effect populates the registry.

run_agent.py · cli.py · batch_runner.py entry points trigger discovery model_tools.py imports registry + runs discover_builtin_tools() tools/*.py  — 87 files each calls registry.register(...) at module level (import side-effect) tools/registry.py no deps — owns schema, dispatch, availability, limits

There is a crucial nuance the map insists on — and it is the source of a number that gets mis-quoted. Auto-discovery registers a tool's schema, but the tool is only exposed to an agent if its name appears in a toolset in toolsets.py. So three different counts describe three different things:

CountWhat it actually measures
87tool files in tools/*.py (verified ls | wc -l). Includes infra files like registry.py, security scanners, and shared helpers — not all are agent-facing tools.
30toolset keys in toolsets.py (browser, memory, skills, web, …) — the named bundles a platform inherits from.
"64 tools"The figure the orange-book learning material cites for agent-facing tools. We keep the file/tool distinction explicit rather than collapsing all three into one number.
Why this matters for the map. "How many tools does Hermes have?" has no single right answer — it depends whether you count files, toolset keys, or exposed tools. A faithful reverse-engineering states all three and what each measures, instead of picking the roundest.

The closed learning loop — the "self-improving" claim, mechanically (§1.10)

The headline property is concrete in-code, not marketing. After a turn, the agent can fork itself to review what just happened and decide whether anything is worth remembering — and the writes land in durable stores the next session reads.

a user turn run_conversation() background review fork daemon thread, forked AIAgent whitelist: memory + skill tools only durable stores MEMORY.md / USER.md + agent-authored skills curator lifecycle next session loads the refreshed snapshot — the main conversation + prompt cache are NEVER touched

The pieces, each concrete in source:

The one sentence to carry into Lesson 3

For Alembic, this whole loop — fork-after-turn → self-review under a memory/skill-only whitelist → write to durable stores → curator lifecycle → next session reads the refreshed snapshot — is the highest-value conceptual CLONE, and it composes cleanly with Alembic's existing validator/gate pipeline.

The mini-loop discipline (§5)

The hermes-mini-loop repo is a minimal reference implementation of the same idea, and it states the discipline as three rules — the rules that keep a self-improving loop from drowning in its own noise:

RuleWhat it prevents
Learn only from wins (score ≥ 0.7)Sedimenting lessons from failed or low-confidence turns. Only validated successes become durable memory.
Reinforce, don't duplicateRe-seeing the same fact should strengthen the existing entry, not append a near-copy that bloats the bounded store.
Recombine proven atomsNew skills are composed from already-validated pieces rather than invented wholesale — improvement by recombination.

Hold onto the 0.7 threshold and "reinforce-don't-duplicate" — both reappear, ported verbatim, as the default gate and the dedup behavior of Alembic's loop in Lesson 4.

Common confusions

"Hermes is a model gateway." Its gateway is a messaging gateway (Telegram/Discord/…), not a model gateway. The model-provider role is distributed across providers/ + plugins/model-providers/ + a local OpenAI-compatible proxy — a different subsystem entirely.
"The background reviewer changes the current conversation." No — it is a fork. It writes to durable stores; the live conversation and its prompt cache are never mutated. "Self-improving" is literal but deferred: the improvement shows up on the next session's load.
"87 = the number of tools." 87 is the number of files in tools/. Exposure is gated by the 30 toolset keys; the orange-book counts ~64 agent-facing tools. Three numbers, three meanings.
1. In Hermes, what is the relationship between the 87 and the 30?
Correct: b. Auto-discovery registers a tool's schema (87 files, many of them infra/helpers), but a tool is only exposed if its name is in a toolset (30 keys). The orange-book's "64 tools" is yet a third figure — agent-facing tools. A faithful map keeps the file/tool distinction.
2. Why does the background review run in a forked agent on a daemon thread, with a memory/skill-only tool whitelist?
Correct: d. Constraint #1 (prompt caching is sacred) forbids mutating the live conversation. A fork inherits the cached prompt verbatim, writes to durable stores, and leaves the main turn untouched — so "self-improving" is real but deferred to the next session.
3. The mini-loop's "learn only from wins (score ≥ 0.7)" rule exists to prevent what?
Correct: c. A self-improving loop that learns from everything quickly poisons itself. The 0.7 floor (plus "reinforce, don't duplicate" and "recombine proven atoms") keeps only validated wins — and that exact threshold is ported into Alembic's gate in Lesson 4.