Two tools, speech-to-text and image understanding, are built on the exact same ports pattern you've now seen six times. Each is a tiny dispatch kernel: validate the request (Zod), call the injected backend, re-validate the untrusted result (Zod), return a Result. The production backends are dependency-free fetch impls. This lesson is the closing proof that the discipline scales: a new capability is a port plus a kernel plus a thin fetch seam, nothing more. The module is a CLONE of Hermes' transcription_tools.py (1799 LOC) + vision_tools.py, CLOUD path only.
Each tool's network call becomes an injected port — a single function returning a Result. The module imports no SDK:
    // packages/hermes/src/media/types.ts:133-148
    export type TranscriptionBackend = (
      req: TranscriptionRequest,
    ) => Promise<Result<TranscriptionResult, Error>>;

    export type VisionBackend = (
      req: VisionRequest,
    ) => Promise<Result<VisionResult, Error>>;
    // packages/hermes/src/media/media.ts:51-68 (transcribe)
    const parsedReq = transcriptionRequestSchema.safeParse(req);
    if (!parsedReq.success) return err(new Error(`Invalid transcription request: …`));
    const transcribed = await deps.backend(parsedReq.data);
    if (!transcribed.ok) return transcribed; // backend failure ⇒ err
    const parsed = transcriptionResultSchema.safeParse(transcribed.value);
    if (!parsed.success) return err(new Error(`Invalid transcription result: …`)); // untrusted out
    return ok(parsed.data);
analyzeImage has exactly the same shape (validate request → backend → validate result); only the schemas differ. That symmetry is the lesson: once the pattern is established, a new media tool is mechanical.
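To make the symmetry concrete, here is a minimal, dependency-free sketch of one generic kernel instantiated twice. It is not the package's code: the `Result` helpers, `makeKernel`, the `field` validator (a plain type guard standing in for a Zod schema's `safeParse`), and the vision request field `imageUrl` are all assumptions introduced for illustration.

```typescript
// Minimal Result type (assumption: the real package ships its own).
type Result<T, E> = { ok: true; value: T } | { ok: false; error: E };
const ok = <T>(value: T): Result<T, never> => ({ ok: true, value });
const err = <E>(error: E): Result<never, E> => ({ ok: false, error });

// Hypothetical stand-in for a Zod schema: parse unknown input into a Result.
type Validator<T> = (input: unknown) => Result<T, Error>;

const isRecord = (v: unknown): v is Record<string, unknown> =>
  typeof v === 'object' && v !== null;

// Validator that requires one string field and keeps only that field.
const field = (key: string): Validator<Record<string, string>> => (input) =>
  isRecord(input) && typeof input[key] === 'string'
    ? ok({ [key]: input[key] as string })
    : err(new Error(`expected string field "${key}"`));

// Generic kernel: validate request -> call injected backend -> re-validate result.
const makeKernel =
  <Req, Res>(validateReq: Validator<Req>, validateRes: Validator<Res>) =>
  async (
    deps: { backend: (req: Req) => Promise<Result<unknown, Error>> },
    req: unknown,
  ): Promise<Result<Res, Error>> => {
    const parsedReq = validateReq(req);
    if (!parsedReq.ok) return parsedReq; // bad request never reaches the backend
    const raw = await deps.backend(parsedReq.value);
    if (!raw.ok) return raw; // backend failure passes through as err
    return validateRes(raw.value); // backend output is treated as untrusted
  };

// Two tools, one shape: only the validators differ.
const transcribe = makeKernel(field('audioUrl'), field('text'));
const analyzeImage = makeKernel(field('imageUrl'), field('analysis'));
```

In this sketch a backend that returns `{ text: '' }` still yields an `ok` result, while a backend that returns `{ analysis: 42 }` is rejected by the second validator, so the two outcomes stay structurally distinct.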
The transcription request models a portable audio source — a URL the backend fetches, or inline base64 — and enforces "exactly one" with a Zod refinement (an XOR over the two optionals):
    // packages/hermes/src/media/types.ts:62-74 (transcriptionRequestSchema)
    z.object({
      audioUrl: z.string().url('audioUrl must be a valid URL').optional(),
      audioBase64: z.string().min(1, 'audioBase64 cannot be empty').optional(),
      mimeType: z.string().min(1).optional(),
    }).refine(
      (req) => (req.audioUrl === undefined) !== (req.audioBase64 === undefined),
      { message: 'exactly one of audioUrl or audioBase64 is required' },
    );
The !== over two === undefined checks is a boolean XOR: true only when exactly one source is present. Two tests pin both failure modes — neither source, and both sources, are each rejected at the boundary before the backend is ever called.
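The XOR can be checked in isolation by enumerating all four combinations. The sketch below lifts the refinement predicate into a standalone function; `AudioSource`, `hasExactlyOneSource`, and the sample values are names made up for this illustration and are not taken from the package.

```typescript
// Hypothetical shape mirroring the two optional source fields.
type AudioSource = { audioUrl?: string; audioBase64?: string };

// `!==` on two booleans is XOR: true only when exactly one side is true.
const hasExactlyOneSource = (req: AudioSource): boolean =>
  (req.audioUrl === undefined) !== (req.audioBase64 === undefined);

const cases: Array<[string, AudioSource]> = [
  ['neither', {}],
  ['url only', { audioUrl: 'https://example.com/clip.wav' }],
  ['base64 only', { audioBase64: 'UklGRg==' }],
  ['both', { audioUrl: 'https://example.com/clip.wav', audioBase64: 'UklGRg==' }],
];

// One line per case, e.g. "neither: false".
const truthTable = cases.map(
  ([label, req]) => `${label}: ${hasExactlyOneSource(req)}`,
);
// truthTable is:
//   neither: false
//   url only: true
//   base64 only: true
//   both: false
```

Because the predicate compares against `undefined`, a key that is present but explicitly set to `undefined` counts as absent in this sketch.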
The Python source returns {success, transcript, provider, error} for STT and {success, analysis} for vision. Here, success/error collapse into the Result wrapper, and the success core is trimmed: transcript becomes the engine-idiomatic text (with optional provider provenance), and analysis stays analysis. The win: a failure can't masquerade as empty text. It is an err, structurally distinct from ok({text:''}) (legitimate silence).

Like the web backend, the media backends are thin fetch impls over the global fetch, with defensive payload mapping and field fallbacks. The kernel re-validates, so the mapper can be forgiving:
    // packages/hermes/src/media/fetch-backends.ts:163-178 (defensive row mapping)
    const mapTranscriptionRow = (payload) => {
      const provider = readField(payload, 'provider');
      return {
        text: asString(readField(payload, 'text') ?? readField(payload, 'transcript')), // fallback
        ...(typeof provider === 'string' && provider.length > 0 ? { provider } : {}),
      };
    };

    const mapVisionRow = (payload) => ({
      analysis: asString(readField(payload, 'analysis') ?? readField(payload, 'content')), // fallback
    });
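The lesson does not show `readField` or `asString`, so the following is only a guess at how a forgiving mapper and a strict gate might divide the work. `readField` here is a hypothetical implementation, `asString` is left out entirely (the mapped `text` stays `unknown`), and `isValidResult` is a plain type guard standing in for the kernel's Zod re-validation.

```typescript
// Hypothetical helper: read a property from an untrusted payload without throwing.
const readField = (payload: unknown, key: string): unknown =>
  typeof payload === 'object' && payload !== null
    ? (payload as Record<string, unknown>)[key]
    : undefined;

// Forgiving mapper: accepts either field name and does not check the type of `text`.
const mapTranscriptionRow = (
  payload: unknown,
): { text: unknown; provider?: string } => {
  const provider = readField(payload, 'provider');
  return {
    // `??` falls back only on null/undefined, so an empty string is kept.
    text: readField(payload, 'text') ?? readField(payload, 'transcript'),
    // Include `provider` only when it is a non-empty string.
    ...(typeof provider === 'string' && provider.length > 0 ? { provider } : {}),
  };
};

// Strict gate (assumption: stands in for the kernel's schema check).
const isValidResult = (row: { text: unknown }): row is { text: string } =>
  typeof row.text === 'string';

const fromAlias = mapTranscriptionRow({ transcript: 'hello', provider: 'acme' });
const silence = mapTranscriptionRow({ text: '', transcript: 'ignored' });
const wrongType = mapTranscriptionRow({ text: 42 });
const notAnObject = mapTranscriptionRow(null);
```

Here `fromAlias` and `silence` pass the gate, while `wrongType` and `notAnObject` do not. Using `??` rather than `||` matters for the silence case: `||` would treat the empty string as missing and pick up the fallback field instead.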
The source supports six cloud STT providers and a local faster-whisper path (Python ML, GPU/model-download territory). The fusion matrix marks the local path IGNORE: it's Python-ML-bound and out of scope for a portable TypeScript kernel, while all six cloud providers collapse to one injected port (they all just POST audio to an endpoint). This is the discipline working as intended — clone the portable structure, ignore what doesn't translate, and say so explicitly. Tests inject a fake fetch proving the mapping and the transport fail-closed paths (non-2xx, network throw, unparseable JSON each → err) without opening a socket. One test even proves a non-string payload field fails closed via the kernel's Zod gate — defense in depth, again.
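The three transport failure paths named above (non-2xx, network throw, unparseable JSON) can be sketched with an injected fetch. This is an illustration under assumptions, not the real `createFetchTranscriptionBackend`: `createJsonBackend`, the `FetchLike` type, the endpoint URL, and the `Result` helpers are all invented for the example.

```typescript
// Minimal Result type (assumption: the real package ships its own).
type Result<T, E> = { ok: true; value: T } | { ok: false; error: E };
const ok = <T>(value: T): Result<T, never> => ({ ok: true, value });
const err = <E>(error: E): Result<never, E> => ({ ok: false, error });

// Narrow structural type for the injected fetch, so the sketch needs no DOM typings.
type FetchLike = (
  url: string,
  init: { method: string; headers: Record<string, string>; body: string },
) => Promise<{ ok: boolean; status: number; json: () => Promise<unknown> }>;

// Hypothetical factory: POST the request as JSON and never throw.
const createJsonBackend =
  (endpoint: string, fetchImpl: FetchLike) =>
  async (req: unknown): Promise<Result<unknown, Error>> => {
    try {
      const res = await fetchImpl(endpoint, {
        method: 'POST',
        headers: { 'content-type': 'application/json' },
        body: JSON.stringify(req),
      });
      if (!res.ok) return err(new Error(`HTTP ${res.status}`)); // non-2xx
      return ok(await res.json()); // a rejected json() is caught below
    } catch (cause) {
      // Network throw or unparseable JSON: convert to err instead of propagating.
      return err(cause instanceof Error ? cause : new Error(String(cause)));
    }
  };

// A fake fetch that simulates an outage; no socket is opened.
const unavailable: FetchLike = async () => ({
  ok: false,
  status: 503,
  json: async () => ({}),
});
const backend = createJsonBackend('https://stt.example/transcribe', unavailable);
```

Calling `backend(...)` with this fake resolves to an `err` whose message is `HTTP 503`. The payload itself comes back as `unknown`, which leaves type checking of the body to the kernel's re-validation step.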
A request sets both audioUrl and audioBase64. What happens?
The .refine is an XOR: (audioUrl===undefined) !== (audioBase64===undefined) is true only when exactly one is present. Both (or neither) fails closed with err: it never throws and never reaches the backend.

Why do success/error not appear as fields on TranscriptionResult?
success/error become the Result's ok/err. The benefit is that an empty transcript (real silence) is ok({text:''}), never confused with a failure, which is err.

Why is the local faster-whisper transcription path marked IGNORE in the fusion?
It is Python-ML-bound and out of scope for a portable TypeScript kernel, while the six cloud providers collapse to one TranscriptionBackend port.

createFetchTranscriptionBackend/createFetchVisionBackend are thin generic-JSON seams over an injectable fetch, so tests never open a socket.

An empty text is legitimate (silence) and arrives as ok({text:''}). A failure is an err. Keeping success/error out of the payload and in the Result is exactly what makes the two unambiguous.

Every @alembic/hermes subsystem obeys the same discipline: inject the ports, return Result, validate untrusted input with Zod, never throw, no Date.now()/Math.random(). That's Lesson 5 made concrete, seven times over. Re-read Lesson 5 now and it should read like a summary of everything above.