Lesson 13 · Deep dive · subsystem 7 of 7

transcribe / analyzeImage — media ports

Two tools, speech-to-text and image understanding, built on the same ports pattern you've now seen six times. Each is a tiny dispatch kernel: validate the request (Zod), call the injected backend, re-validate the untrusted result (Zod), return a Result. The production backends are dependency-free fetch implementations. This lesson is the closing proof that the discipline scales: a new capability is a port plus a kernel plus a thin fetch seam, nothing more. The module is a CLONE of Hermes' transcription_tools.py (1799 LOC) and vision_tools.py, cloud path only.

Two backends, two kernels, one shape

Each tool's network call becomes an injected port — a single function returning a Result. The module imports no SDK:

// packages/hermes/src/media/types.ts:133-148
export type TranscriptionBackend = (
  req: TranscriptionRequest,
) => Promise<Result<TranscriptionResult, Error>>;

export type VisionBackend = (
  req: VisionRequest,
) => Promise<Result<VisionResult, Error>>;

// packages/hermes/src/media/media.ts:51-68 — transcribe
const parsedReq = transcriptionRequestSchema.safeParse(req);
if (!parsedReq.success) return err(new Error(`Invalid transcription request: …`));
const transcribed = await deps.backend(parsedReq.data);
if (!transcribed.ok) return transcribed;          // backend failure ⇒ err
const parsed = transcriptionResultSchema.safeParse(transcribed.value);
if (!parsed.success) return err(new Error(`Invalid transcription result: …`)); // untrusted out
return ok(parsed.data);

analyzeImage has exactly the same shape (validate request → backend → validate result); only the schemas differ. That symmetry is the lesson: once the pattern is established, a new media tool is mechanical.
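To make the symmetry concrete, here is a minimal, dependency-free sketch of the three-step kernel written once and instantiated for vision. It is an illustration under assumptions, not the repo's code: makeKernel, Parser, analyzeImageSketch, and the imageUrl request field are hypothetical names, and the hand-written parsers stand in for the Zod schemas the real module uses.

```typescript
// Minimal Result helpers (assumed shape; the course's own helpers are not shown here).
type Result<T, E> = { ok: true; value: T } | { ok: false; error: E };
const ok = <T>(value: T): Result<T, never> => ({ ok: true, value });
const err = <E>(error: E): Result<never, E> => ({ ok: false, error });

// Hypothetical stand-in for a Zod schema: returns the typed value, or undefined if invalid.
type Parser<T> = (input: unknown) => T | undefined;

const isRecord = (v: unknown): v is Record<string, unknown> =>
  typeof v === 'object' && v !== null;

// One generic kernel: validate request, call the injected backend, re-validate the result.
const makeKernel =
  <Req, Res>(label: string, parseReq: Parser<Req>, parseRes: Parser<Res>) =>
  async (
    deps: { backend: (req: Req) => Promise<Result<unknown, Error>> },
    req: unknown,
  ): Promise<Result<Res, Error>> => {
    const parsedReq = parseReq(req);
    if (parsedReq === undefined) return err(new Error(`Invalid ${label} request`));
    const out = await deps.backend(parsedReq);
    if (!out.ok) return out; // backend failure passes through as err
    const parsedRes = parseRes(out.value);
    if (parsedRes === undefined) return err(new Error(`Invalid ${label} result`));
    return ok(parsedRes);
  };

// Hypothetical vision shapes; only the `analysis` field name is taken from the lesson.
const parseVisionReq: Parser<{ imageUrl: string }> = (input) => {
  if (!isRecord(input)) return undefined;
  const imageUrl = input['imageUrl'];
  return typeof imageUrl === 'string' ? { imageUrl } : undefined;
};

const parseVisionRes: Parser<{ analysis: string }> = (input) => {
  if (!isRecord(input)) return undefined;
  const analysis = input['analysis'];
  return typeof analysis === 'string' ? { analysis } : undefined;
};

// A second media tool is just another pair of parsers handed to the same kernel.
const analyzeImageSketch = makeKernel('vision', parseVisionReq, parseVisionRes);
```

The design point is that the kernel never learns what a backend is. It only knows that the request must parse before the call and the result must parse after it, so a fake backend in a test exercises the same path as a production one.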

Cross-field validation: exactly one audio source

The transcription request models a portable audio source — a URL the backend fetches, or inline base64 — and enforces "exactly one" with a Zod refinement (an XOR over the two optionals):

// packages/hermes/src/media/types.ts:62-74 — transcriptionRequestSchema
z.object({
  audioUrl: z.string().url('audioUrl must be a valid URL').optional(),
  audioBase64: z.string().min(1, 'audioBase64 cannot be empty').optional(),
  mimeType: z.string().min(1).optional(),
}).refine(
  (req) => (req.audioUrl === undefined) !== (req.audioBase64 === undefined),
  { message: 'exactly one of audioUrl or audioBase64 is required' },
);

The !== over two === undefined checks is a boolean XOR: true only when exactly one source is present. Two tests pin both failure modes — neither source, and both sources, are each rejected at the boundary before the backend is ever called.
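The XOR can be checked by hand against all four combinations. The sketch below lifts the predicate out of the refinement into a plain function; hasExactlyOneSource, AudioSource, and the example values are illustrative names, not identifiers from the repo.

```typescript
type AudioSource = { audioUrl?: string; audioBase64?: string };

// Same predicate as the refinement above: true only when exactly one field is set.
const hasExactlyOneSource = (req: AudioSource): boolean =>
  (req.audioUrl === undefined) !== (req.audioBase64 === undefined);

// The full truth table, one row per combination of the two optionals.
const truthTable = {
  neither: hasExactlyOneSource({}), // false
  urlOnly: hasExactlyOneSource({ audioUrl: 'https://example.com/a.wav' }), // true
  base64Only: hasExactlyOneSource({ audioBase64: 'UklGRg==' }), // true
  both: hasExactlyOneSource({
    audioUrl: 'https://example.com/a.wav',
    audioBase64: 'UklGRg==',
  }), // false
};
```

Comparing two booleans with !== is the idiomatic way to write XOR in TypeScript, since the language has no logical XOR operator.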

The envelope collapse. The Python source returns flat dicts: {success, transcript, provider, error} for STT, {success, analysis} for vision. Here, success/error collapse into the Result wrapper, and the success core is trimmed: transcript → the engine-idiomatic text (with optional provider provenance), analysis stays analysis. The win: a failure can't masquerade as empty text — it's an err, structurally distinct from ok({text:''}) (legitimate silence).
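As an illustration of that collapse, the sketch below converts a flat STT envelope of the kind described above into a Result. The converter fromFlatEnvelope and the FlatSttEnvelope type are hypothetical and written only for this example; the lesson does not say the repo contains such a function.

```typescript
type Result<T, E> = { ok: true; value: T } | { ok: false; error: E };

// Flat envelope in the style the lesson attributes to the Python source.
type FlatSttEnvelope = {
  success: boolean;
  transcript?: string;
  provider?: string;
  error?: string;
};

type TranscriptionResult = { text: string; provider?: string };

// Hypothetical converter: success/error move into the Result, transcript becomes text.
const fromFlatEnvelope = (
  env: FlatSttEnvelope,
): Result<TranscriptionResult, Error> => {
  if (!env.success) {
    return { ok: false, error: new Error(env.error ?? 'transcription failed') };
  }
  if (typeof env.transcript !== 'string') {
    return { ok: false, error: new Error('success envelope without transcript') };
  }
  return {
    ok: true,
    value: {
      text: env.transcript,
      ...(env.provider ? { provider: env.provider } : {}),
    },
  };
};

// Silence and failure are now structurally different values.
const silence = fromFlatEnvelope({ success: true, transcript: '' });
const failure = fromFlatEnvelope({ success: false, error: 'quota exceeded' });
```

Here silence is an ok value whose text is the empty string, while failure is an err carrying the message. A caller that checks only the text can no longer mistake one for the other.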

The fetch backends — defensive mapping, field fallbacks

Like the web backend, the media backends are thin fetch impls over the global fetch, with defensive payload mapping and field fallbacks. The kernel re-validates, so the mapper can be forgiving:

// packages/hermes/src/media/fetch-backends.ts:163-178 — defensive row mapping
const mapTranscriptionRow = (payload) => {
  const provider = readField(payload, 'provider');
  return {
    text: asString(readField(payload, 'text') ?? readField(payload, 'transcript')), // fallback
    ...(typeof provider === 'string' && provider.length > 0 ? { provider } : {}),
  };
};
const mapVisionRow = (payload) => ({
  analysis: asString(readField(payload, 'analysis') ?? readField(payload, 'content')), // fallback
});
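The bodies of readField and asString are not shown above, so the following is a guess at how such a forgiving mapper can work, with hypothetical names (readFieldSketch, mapTranscriptionRowSketch, passesTextGate). It leaves the text value untyped and lets a later check, standing in for the kernel's Zod gate, decide whether it is acceptable.

```typescript
// Hypothetical helper: read a key from an unknown payload without trusting its shape.
const readFieldSketch = (payload: unknown, key: string): unknown =>
  typeof payload === 'object' && payload !== null
    ? (payload as Record<string, unknown>)[key]
    : undefined;

// Forgiving mapper: prefer 'text', fall back to 'transcript', keep provider only if usable.
const mapTranscriptionRowSketch = (
  payload: unknown,
): { text: unknown; provider?: string } => {
  const provider = readFieldSketch(payload, 'provider');
  return {
    text: readFieldSketch(payload, 'text') ?? readFieldSketch(payload, 'transcript'),
    ...(typeof provider === 'string' && provider.length > 0 ? { provider } : {}),
  };
};

// Stand-in for the kernel's re-validation step (the real kernel uses a Zod schema).
const passesTextGate = (row: { text: unknown }): boolean =>
  typeof row.text === 'string';

const primary = mapTranscriptionRowSketch({ text: 'hello' });
const fallback = mapTranscriptionRowSketch({ transcript: 'hi', provider: 'acme' });
const silent = mapTranscriptionRowSketch({ text: '', transcript: 'ignored' });
const wrongType = mapTranscriptionRowSketch({ text: 42 });
```

Note that ?? falls back only on null or undefined, so an empty text (silence) is kept rather than replaced by the transcript field; || would have discarded it. The wrongType row maps without complaint and is stopped only by the gate, which is the division of labour the paragraph above describes.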

Why the local ML path is IGNORED — a deliberate matrix decision

The source supports six cloud STT providers and a local faster-whisper path (Python ML, GPU/model-download territory). The fusion matrix marks the local path IGNORE: it's Python-ML-bound and out of scope for a portable TypeScript kernel, while all six cloud providers collapse to one injected port (they all just POST audio to an endpoint). This is the discipline working as intended — clone the portable structure, ignore what doesn't translate, and say so explicitly. Tests inject a fake fetch proving the mapping and the transport fail-closed paths (non-2xx, network throw, unparseable JSON each → err) without opening a socket. One test even proves a non-string payload field fails closed via the kernel's Zod gate — defense in depth, again.
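The fail-closed transport paths can be sketched with an injected fetch and a few fakes. This is an assumption-laden illustration rather than the repo's postJson: postJsonSketch and FetchLike are hypothetical, and FetchLike models only the small slice of the fetch response that the sketch reads.

```typescript
type Result<T, E> = { ok: true; value: T } | { ok: false; error: E };

// Narrow, hypothetical view of fetch: just the parts this sketch touches.
type FetchLike = (
  url: string,
  init: { method: string; headers: Record<string, string>; body: string },
) => Promise<{ ok: boolean; status: number; json: () => Promise<unknown> }>;

// POST JSON and never throw: every transport failure becomes an err.
const postJsonSketch = async (
  fetchImpl: FetchLike,
  url: string,
  body: unknown,
): Promise<Result<unknown, Error>> => {
  try {
    const res = await fetchImpl(url, {
      method: 'POST',
      headers: { 'content-type': 'application/json' },
      body: JSON.stringify(body),
    });
    if (!res.ok) return { ok: false, error: new Error(`HTTP ${res.status}`) };
    return { ok: true, value: await res.json() }; // a json() rejection lands in catch
  } catch (cause) {
    return {
      ok: false,
      error: cause instanceof Error ? cause : new Error(String(cause)),
    };
  }
};

// Fakes covering the success path and the three failure paths named above.
const fakeOk: FetchLike = async () => ({
  ok: true,
  status: 200,
  json: async () => ({ text: 'hello' }),
});
const fakeNon2xx: FetchLike = async () => ({
  ok: false,
  status: 500,
  json: async () => ({}),
});
const fakeNetworkThrow: FetchLike = async () => {
  throw new Error('network down');
};
const fakeBadJson: FetchLike = async () => ({
  ok: true,
  status: 200,
  json: async () => {
    throw new SyntaxError('unparseable body');
  },
});
```

Because the fetch is a parameter, each fake is an ordinary function and no socket is opened. The value returned on success is still unknown, which is why the kernel validates it again before handing it to the caller.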

1. A transcription request supplies both audioUrl and audioBase64. What happens?

Answer: the request is rejected with err. The schema's .refine is an XOR: (audioUrl===undefined) !== (audioBase64===undefined) is true only when exactly one is present. Supplying both (or neither) fails closed: it never throws and never reaches the backend.

2. Why do success/error not appear as fields on TranscriptionResult?

Answer: the Python flat envelope's success/error become the Result's ok/err. The benefit is that an empty transcript (real silence) is ok({text:''}), never confused with a failure, which is err.

3. Why was the source's local faster-whisper transcription path marked IGNORE in the fusion?

Answer: the matrix's CLONE/ADAPT/MERGE/IGNORE dispositions are deliberate. The local ML path doesn't translate to a dependency-free TS kernel, so it's explicitly ignored; the cloud STT dispatch, which is just "POST audio to an endpoint", becomes one TranscriptionBackend port.

Common confusions

"This wires up a real transcription provider." No — it's the STRUCTURE ported to ports-and-injection. The kernel imports no SDK; createFetchTranscriptionBackend/createFetchVisionBackend are thin generic-JSON seams over an injectable fetch, so tests never open a socket.

"Empty transcript means it failed." No — an empty text is legitimate (silence) and arrives as ok({text:''}). A failure is an err. Keeping success/error out of the payload and in the Result is exactly what makes the two unambiguous.

You've now seen the pattern seven times. Memory, learning, curator, clarify, web, skills, media — every shipped @alembic/hermes subsystem obeys the same discipline: inject the ports, return Result, validate untrusted input with Zod, never throw, no Date.now()/Math.random(). That's Lesson 5 made concrete, seven times over. Re-read Lesson 5 now and it should read like a summary of everything above.


Sources (all in the repo, read verbatim):
· packages/hermes/src/media/media.ts — transcribe (51–68), analyzeImage identical shape (76–93).
· packages/hermes/src/media/types.ts — transcriptionRequestSchema XOR refine (62–74), result envelopes text/provider/analysis (85–121), TranscriptionBackend/VisionBackend ports (133–148), local faster-whisper IGNORED (15–24).
· packages/hermes/src/media/fetch-backends.ts — injectable global fetch (90/117), postJson fail-closed (139–155), mapTranscriptionRow/mapVisionRow field fallbacks (163–178).
· packages/hermes/src/media/media.test.ts — 20 cases incl. neither/both audio source (50–71), malformed result (86–95, 131–138), fake-fetch transport errors (207–248), non-string field fails closed (248).
· CLONE provenance: docs/hermes-complete-map.md §3.5 + §3.6; fusion matrix IGNORE of the local ML path.