How I work with Claude

A method, not magic.

I treat AI pair-engineering as a discipline, not a shortcut: written specs, verification gates, and harvested learnings. Claude multiplies my output only because the process around it is strict. I built this portfolio with Claude Design and Claude Code, reviewed it commit by commit, and the page you are reading is that process running in public.

Principles

Spec before prompt

I write the brief — schema, constraints, acceptance criteria — before Claude generates anything. On the Zelebrix RAG build I specified hybrid retrieval, PII scrubbing, and local data queries up front, then had Claude implement against that.

PREVENTS: the model inventing requirements, and a payments tenant’s data leaking past a boundary I never named.
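
As an illustration only, a brief like this can be held as structured data so that nothing is generated while a section is still empty. The `Brief` class, its field names, and `render_prompt` are hypothetical names chosen for this sketch, not the format used on the Zelebrix build.

```python
from dataclasses import dataclass, field, fields


@dataclass
class Brief:
    """Hypothetical container for a brief; every section must be filled in."""
    problem: str = ""
    schema: dict = field(default_factory=dict)
    constraints: list = field(default_factory=list)
    acceptance_criteria: list = field(default_factory=list)

    def missing_sections(self) -> list:
        # An empty string, dict, or list counts as "not specified yet".
        return [f.name for f in fields(self) if not getattr(self, f.name)]


def render_prompt(brief: Brief) -> str:
    """Refuse to build a prompt until the brief is complete."""
    missing = brief.missing_sections()
    if missing:
        raise ValueError(f"brief incomplete, missing: {', '.join(missing)}")
    lines = [f"Problem: {brief.problem}", "Schema:"]
    lines += [f"- {name}: {type_}" for name, type_ in brief.schema.items()]
    lines.append("Constraints:")
    lines += [f"- {c}" for c in brief.constraints]
    lines.append("Acceptance criteria:")
    lines += [f"- {a}" for a in brief.acceptance_criteria]
    return "\n".join(lines)
```

The point of the guard is ordering: an unnamed constraint fails loudly before generation instead of surfacing later as an invented requirement.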

Claude proposes, evals dispose

Acceptance is decided by tests, not by how the output reads. At Gestamp, my golden dataset with ground-truth results gates every Text-to-SQL deploy, which must score 100% on the controlled tests; at Zelebrix, a 150-question golden set plus a retrieval probe scores each change.

PREVENTS: “looks right” shipping, and a wrong SQL answer breaking trust silently.
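
A minimal sketch of a golden-set gate, assuming each case pairs a question with its ground-truth result rows and that `answer` is whatever function runs the Text-to-SQL pipeline. `GoldenCase`, `deploy_gate`, and the row-set comparison are assumptions made for illustration; this is not the Gestamp framework itself.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Sequence


@dataclass(frozen=True)
class GoldenCase:
    question: str
    expected_rows: frozenset  # ground-truth result rows, compared order-insensitively


def deploy_gate(
    cases: Sequence[GoldenCase],
    answer: Callable[[str], Iterable[tuple]],
    required_pass_rate: float = 1.0,
):
    """Return (passed, failed_questions) for a candidate build."""
    if not cases:
        raise ValueError("an empty golden set cannot gate a deploy")
    failures = [
        case.question
        for case in cases
        if frozenset(answer(case.question)) != case.expected_rows
    ]
    pass_rate = (len(cases) - len(failures)) / len(cases)
    return pass_rate >= required_pass_rate, failures
```

Comparing result rows rather than SQL text means two differently written queries that return the same rows both pass, while a fluent but wrong query fails.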

Measure before you blame the model

I make Claude prove the bottleneck with data before I let it tune anything. On Zelebrix an LLM benchmark showed the failure was retrieval — synonym gaps — not the generation model.

PREVENTS: effort flowing to the loudest suspect, and the real defect surviving a redesign that changed nothing.
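
One way to make that attribution concrete is to label each failing question by the stage that lost it. The sketch below assumes every golden question lists the document IDs that should be retrieved; `attribute_failures`, `retrieve`, and `answer_is_correct` are hypothetical names, and this is not the benchmark that was run on Zelebrix.

```python
from collections import Counter
from typing import Callable, Iterable, Sequence, Tuple


def attribute_failures(
    cases: Iterable[Tuple[str, set]],
    retrieve: Callable[[str], Sequence[str]],
    answer_is_correct: Callable[[str], bool],
    k: int = 5,
) -> Counter:
    """Count outcomes per stage: retrieval_miss, generation_error, or ok."""
    counts = Counter()
    for question, relevant_ids in cases:
        top_k = set(retrieve(question)[:k])
        if not top_k & set(relevant_ids):
            # No relevant document reached the generator, so it cannot be blamed.
            counts["retrieval_miss"] += 1
        elif not answer_is_correct(question):
            # The right context was available and the answer was still wrong.
            counts["generation_error"] += 1
        else:
            counts["ok"] += 1
    return counts
```

If most failures land in `retrieval_miss`, for instance because a question says "bill" where the indexed text says "invoice", swapping the generation model would not move the score.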

Review every commit

I read AI-assisted work commit by commit before it lands, the same way I review intern PRs at Gestamp and give architectural guidance. Claude drafts; I accept, reject, or rewrite each diff.

PREVENTS: unreviewed changes accreting, and a subtle regression in retrieval or impersonation shipping under my name.

Harvest everything

Every build yields a reusable artifact: the Gestamp eval framework, the Zelebrix RAGAS-lite scorer, the IDEA ingestion and retrieval libraries. Claude helps generalize one project’s harness into the next.

PREVENTS: hard-won workflows dying in old repos — exactly the gap my planned skills-and-harnesses collection closes.

The workflow

Frame

GUILLERMO: Writes the problem, constraints, and acceptance criteria.

CLAUDE: Restates the brief and surfaces ambiguities.

EXIT ARTIFACT: an agreed spec with named failure modes.

Scaffold

GUILLERMO: Defines the module contracts (endpoint factory, ingestion/retrieval libraries).

CLAUDE: Generates the first implementation.

EXIT ARTIFACT: a running skeleton that compiles and wires together.

Iterate

GUILLERMO: Asks for small, single-purpose changes.

CLAUDE: Edits one slice and runs it.

EXIT ARTIFACT: a reviewed commit that does one thing.

Verify

GUILLERMO: Writes the eval set and edge cases.

CLAUDE: Generates the test harness and runs it.

EXIT ARTIFACT: a green run on the golden set before deploy.

Harvest

GUILLERMO: Names what should outlive the project.

CLAUDE: Extracts it into a reusable skill or library.

EXIT ARTIFACT: a harness ready for the next build.

DISCIPLINE

Where it breaks — and what I do about it:

Claude over-trusts its first architecture. I demand alternatives with trade-offs and benchmark them — the Zelebrix LLM comparison settled a design call with data, not preference.
It cannot judge its own retrieval quality. I own eval design — golden sets, ground-truth results, retrieval probes — because a model grading itself reports comfort, not correctness.
It does not know my privacy and permission boundaries. I specify and verify them by hand — PII scrubbing, local data queries, RBAC impersonation — because a leaked boundary is not something a prompt fixes after the fact.
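
To illustrate the "specify and verify" pairing for one of these boundaries, here is a minimal scrub-then-check sketch. The two regular expressions, the placeholder tokens, and the function names are assumptions for the example; real PII coverage needs far more than emails and phone-like numbers, and this is not the scrubber used on any of the projects above.

```python
import re

# Illustrative patterns only: emails and phone-like digit runs.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def scrub(text: str) -> str:
    """Replace matches with placeholder tokens before text is indexed or sent to a model."""
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)


def find_pii(text: str) -> list:
    """Verification step: return anything that still matches a pattern."""
    return EMAIL.findall(text) + PHONE.findall(text)


raw = "Contact ana@example.com or +34 600 123 456"
clean = scrub(raw)
leaks = find_pii(clean)  # an empty list means this check found nothing left to scrub
```

Keeping the check separate from the scrub lets it run as a test over stored or outgoing text, so the boundary is verified rather than assumed.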

Toolchain

Claude Code	The implementation loop: scaffolding, edits, and the commit-by-commit reviews I gate every change through.
Claude Design	Visual and UX iteration — this site is the demo, built without prior front-end stack knowledge.
Skills & harnesses	Codified practice: my eval frameworks, retrieval libraries, and scorers, carried forward between builds.
Evals / RAGAS	Acceptance gates — golden sets, retrieval probes, and the RAGAS module wired into the IDEA platform.
Langfuse	Observability on the IDEA RAG platform: tracing what the agents actually did in production.

Worked example

A full annotated build log from the LLMwiki project is coming. Meanwhile, the working-with-Claude notes on each project page (Zelebrix, IDEA, and the Gestamp agents) show this method in action on shipped work.

A spec the generator cannot deviate from.