Gestamp Agents

Production multi-agent assistant that answers staff questions from governed enterprise data.

Shipped
LAST UPDATED: 2026-06
ROLE: Built and extended the multi-agent system; set plugin contribution criteria

Gestamp, Madrid; internal data and schemas kept confidential.

Problem

Staff needed answers that lived across disconnected systems: structured tables in Databricks, documents in Blob Storage, pages in SharePoint, and tickets in a Service Desk. Finding one answer meant knowing which system held it, having the right permissions, and translating a business question into a query. That work fell to whoever could read schemas and write SQL. The bottleneck was access and translation, not data: the answers already existed, but reaching them was slow and uneven.

Constraints

Every query must run under the requesting user’s permissions.
Client data and schemas stay confidential; no internal details exposed.
Answer correctness must be verified before each deploy, not discovered afterward.
Must fit the corporate Azure and Databricks stack.
Secrets managed centrally, never in code or config.

Approach

I worked on a production multi-agent system built with Semantic Kernel and extended it with new plugins so one assistant could reach several systems. I added five plugins to the tool chain: Text-to-SQL over Databricks, document extraction from Blob Storage, translation, SharePoint lookup, and a Service Desk SOAP integration. All five were integrated in under four weeks.

The Text-to-SQL agent carries an enriched metadata layer (business descriptions and worked query examples) because the model needs business context, not just column names, to map a question to the right tables. I built a custom evaluation framework with a golden dataset of ground-truth results, and the agent reaches 100% accuracy on those controlled tests before every deploy, because an SQL agent that is confidently wrong is worse than no agent. I implemented RBAC impersonation so every Databricks query runs under the requesting user’s identity and permissions; the agent never widens what a person is allowed to see.
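As a minimal sketch of what an enriched metadata layer could look like, the snippet below attaches business descriptions and one worked example to a table and renders them as prompt context. The table, its columns, and the `render_context` helper are hypothetical and invented for illustration; the real schemas are confidential and the actual layer may be structured differently.

```python
from dataclasses import dataclass, field


@dataclass
class ColumnMeta:
    name: str
    description: str  # business meaning, not just the data type


@dataclass
class TableMeta:
    name: str
    description: str
    columns: list[ColumnMeta]
    examples: list[tuple[str, str]] = field(default_factory=list)  # (question, SQL)


def render_context(tables: list[TableMeta]) -> str:
    """Render the metadata layer as plain text to prepend to a SQL-generation prompt."""
    lines: list[str] = []
    for table in tables:
        lines.append(f"Table {table.name}: {table.description}")
        for col in table.columns:
            lines.append(f"  - {col.name}: {col.description}")
        for question, sql in table.examples:
            lines.append(f"  Example question: {question}")
            lines.append(f"  Example SQL: {sql}")
    return "\n".join(lines)


# Hypothetical table, invented for illustration only.
ORDERS = TableMeta(
    name="sales.orders",
    description="One row per customer order line.",
    columns=[
        ColumnMeta("order_ts", "When the order was placed (UTC)."),
        ColumnMeta("net_eur", "Order value in euros, after discounts."),
    ],
    examples=[
        (
            "What was total revenue in 2024?",
            "SELECT SUM(net_eur) FROM sales.orders WHERE year(order_ts) = 2024",
        ),
    ],
)

context = render_context([ORDERS])
```

Keeping descriptions and examples next to the schema means a new table can be onboarded by adding metadata rather than rewriting the prompt.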

The chatbot is served by FastAPI endpoints with async token streaming, Pydantic-validated I/O, and end-to-end user impersonation on every request.
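The request flow can be sketched without the web framework. The snippet below is a framework-free stand-in, not the production endpoint: a stdlib dataclass plays the role of a Pydantic model, and a plain async generator plays the role of a streaming response. `ChatRequest`, `fake_model_tokens`, and `stream_answer` are hypothetical names.

```python
import asyncio
from collections.abc import AsyncIterator
from dataclasses import dataclass


@dataclass(frozen=True)
class ChatRequest:
    user_id: str   # identity the request is executed under
    question: str

    def __post_init__(self) -> None:
        # Validate input up front, as a Pydantic model would in the real service.
        if not self.user_id.strip():
            raise ValueError("user_id is required for impersonation")
        if not self.question.strip():
            raise ValueError("question must not be empty")


async def fake_model_tokens(question: str, as_user: str) -> AsyncIterator[str]:
    """Stand-in for a streaming model call made on behalf of `as_user`.

    It simply yields the question back word by word.
    """
    for token in question.split():
        await asyncio.sleep(0)  # hand control back to the loop, as a network read would
        yield token


async def stream_answer(request: ChatRequest) -> AsyncIterator[str]:
    """Forward tokens as they arrive instead of buffering the full answer."""
    async for token in fake_model_tokens(request.question, as_user=request.user_id):
        yield token


async def collect(request: ChatRequest) -> list[str]:
    return [chunk async for chunk in stream_answer(request)]


chunks = asyncio.run(collect(ChatRequest(user_id="u123", question="open tickets today")))
```

The user identity travels with the validated request object, so every downstream call receives it explicitly rather than reading it from shared state.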

GOVERNED

Claude did

Scaffolded plugin boilerplate and FastAPI endpoint shapes, drafted Pydantic models, and proposed first-pass SQL-generation prompts and refactors I could review and pare back.

Guillermo did

Made the architecture calls — plugin contracts, the impersonation model, the metadata-layer design — and built the evaluation framework and its golden dataset. I decided what “correct” meant; the model proposed, the evals decided.

One exchange

An early Text-to-SQL prompt produced queries that looked plausible but returned the wrong rows on edge cases. Rather than tune the prompt by feel, I wrote a golden dataset with ground-truth results first, then iterated against it until the controlled tests passed at 100%. That eval-before-prompt habit is now how I gate every SQL change, because without it a confident-but-wrong query can ship unnoticed.
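The eval-before-prompt loop can be sketched as a small gate. This is an illustration under stated assumptions, not the actual framework: an in-memory SQLite table stands in for Databricks, the `orders` data is invented, and `candidate` is a hypothetical stand-in for the Text-to-SQL agent.

```python
import sqlite3
from collections.abc import Callable

# Golden cases: a question paired with its ground-truth rows (invented data).
GOLDEN = [
    ("How many orders came from the north region?", [(2,)]),
    ("Which customers ordered from the south region?", [("Carla",)]),
]


def make_db() -> sqlite3.Connection:
    """Build a tiny in-memory table that stands in for the real warehouse."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (customer TEXT, region TEXT)")
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?)",
        [("Ana", "north"), ("Ben", "north"), ("Carla", "south")],
    )
    return conn


def evaluate(generate_sql: Callable[[str], str], conn: sqlite3.Connection) -> float:
    """Return the fraction of golden cases whose result rows match ground truth."""
    passed = 0
    for question, expected in GOLDEN:
        try:
            rows = conn.execute(generate_sql(question)).fetchall()
        except sqlite3.Error:
            continue  # invalid SQL counts as a failure
        # Compare sorted rows so row order does not matter.
        if sorted(rows) == sorted(expected):
            passed += 1
    return passed / len(GOLDEN)


def candidate(question: str) -> str:
    """Stand-in for the agent: returns canned SQL per question."""
    canned = {
        GOLDEN[0][0]: "SELECT COUNT(*) FROM orders WHERE region = 'north'",
        GOLDEN[1][0]: "SELECT customer FROM orders WHERE region = 'south'",
    }
    return canned[question]


accuracy = evaluate(candidate, make_db())
deploy_allowed = accuracy == 1.0  # anything below 100% blocks the deploy
```

Comparing returned rows rather than SQL text matters here: two different queries can both be correct, and a plausible-looking query can still return the wrong rows.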

Stack & Architecture

Models / Orchestration: Azure OpenAI · Semantic Kernel multi-agent orchestration
Retrieval: hybrid RAG on Azure AI Search (semantic + BM25 + cross-encoder reranking), permission-aware
Data: Databricks (Text-to-SQL) · Azure Blob Storage · SharePoint · MongoDB (selective plugin memory)
API: FastAPI (async token streaming, Pydantic-validated I/O, per-request impersonation)
Infra: Azure AKS · Azure DevOps CI/CD · Azure Key Vault

A single conversational agent routes each request to the right plugin — Text-to-SQL, document extraction, translation, SharePoint lookup, or Service Desk — under the requesting user’s identity. Structured questions go to Databricks via the metadata-enriched Text-to-SQL agent; document and knowledge questions go through the hybrid RAG pipeline on Azure AI Search, which fuses semantic and BM25 retrieval and reranks with a cross-encoder, retrieving only what the user is permitted to see. Secrets resolve through Key Vault; the service runs on AKS and ships through Azure DevOps.
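To make the fusion step concrete, here is a minimal sketch that filters two ranked lists by permission and merges them with reciprocal rank fusion (RRF). RRF is a common fusion method chosen here for illustration; the text above does not state which method the pipeline uses. The document ids, the group-based ACL, and the client-side filter are assumptions (in practice the filter would more likely run inside the search service), and the cross-encoder rerank is omitted.

```python
from collections import defaultdict


def permitted(doc_ids: list[str], acl: dict[str, set[str]], user_groups: set[str]) -> list[str]:
    """Keep documents that share at least one group with the user, preserving rank order."""
    return [d for d in doc_ids if acl.get(d, set()) & user_groups]


def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal rank fusion: score(d) = sum of 1 / (k + rank) over rankings, rank from 1."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest score first; ties broken by id for a deterministic order.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical document ids and access-control lists.
ACL = {"doc_a": {"hr"}, "doc_b": {"finance"}, "doc_c": {"hr", "finance"}}
semantic = ["doc_b", "doc_a", "doc_c"]  # vector-search order
bm25 = ["doc_a", "doc_c", "doc_b"]      # keyword-search order

groups = {"hr"}
fused = rrf_fuse([permitted(semantic, ACL, groups), permitted(bm25, ACL, groups)])
```

Filtering before fusion means a document the user cannot read never influences the ranking, let alone reaches the model's context.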

Outcome

100% accuracy on the controlled golden-dataset tests, run before every deploy.
New tool chain (five plugins) integrated in under four weeks.
Every Databricks query runs under the requesting user’s permissions via RBAC impersonation.
In production use. [PLACEHOLDER — Guillermo to confirm breadth and figures: e.g. teams/users served, assistant queries handled per week, time-to-answer before vs. after.]

Lessons

Retrieval and SQL correctness are eval problems before they are model problems.
Give the model business context, not just schema; descriptions and examples beat column names.
Impersonate on every request — permissions belong at the data call, not the prompt.
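The last lesson can be illustrated with a toy data layer in which the permission check sits at the data call. This is only a sketch of where the check belongs: the real system relies on the data platform's own access control under the user's identity, and `UserContext`, `DataGateway`, and the table names are hypothetical.

```python
from dataclasses import dataclass


class PermissionDenied(Exception):
    """Raised when a user asks for a table they are not allowed to read."""


@dataclass(frozen=True)
class UserContext:
    user_id: str
    readable_tables: frozenset[str]  # hypothetical: what the platform grants this user


class DataGateway:
    """Toy data layer: every read must carry the requesting user's context."""

    def __init__(self, tables: dict[str, list[dict]]) -> None:
        self._tables = tables

    def read(self, table: str, as_user: UserContext) -> list[dict]:
        # The check lives at the data call, so no prompt wording can bypass it.
        if table not in as_user.readable_tables:
            raise PermissionDenied(f"{as_user.user_id} may not read {table}")
        return self._tables[table]


gateway = DataGateway({"tickets": [{"id": 1}], "payroll": [{"id": 9}]})
analyst = UserContext("u123", frozenset({"tickets"}))

rows = gateway.read("tickets", as_user=analyst)  # allowed
try:
    gateway.read("payroll", as_user=analyst)  # denied, whatever the prompt said
    denied = False
except PermissionDenied:
    denied = True
```

Because `read` has no default user, a caller cannot accidentally fall back to a service-wide identity.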

Artifacts

Architecture diagram — [PLACEHOLDER — Guillermo to add sanitized diagram] ON-REQUEST
Repository — [PLACEHOLDER — client-internal] CONFIDENTIAL

Next steps

Broaden the golden dataset to cover more business domains before each deploy.
Extend plugin contribution criteria so reviewed intern PRs can add tools safely.