Gestamp Agents
Production multi-agent assistant that answers staff questions from governed enterprise data.
Gestamp, Madrid; internal data and schemas kept confidential.
Problem
Staff needed answers that lived across disconnected systems: structured tables in Databricks, documents in Blob Storage, pages in SharePoint, and tickets in a Service Desk. Finding one answer meant knowing which system held it, having the right permissions, and translating a business question into a query. That work fell to whoever could read schemas and write SQL. The bottleneck was access and translation, not data: the answers already existed, but reaching them was slow and uneven.
Constraints
- Every query must run under the requesting user’s permissions.
- Client data and schemas stay confidential; no internal details exposed.
- Answer correctness must be verified before deploy, not discovered after.
- Must fit the corporate Azure and Databricks stack.
- Secrets managed centrally, never in code or config.
Approach
I worked on a production multi-agent system built with Semantic Kernel and extended it with new plugins so one assistant could reach several systems. I added five plugins to the tool chain: Text-to-SQL over Databricks, document extraction from Blob Storage, translation, SharePoint lookup, and a Service Desk SOAP service. All five were integrated in under four weeks.
The Text-to-SQL agent carries an enriched metadata layer (business descriptions and worked query examples) because the model needs business context, not just column names, to map a question to the right tables. I built a custom evaluation framework around a golden dataset of ground-truth results and required 100% accuracy on those controlled tests before every deploy, because an SQL agent that is confidently wrong is worse than no agent. I implemented RBAC impersonation so every Databricks query runs under the requesting user’s identity and permissions; the agent never widens what a person is allowed to see.
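A minimal sketch of what a metadata layer of this kind could look like, assuming it is rendered as plain text into the SQL-generation prompt. The class names, the `render_context` helper, and the `sales.orders` table are hypothetical illustrations; the real schemas and prompt format are confidential and not shown here.

```python
from dataclasses import dataclass, field


@dataclass
class ColumnMeta:
    name: str
    description: str  # business meaning, not just the column type


@dataclass
class TableMeta:
    name: str
    description: str
    columns: list[ColumnMeta]
    # Worked examples as (question, SQL) pairs.
    examples: list[tuple[str, str]] = field(default_factory=list)


def render_context(tables: list[TableMeta]) -> str:
    """Render table metadata as plain text for a SQL-generation prompt."""
    lines: list[str] = []
    for table in tables:
        lines.append(f"Table {table.name}: {table.description}")
        for col in table.columns:
            lines.append(f"  - {col.name}: {col.description}")
        for question, sql in table.examples:
            lines.append(f"  Example question: {question}")
            lines.append(f"  Example SQL: {sql}")
    return "\n".join(lines)


# Hypothetical table, invented for illustration only.
orders = TableMeta(
    name="sales.orders",
    description="One row per customer order line.",
    columns=[
        ColumnMeta("plant_id", "Plant that produced the part"),
        ColumnMeta("qty", "Units ordered"),
    ],
    examples=[(
        "How many units were ordered from plant 7?",
        "SELECT SUM(qty) FROM sales.orders WHERE plant_id = 7",
    )],
)

prompt_context = render_context([orders])
print(prompt_context)
```

Keeping descriptions and examples next to the table definition means a schema change and its business explanation can be reviewed together.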
The chatbot is served by FastAPI endpoints with async token streaming, Pydantic-validated I/O, and end-to-end user impersonation on every request.
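The streaming shape can be sketched with the standard library alone. This is an assumption-laden illustration, not the production endpoint: `fake_model_tokens` stands in for the model client, and server-sent-event framing with a `[DONE]` sentinel is one possible wire format that the original does not specify. In FastAPI, an async generator like `sse_stream` can be handed to `StreamingResponse`.

```python
import asyncio
import json
from typing import AsyncIterator


async def fake_model_tokens(prompt: str) -> AsyncIterator[str]:
    """Hypothetical stand-in for a streaming model client: yields words one by one."""
    for word in prompt.split():
        await asyncio.sleep(0)  # hand control back to the event loop, as a network read would
        yield word


async def sse_stream(prompt: str) -> AsyncIterator[str]:
    """Frame each token as a server-sent event, then emit an end-of-stream marker."""
    async for token in fake_model_tokens(prompt):
        yield f"data: {json.dumps({'token': token})}\n\n"
    yield "data: [DONE]\n\n"


async def collect(prompt: str) -> list[str]:
    """Drain the stream into a list, as a test client might."""
    return [event async for event in sse_stream(prompt)]


events = asyncio.run(collect("hello streaming world"))
for event in events:
    print(event, end="")
```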
GOVERNED
Claude did
Scaffolded plugin boilerplate and FastAPI endpoint shapes, drafted Pydantic models, and proposed first-pass SQL-generation prompts and refactors I could review and pare back.
Guillermo did
Made the architecture calls — plugin contracts, the impersonation model, the metadata-layer design — and built the evaluation framework and its golden dataset. I decided what “correct” meant; the model proposed, the evals decided.
One exchange
An early Text-to-SQL prompt produced queries that looked plausible but returned the wrong rows on edge cases. Rather than tune the prompt by feel, I first built a golden dataset with ground-truth results, then iterated against it until the controlled tests passed at 100%. That eval-before-prompt habit is now how I gate every SQL change; without it, a confident-but-wrong query ships unnoticed.
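A minimal sketch of an eval-before-prompt gate of this kind, under stated assumptions: an in-memory SQLite database stands in for Databricks, `fake_generate_sql` stands in for the Text-to-SQL agent, and the `tickets` table and questions are invented for illustration. The comparison is on returned rows, not on SQL text, so two differently written queries that return the same rows both pass.

```python
import sqlite3
from dataclasses import dataclass


@dataclass
class GoldenCase:
    question: str
    expected_rows: list[tuple]  # ground-truth result for the question


def fake_generate_sql(question: str) -> str:
    """Hypothetical stand-in for the Text-to-SQL agent."""
    canned = {
        "How many open tickets are there?":
            "SELECT COUNT(*) FROM tickets WHERE status = 'open'",
        "Which tickets are closed?":
            "SELECT id FROM tickets WHERE status = 'closed' ORDER BY id",
    }
    return canned[question]


def evaluate(conn, cases, generate_sql) -> float:
    """Run each generated query and compare its rows with the ground truth."""
    passed = 0
    for case in cases:
        try:
            rows = conn.execute(generate_sql(case.question)).fetchall()
        except sqlite3.Error:
            rows = None  # a query that fails to run counts as a miss
        if rows is not None and sorted(rows) == sorted(case.expected_rows):
            passed += 1
    return passed / len(cases)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tickets (id INTEGER, status TEXT)")
conn.executemany(
    "INSERT INTO tickets VALUES (?, ?)",
    [(1, "open"), (2, "closed"), (3, "open")],
)

golden = [
    GoldenCase("How many open tickets are there?", [(2,)]),
    GoldenCase("Which tickets are closed?", [(2,)]),
]

accuracy = evaluate(conn, golden, fake_generate_sql)
deploy_allowed = accuracy == 1.0  # gate: any miss blocks the deploy
print(f"accuracy={accuracy:.0%} deploy_allowed={deploy_allowed}")
```

Wired into CI, a check like `deploy_allowed` would fail the pipeline step whenever a prompt or metadata change breaks a golden case.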
Stack & Architecture
- Models / Orchestration: Azure OpenAI · Semantic Kernel multi-agent orchestration
- Retrieval: hybrid RAG on Azure AI Search (semantic + BM25 + cross-encoder reranking), permission-aware
- Data: Databricks (Text-to-SQL) · Azure Blob Storage · SharePoint · MongoDB (selective plugin memory)
- API: FastAPI (async token streaming, Pydantic-validated I/O, per-request impersonation)
- Infra: Azure AKS · Azure DevOps CI/CD · Azure Key Vault
A single conversational agent routes each request to the right plugin — Text-to-SQL, document extraction, translation, SharePoint lookup, or Service Desk — under the requesting user’s identity. Structured questions go to Databricks via the metadata-enriched Text-to-SQL agent; document and knowledge questions go through the hybrid RAG pipeline on Azure AI Search, which fuses semantic and BM25 retrieval and reranks with a cross-encoder, retrieving only what the user is permitted to see. Secrets resolve through Key Vault; the service runs on AKS and ships through Azure DevOps.
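The fusion step can be illustrated with reciprocal rank fusion (RRF), a common way to merge a semantic ranking with a BM25 ranking. RRF is an assumption here; the write-up does not state which fusion method is used. The document ids, group names, and the `permitted` helper are hypothetical. In a real deployment the permission check would more likely be a filter inside the search query than a post-filter, and the cross-encoder reranking step is left out.

```python
from collections import defaultdict


def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal rank fusion: score(d) = sum over rankings of 1 / (k + rank)."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest score first; ties broken by id so the output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


def permitted(doc_ids: list[str], acl: dict[str, set[str]], user_groups: set[str]) -> list[str]:
    """Keep only documents whose ACL shares at least one group with the user."""
    return [d for d in doc_ids if acl.get(d, set()) & user_groups]


# Hypothetical rankings from the two retrievers.
semantic = ["doc_a", "doc_b", "doc_c"]
bm25 = ["doc_b", "doc_d", "doc_a"]
acl = {
    "doc_a": {"hr"},
    "doc_b": {"finance"},
    "doc_c": {"hr"},
    "doc_d": {"finance"},
}

fused = rrf_fuse([semantic, bm25])
visible = permitted(fused, acl, user_groups={"finance"})
print(fused)    # ['doc_b', 'doc_a', 'doc_d', 'doc_c']
print(visible)  # ['doc_b', 'doc_d']
```

Documents that rank well in both lists (`doc_b`, `doc_a`) rise above those found by only one retriever, and a document with no ACL entry is treated as not visible.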
Outcome
- 100% accuracy on the controlled golden-dataset tests, run before every deploy.
- New tool chain (five plugins) integrated in under four weeks.
- Every Databricks query runs under the requesting user’s permissions via RBAC impersonation.
- In production use. [PLACEHOLDER — Guillermo to confirm breadth and figures: e.g. teams/users served, assistant queries handled per week, time-to-answer before vs. after.]
Lessons
- Retrieval and SQL correctness are eval problems before they are model problems.
- Give the model business context, not just schema; descriptions and examples beat column names.
- Impersonate on every request — permissions belong at the data call, not the prompt.
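The last lesson can be sketched as a data-access wrapper that refuses to run without the caller's own token, so there is no path that falls back to a shared service credential. All names here (`UserContext`, `ImpersonatingExecutor`, `fake_connect`) are hypothetical, and the connector is a stub that records the token it was given; it is not the Databricks client API.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class UserContext:
    user_id: str
    access_token: Optional[str]


class MissingIdentityError(Exception):
    """Raised when a data call is attempted without a user token."""


class ImpersonatingExecutor:
    def __init__(self, connect: Callable[[str], Callable[[str], list]]):
        # `connect` takes a user token and returns a function that runs SQL.
        self._connect = connect

    def run(self, sql: str, user: UserContext) -> list:
        if not user.access_token:
            # Deliberately no fallback to a shared service credential.
            raise MissingIdentityError(user.user_id)
        run_sql = self._connect(user.access_token)
        return run_sql(sql)


# Hypothetical connector that records which token each query ran under.
seen_tokens: list[str] = []


def fake_connect(token: str) -> Callable[[str], list]:
    def run_sql(sql: str) -> list:
        seen_tokens.append(token)
        return [("ok",)]
    return run_sql


executor = ImpersonatingExecutor(fake_connect)
rows = executor.run("SELECT 1", UserContext("alice", "token-alice"))
print(rows, seen_tokens)
```

Because the check sits at the data call, a prompt that asks the agent to ignore permissions changes nothing: the query still runs with the requesting user's token or not at all.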
Artifacts
- Architecture diagram — [PLACEHOLDER — Guillermo to add sanitized diagram] ON-REQUEST
- Repository — [PLACEHOLDER — client-internal] CONFIDENTIAL
Next
- Broaden the golden dataset to cover more business domains before each deploy.
- Extend the plugin contribution criteria so that interns’ PRs, once reviewed, can add tools safely.