RAG Zelebrix

RAG chatbot answering from a multi-tenant payments SaaS knowledge base.

In Progress LAST UPDATED: [PLACEHOLDER — Guillermo to confirm: last-updated date] ROLE: Designed and led the build — AI-assisted, reviewed commit by commit

Zelebrix is a multi-tenant payments SaaS; tenant data is sensitive and must stay inside the company’s own systems. Guillermo joined to lead this AI feature. The work is currently an unpaid collaboration, with an ongoing relationship likely to follow, and the feature is scheduled to deploy to production in a few months.

Problem

Users of a multi-tenant payments platform needed answers drawn from two sources: product documentation and their own tenant data. A general-purpose chatbot could not be trusted with either. Semantic search alone missed exact Spanish terms and synonyms, and a naive setup would also send tenant records to an external model. The chatbot had to be accurate in Spanish, honest about what it did not know, and architecturally incapable of leaking tenant data to the external LLM.

Constraints

External LLM must never see tenant data.
Spanish-first retrieval; exact terms and synonyms must match.
Queries over tenant data execute locally, inside the company’s own systems.
Must run on a CPU, GPU, or remote compute backend, selected by configuration.
Every retrieval stage needs a kill-switch and fallback.

Approach

I designed a hybrid retrieval pipeline rather than relying on a single method, because dense search alone was losing exact Spanish terms.

Hybrid retrieval. Dense pgvector similarity search and Spanish Postgres full-text search run in parallel, and their rankings are fused with Reciprocal Rank Fusion (RRF). RRF was chosen because dense and lexical retrieval fail on different queries, and it combines both rankings from rank positions alone, without tuning a weight per query.
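The fusion step can be sketched in a few lines. This is a minimal illustration, not the project's code: the function name, the document ids, and the constant k = 60 (the value commonly used with RRF) are assumptions. Each document scores the sum of 1 / (k + rank) over every ranking it appears in.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def rrf_fuse(rankings: Iterable[List[str]], k: int = 60) -> List[str]:
    """Fuse ranked lists of document ids with Reciprocal Rank Fusion."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        # Ranks are 1-based: the top hit contributes 1 / (k + 1).
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the order is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical result lists from the two retrievers.
dense = ["doc_a", "doc_b", "doc_c"]
lexical = ["doc_c", "doc_a", "doc_d"]
fused = rrf_fuse([dense, lexical])
```

Documents that both retrievers return rise to the top, while a document found by only one retriever still survives as a candidate for the reranker.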

Rerank and gate. A cross-encoder reranker reorders the fused candidates, and a relevance gate decides whether the context is strong enough to answer at all. When it is not, the bot abstains instead of inventing, because a confident wrong answer in payments is worse than “I don’t know.”
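One way to picture the gate is as a threshold over reranker scores. The sketch below assumes the reranker returns (passage, score) pairs with higher scores meaning more relevant; the threshold of 0.5, the top_n of 3, and all names are hypothetical, and a real gate would be calibrated against the golden set.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class GateDecision:
    answer: bool        # False means the bot should abstain
    context: List[str]  # passages allowed through to generation


def relevance_gate(scored: List[Tuple[str, float]],
                   threshold: float = 0.5,
                   top_n: int = 3) -> GateDecision:
    """Keep the top_n passages that clear the threshold; abstain if none do."""
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    kept = [text for text, score in ranked[:top_n] if score >= threshold]
    return GateDecision(answer=bool(kept), context=kept)


# Hypothetical reranker outputs.
strong = relevance_gate([("refund policy passage", 0.91), ("unrelated passage", 0.12)])
weak = relevance_gate([("unrelated passage", 0.12)])
```

When `answer` is False, the caller would return an explicit "I don't know" response rather than calling the LLM with weak context.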

Privacy by design. PII is scrubbed before anything reaches the LLM, and data queries execute locally, so the external model never sees tenant data.
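A scrubbing pass of this kind might look like the following sketch. The three patterns (email, IBAN-like strings, card-like digit runs) and the placeholder tokens are assumptions chosen for illustration; regular expressions alone are not a complete PII defense, and the source does not describe which detectors the real pass uses.

```python
import re

# Illustrative patterns only; each one is an assumption, not the project's rule set.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
IBAN = re.compile(r"\b[A-Z]{2}\d{2}(?:\s?[A-Z0-9]{4}){2,7}(?:\s?[A-Z0-9]{1,3})?\b")
CARD = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")


def scrub_pii(text: str) -> str:
    """Replace PII-like spans with placeholder tokens before text leaves the system."""
    text = EMAIL.sub("[EMAIL]", text)
    # IBANs go before cards so their digit groups are not mistaken for card numbers.
    text = IBAN.sub("[IBAN]", text)
    text = CARD.sub("[CARD]", text)
    return text


sample = "Contacto: ana@example.com, IBAN ES91 2100 0418 4502 0005 1332, tarjeta 4111 1111 1111 1111."
scrubbed = scrub_pii(sample)
```

Scrubbing is the second line of defense here; the first is that tenant data queries run locally and their results are never placed in the external prompt.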

Hardening. A configurable CPU/GPU/remote compute backend, plus per-stage kill-switches and fallbacks, keep the system answering even when one component is degraded.
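The kill-switch and fallback pattern can be sketched as a small wrapper around each stage. Everything named here (run_stage, the switches mapping, the failing reranker) is hypothetical; the sketch assumes each stage takes and returns a list of candidate ids and that the fallback for reranking is to keep the fused order.

```python
from typing import Callable, List, Mapping

Stage = Callable[[List[str]], List[str]]


def run_stage(name: str, stage: Stage, fallback: Stage,
              candidates: List[str], switches: Mapping[str, bool]) -> List[str]:
    """Run one pipeline stage, or its fallback if the stage is switched off or fails."""
    if not switches.get(name, True):
        # Kill-switch: the stage is disabled by configuration.
        return fallback(candidates)
    try:
        return stage(candidates)
    except Exception:
        # Degrade instead of failing the whole request.
        return fallback(candidates)


def broken_reranker(candidates: List[str]) -> List[str]:
    raise RuntimeError("reranker backend unavailable")


def keep_fused_order(candidates: List[str]) -> List[str]:
    return list(candidates)


fused = ["doc_a", "doc_c", "doc_b"]
result = run_stage("rerank", broken_reranker, keep_fused_order, fused, {"rerank": True})
```

A production version would also log which path ran, so a degraded stage is visible rather than silent.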

METHOD

Claude did

Scaffolded the retrieval and ingestion modules, wrote the RRF fusion and reranker glue, drafted the PII-scrubbing pass, and generated harness code for the evaluation suite from my specifications.

Guillermo did

Owned the architecture (the hybrid-plus-gate design, the privacy boundary, and the backend abstraction), designed the 150-question golden set and the retrieval probe, and reviewed every commit before it merged.

One exchange

Early answers were drifting and I suspected the generation model. Instead of swapping models on a hunch, I had Claude build a benchmark against the golden set that measured retrieval separately from generation. The data showed the bottleneck was retrieval, specifically Spanish synonyms, and not the model. That result is why the Spanish full-text path and RRF exist. The benchmark is now part of how I diagnose this class of system, because without it you can spend weeks tuning the wrong layer.
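A retrieval probe of this kind can be as simple as a hit rate at k over the golden set, computed without calling the LLM at all. The sketch below is an assumption about the shape of such a probe, not the project's harness; the two questions, the document ids, and both toy retrievers are invented for illustration and are not project data or results.

```python
from typing import Callable, Dict, List, Set


def hit_rate_at_k(golden: Dict[str, Set[str]],
                  retrieve: Callable[[str], List[str]],
                  k: int = 5) -> float:
    """Fraction of questions with at least one relevant document in the top k."""
    if not golden:
        return 0.0
    hits = sum(1 for question, relevant in golden.items()
               if relevant & set(retrieve(question)[:k]))
    return hits / len(golden)


# Toy golden set: two phrasings of the same need, one using a synonym.
golden = {
    "¿Cómo emito un reembolso?": {"doc_refunds"},
    "¿Cómo hago una devolución?": {"doc_refunds"},
}


def toy_dense(question: str) -> List[str]:
    # Stand-in for a retriever that misses the synonym "devolución".
    return ["doc_refunds"] if "reembolso" in question else ["doc_invoices"]


def toy_hybrid(question: str) -> List[str]:
    # Stand-in for a retriever that handles both phrasings.
    return ["doc_refunds", "doc_invoices"]


dense_score = hit_rate_at_k(golden, toy_dense)
hybrid_score = hit_rate_at_k(golden, toy_hybrid)
```

Because the score depends only on retrieved ids, a low value points at retrieval even when the generation model is never run.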

Stack

Models: external LLM (generation) · cross-encoder reranker
Retrieval: pgvector (dense) · Spanish Postgres full-text · RRF fusion · relevance gate
Data: PostgreSQL · local data queries · PII scrubbing before the LLM
Infra: configurable CPU/GPU/remote compute backend · per-stage kill-switches and fallbacks

Outcome

Built and validated against a 150-question golden set; scheduled to deploy to production in [PLACEHOLDER — Guillermo to confirm: target window, ~a few months].
Hybrid retrieval plus RRF closed the Spanish synonym gap that pure vector search missed (validated against the golden set).
Privacy boundary holds by construction: the external LLM never receives tenant data.
[PLACEHOLDER — Guillermo to confirm: golden-set retrieval/answer score]

Lessons

Benchmark before you swap models: here the bottleneck was retrieval, not generation.
In Spanish-first retrieval, lexical and synonym matching earn their place beside vectors.
A relevance gate that lets the bot abstain is a privacy and trust feature, not a fallback.

Artifacts

Repo: CONFIDENTIAL
Demo / pre-production system: ON-REQUEST

Next steps

[PLACEHOLDER — Guillermo to confirm: version 2 direction, e.g. expanding the golden set or adding query-rewrite for synonyms]
Promote the retrieval probe and RAGAS-lite scorer into a shared evaluation harness across projects.