Retrieval-augmented generation (RAG) is easy to demo and hard to run in production. The gap is rarely the LLM alone—it is everything around it: how documents are split, indexed, searched, scored, and monitored under real load and real data drift. After more than a dozen enterprise deployments across finance, energy, and healthcare, the same themes keep appearing.
Below is a concise pattern language we use when moving from prototype to something we would trust with customer-facing or analyst-facing workloads.
Chunking is a product decision, not a tokenizer setting
Fixed-size chunks are a reasonable baseline, but production systems almost always need structure-aware splitting: respect headings, tables, and list hierarchies so that a single retrieval unit does not blend unrelated claims. Where regulations apply, provenance matters—chunks should map cleanly back to source pages or sections for audit.
Overlap between chunks is a trade-off: more overlap reduces boundary effects but increases index size and duplicate noise. We typically tune overlap after we have a labeled retrieval eval, not before.
[Diagram: structure-aware splits followed by sliding windows with overlap at boundaries]
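To make the sliding-window idea concrete, here is a minimal sketch of an overlap chunker. It measures chunks in characters for simplicity (production systems usually count tokens), and the function name and dict fields are illustrative, not from any particular library:

```python
def chunk_with_overlap(text, chunk_size=800, overlap=100):
    """Split text into fixed-size chunks with a sliding-window overlap.

    Sizes are in characters here for simplicity; production systems
    usually count tokens. Recording the start offset of each chunk
    keeps a clean mapping back to the source for provenance/audit.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append({"text": text[start:start + chunk_size], "start": start})
        if start + chunk_size >= len(text):
            break
    return chunks
```

In practice you would run this only within the structure-aware units (a section, a table, a list) rather than across the whole document, so a single chunk never blends unrelated claims.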
Embeddings, hybrid search, and when to add reranking
Dense retrieval with a strong embedding model handles paraphrase well; keyword/BM25-style signals still win on exact identifiers, SKUs, and short proper nouns. Hybrid search—combining vector and lexical scores—is the default for heterogeneous enterprise corpora.
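One simple, widely used way to combine vector and lexical result lists is reciprocal rank fusion (RRF), which needs only the two rankings, not comparable scores. A minimal sketch, assuming each retriever returns an ordered list of document IDs:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc IDs into one hybrid ranking.

    rankings: list of lists, each ordered best-first (e.g. one from a
    vector index, one from BM25). k=60 is the conventional constant
    from the RRF literature; it damps the influence of low ranks.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)
```

Because RRF works on ranks rather than raw scores, it sidesteps the awkward problem of normalizing cosine similarities against BM25 scores, which is one reason it is a common default for heterogeneous corpora.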
Cross-encoder or lightweight rerankers on the top-k candidates often deliver the largest precision jump for a modest latency cost. The pattern we see work: retrieve broadly, rerank narrowly, then let the LLM operate on a small, high-confidence context window.
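The retrieve-broadly, rerank-narrowly pattern can be sketched as below. The `score_fn` stands in for a real cross-encoder that scores (query, passage) pairs jointly; the toy `overlap_score` is only there so the example runs self-contained:

```python
def rerank(query, candidates, score_fn, top_n=5):
    """Re-score a broad candidate set and keep only the best few.

    candidates: the top-k passages from hybrid retrieval (k might be
    50-100); top_n is the small, high-confidence set handed to the LLM.
    """
    scored = sorted(candidates, key=lambda doc: score_fn(query, doc), reverse=True)
    return scored[:top_n]

def overlap_score(query, doc):
    # Toy stand-in for a cross-encoder: fraction of query terms in the doc.
    terms = set(query.lower().split())
    return len(terms & set(doc.lower().split())) / max(len(terms), 1)
```

Swapping `overlap_score` for a real cross-encoder changes only the callable; the surrounding pipeline shape stays the same, which makes the latency/precision trade-off easy to A/B test.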
Evaluation that matches user tasks
Offline metrics on held-out Q&A pairs catch regressions early, but production RAG fails in ways benchmarks miss: stale documents, permission boundaries, and ambiguous user intent. Pair offline suites with task-grounded checks: groundedness (does the answer stick to the retrieved context?), citation correctness, and periodic human review on stratified samples.
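As an illustration of a task-grounded check, here is a crude groundedness proxy: the share of answer sentences whose content words mostly appear in the retrieved context. A real pipeline would use an NLI model or an LLM judge; this heuristic and its threshold are assumptions for the sketch:

```python
def groundedness(answer_sentences, context, threshold=0.6):
    """Fraction of answer sentences 'supported' by the retrieved context.

    A sentence counts as supported if at least `threshold` of its terms
    appear in the context. Cheap enough to run on every request as a
    tripwire, even if human review remains the ground truth.
    """
    context_terms = set(context.lower().split())
    supported = 0
    for sentence in answer_sentences:
        terms = set(sentence.lower().split())
        if terms and len(terms & context_terms) / len(terms) >= threshold:
            supported += 1
    return supported / max(len(answer_sentences), 1)
```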
Logging retrieval scores, chosen chunk IDs, and model versions makes incidents debuggable. When answers drift, you can usually trace whether the corpus, the index, or the model changed.
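A minimal version of that logging is one JSON line per request capturing the query, the chosen chunks with their scores, and the model and index versions. Field names here are illustrative, not a standard schema:

```python
import json
import time

def log_retrieval(query, results, model_version, index_version):
    """Emit one JSON line per request so an incident can be traced to
    a change in the corpus, the index, or the model.

    results: list of dicts with 'id' and 'score' for each chosen chunk.
    """
    record = {
        "ts": time.time(),
        "query": query,
        "model_version": model_version,
        "index_version": index_version,
        "chunks": [{"id": r["id"], "score": r["score"]} for r in results],
    }
    print(json.dumps(record, sort_keys=True))
    return record
```

With these lines in place, "answers drifted last Tuesday" becomes a query over logs: did retrieval scores drop, did the chunk IDs change, or did the model version roll?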
Closing
Production RAG is an evolving system: corpora change, models update, and user questions shift. Investing in retrieval quality, observability, and clear ownership between data and application teams pays off more than chasing the latest model name alone.
If you are planning a RAG rollout and want a second pair of eyes on architecture or evaluation design, we are happy to talk—reach out via our contact page.