Why Your RAG Pipeline Fails in Production

Tutorials show RAG working at 95% recall on toy datasets. Production systems hit 60%. The gap isn't the technology — it's three implementation choices that look harmless and aren't.

I run RAG over my own transceiver corpus for TIP, every day, in production. The tutorial recall numbers and the production recall numbers are different worlds. Below is where the gap actually lives.

The standard RAG tutorial: load documents, chunk at 512 tokens, embed, store in a vector database, retrieve at query time, pass to the LLM. Recall on the tutorial's 50-document dataset: 90–95%. Take that pipeline to 50,000 documents with varied formats, multiple languages, and real user queries that don't look like document headers, and recall drops to 55–65%.

Failure 1: Chunking Destroys Semantic Coherence

Fixed-size 512-token chunking is the default because it's simple. It's also the biggest recall killer for most document types. A chunk boundary that falls mid-sentence can separate a claim from its supporting data, and each resulting chunk gets an embedding that represents neither the claim nor the data accurately. Those chunks then rank poorly for queries about either one.

What works: semantic chunking on document structure. Paragraphs as units for prose. Sections for technical docs. Tables as atomic units — never split a table. The chunks are variable-size but structurally coherent. Embedding quality improves because each chunk represents one complete idea.

Implementation: LangChain's RecursiveCharacterTextSplitter with structure-aware separators gets you 70% of the way. For higher quality, parse the document structure first (Docling for PDFs, marked for Markdown) and chunk on structural boundaries.
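To make "chunk on structural boundaries" concrete, here is a minimal sketch in plain Python. It assumes the document has already been converted to text where blank lines separate blocks and table rows contain a pipe character. The names chunk_by_structure, is_table_block, and max_chars are made up for this illustration, and merging short paragraphs up to a size budget is my own simplification, not something a parser like Docling does for you.

```python
import re


def is_table_block(block: str) -> bool:
    """Assumption: a block is a table when every non-empty line contains a pipe."""
    lines = [ln for ln in block.splitlines() if ln.strip()]
    return bool(lines) and all("|" in ln for ln in lines)


def chunk_by_structure(text: str, max_chars: int = 1200) -> list[str]:
    """Split on blank lines, merge short paragraphs, keep tables atomic."""
    blocks = [b.strip() for b in re.split(r"\n\s*\n", text) if b.strip()]
    chunks: list[str] = []
    buffer = ""
    for block in blocks:
        if is_table_block(block):
            # Flush pending prose, then emit the table as its own chunk.
            if buffer:
                chunks.append(buffer)
                buffer = ""
            chunks.append(block)
            continue
        candidate = f"{buffer}\n\n{block}" if buffer else block
        if len(candidate) <= max_chars:
            buffer = candidate
        else:
            # Never cut inside a paragraph: an oversized one stays whole.
            if buffer:
                chunks.append(buffer)
            buffer = block
    if buffer:
        chunks.append(buffer)
    return chunks
```

Chunks come out variable-size, but every boundary falls between paragraphs or around a table, never inside one. Real documents need a real parser in front of this; the sketch only shows the boundary logic.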

Failure 2: Embedding Model Mismatch

You embedded your corpus with text-embedding-ada-002. Your queries arrive as short, keyword-heavy phrases ("transceiver 400G ZR compatibility"). Ada-002 was optimized for symmetric similarity — comparing text of similar length and structure. Short queries against long document embeddings produce systematically lower similarity scores than expected.

Embedding Models by Query Pattern
text-embedding-ada-002: Symmetric matching, long-to-long. Legacy, works for chatbots.
text-embedding-3-small/large: Better asymmetric matching, short queries to long docs.
nomic-embed-text (local): Strong asymmetric recall, 768-dim, runs via Ollama.
bge-m3 (local): Multilingual, long-context, best for technical docs.

The rule: re-embed your entire corpus when you change models. Mixing embeddings from different models in one vector space is a correctness disaster, because similarity scores are not comparable across models. If you want to A/B test a new model, Qdrant supports named vectors, so each point in a collection can carry one vector per model, kept separate, without a collection migration.
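One way to enforce the re-embed rule in code is to record which model produced an index and reject anything else at write time. The sketch below is a toy in-memory stand-in, not a vector database client; ModelScopedIndex and needs_reembedding are hypothetical names invented for this example.

```python
from dataclasses import dataclass, field


@dataclass
class ModelScopedIndex:
    """Toy index that only accepts vectors produced by one embedding model."""
    model_name: str
    dim: int
    vectors: dict[str, list[float]] = field(default_factory=dict)

    def add(self, chunk_id: str, vector: list[float], model_name: str) -> None:
        # Reject vectors from any other model: their scores are not comparable.
        if model_name != self.model_name:
            raise ValueError(
                f"index holds {self.model_name!r} vectors, got {model_name!r}"
            )
        # A dimension mismatch usually means the wrong model was called.
        if len(vector) != self.dim:
            raise ValueError(f"expected {self.dim} dims, got {len(vector)}")
        self.vectors[chunk_id] = list(vector)


def needs_reembedding(index: ModelScopedIndex, new_model_name: str) -> bool:
    """True when switching models means the whole corpus must be re-embedded."""
    return index.model_name != new_model_name
```

The dimension check is only a backstop. Two different models can share a dimension, so the model name is the field that actually protects you.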

Failure 3: No Reranking

Top-K=5 retrieval returns the 5 most similar chunks by cosine similarity, not the 5 most relevant chunks for answering the query. Vector similarity is a proxy for relevance, not relevance itself.

The fix: retrieve top-K=20, rerank with a cross-encoder, pass top-5 to the LLM. Cross-encoders (BGE reranker, Cohere reranker, cross-encoder/ms-marco-MiniLM-L-6-v2) read the query and the document together and score each pair jointly. That is too expensive to run over a whole corpus but affordable at K=20. Recall improvement: typically 15–25 percentage points.

Retrieved K is a recall parameter. Passed K is a precision parameter. Tune them independently. Rerank down to 5; don't retrieve 20 and pass all 20.
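The two-stage flow can be sketched with the two K values as separate parameters. This is an illustration under assumptions: retrieve_then_rerank and token_overlap_score are made-up names, and the token-overlap scorer is only a dependency-free stand-in for a real cross-encoder so the example runs anywhere.

```python
import math
from typing import Callable, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def token_overlap_score(query: str, passage: str) -> float:
    """Stand-in for a cross-encoder: share of query tokens found in the passage."""
    q_tokens = set(query.lower().split())
    p_tokens = set(passage.lower().split())
    return len(q_tokens & p_tokens) / len(q_tokens) if q_tokens else 0.0


def retrieve_then_rerank(
    query_text: str,
    query_vec: Sequence[float],
    chunks: Sequence[tuple[str, Sequence[float]]],
    rerank_score: Callable[[str, str], float],
    retrieve_k: int = 20,
    pass_k: int = 5,
) -> list[str]:
    # Stage 1 (recall): cheap vector similarity over all chunks.
    candidates = sorted(
        chunks, key=lambda c: cosine(query_vec, c[1]), reverse=True
    )[:retrieve_k]
    # Stage 2 (precision): the expensive scorer sees only the candidates.
    reranked = sorted(
        candidates, key=lambda c: rerank_score(query_text, c[0]), reverse=True
    )
    return [text for text, _ in reranked[:pass_k]]
```

In a real pipeline, rerank_score would wrap a model such as cross-encoder/ms-marco-MiniLM-L-6-v2, and you would batch the 20 query-passage pairs into one call rather than scoring them one at a time.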

Combined Impact

Each fix delivers 10–15% recall improvement independently. Combined, semantic chunking, a matched embedding model, and retrieve-then-rerank close the gap between tutorial recall and production recall to under 10 percentage points. That's the target. None of these require new infrastructure; they're implementation choices in code you already have.
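To check whether the gap is actually closing, it helps to measure recall the same way before and after each change. The article does not pin down a recall definition, so this sketch assumes one: for each query, the fraction of its labeled relevant chunk IDs that appear in the top k results, averaged over queries. The function names and the labeled-query format are illustrative.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant chunk IDs that appear in the top k results."""
    if not relevant:
        return 0.0
    top_k = set(retrieved[:k])
    return len(top_k & relevant) / len(relevant)


def mean_recall_at_k(runs: list[tuple[list[str], set[str]]], k: int) -> float:
    """Average recall@k over (retrieved_ids, relevant_ids) pairs.

    Queries with no labeled relevant chunks are skipped.
    """
    scores = [recall_at_k(ids, rel, k) for ids, rel in runs if rel]
    return sum(scores) / len(scores) if scores else 0.0
```

Run it on a fixed set of real user queries with hand-labeled relevant chunks, once per pipeline variant, and compare the variants at the same k.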