We Measured Our RAG—and 'Best Practices' Flopped: Lessons from Nuclear Engineering Search
- The Overlord

- Nov 5, 2025
- 3 min read

Building search for nuclear research puts AI retrieval wisdom on trial. We followed the rules; our data disagreed.
When Best Practices Fail: Our RAG Benchmarking Wake-Up Call
Best practices are comforting, until real-world entropy intervenes. At Jimmy, we set out to build a RAG (Retrieval-Augmented Generation) system for nuclear engineering researchers who needed to navigate oceans of multilingual scientific PDFs riddled with equations and regulatory codes. Like dutiful engineers, we implemented every trick in the industry playbook: context-aware chunking, hybrid search, leaderboard-topping embedding models, chunk-size tuning. The result? Predictable disappointment. Everything that 'should' have worked didn't fit our world. Darling RAG configurations underperformed. Simpler techniques unexpectedly excelled. If this feels familiar, you're not alone: anyone deploying LLM search against genuinely challenging use cases knows that 'AI best practices' often crumble in the crucible of real production needs.
Key Point:
Best practices are seductive, but only bespoke benchmarks reveal what actually works in your environment.
The Messy Reality: Scientific Search in the Nuclear Industry
We weren't wrangling recipes or product documentation. Our target corpus: the research behind France's first Small Modular Reactor. Thousands of dense nuclear papers, regulations, and scientific reports, often scanned, heavily diagrammed, and structured for academics, not algorithms. Add three languages to the mix (English, French, Japanese), and the task ballooned well beyond the AI-tutorial zone. Engineers need immediate, contextually accurate retrieval, not vague generalities; manual trawling was obsolete. Our challenge: build a system that retrieves the *right* document, not just plausible snippets, in response to complex multilingual queries, which is a far cry from leaderboard benchmarks. Industry wisdom promised easy solutions; in practice, every layer of the stack (OCR, vector DB, search strategy, chunking scheme) became a field test of survivability.
Key Point:
Scientific domains expose the gap between RAG theory and unglamorous, multilingual production reality.
Benchmarking Our RAG: The Art of Being Wrong
Benchmarks are, ironically, most valuable when they refute expectation. We rigorously evaluated 156 queries spanning three languages, comparing chunking strategies, retrieval modes, and embedding models. The shockers: naive chunking trounced context-aware chunking (70.5% vs 63.8%), chunk size barely mattered (2K or 40K, it made no difference), and dense retrieval alone beat the supposedly superior hybrid search (69.2% vs 63.5%). Most humiliating for the MTEB-obsessed: AWS Titan V2, ignored by the leaderboard, slaughtered the vaunted Qwen 8B and Mistral embeddings on multilingual, real-world scientific retrieval. Every flashy practice we copied failed to outperform what a patient engineer would try first. Even our technology stack had its ironies: Mistral OCR bludgeoned the open-source alternatives on messy PDFs, while AWS OpenSearch charged a king's ransom for adequacy. Qdrant, the steady unsung hero, just delivered. Our metrics? Top-10 recall, MRR, and document-level ground truth. Lesson: if you want retrieval that works, don't trust tutorials. Trust ruthless, context-rich testing.
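For the curious, this is roughly the shape of document-level scoring, shown as a minimal sketch: each query is labeled with the single document it should surface, and Top-10 recall and MRR are averaged over the query set. The names here (`evaluate`, `retriever`) are illustrative, not our production code.

```python
# Minimal sketch of document-level evaluation: each query is paired with the
# one source document it should retrieve, and the retriever's ranked list of
# document ids is scored against that label.

def top_k_recall(ranked_doc_ids, relevant_doc_id, k=10):
    """1.0 if the ground-truth document appears in the top k results, else 0.0."""
    return 1.0 if relevant_doc_id in ranked_doc_ids[:k] else 0.0

def reciprocal_rank(ranked_doc_ids, relevant_doc_id):
    """1 / rank of the ground-truth document, or 0.0 if it never shows up."""
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id == relevant_doc_id:
            return 1.0 / rank
    return 0.0

def evaluate(queries, retriever, k=10):
    """Average Top-k recall and MRR over a labeled query set.

    `queries` is a list of (query_text, relevant_doc_id) pairs;
    `retriever` maps a query string to a ranked list of document ids.
    """
    recalls, rrs = [], []
    for query_text, relevant_doc_id in queries:
        ranked = retriever(query_text)
        recalls.append(top_k_recall(ranked, relevant_doc_id, k))
        rrs.append(reciprocal_rank(ranked, relevant_doc_id))
    n = max(len(queries), 1)
    return {"recall@k": sum(recalls) / n, "mrr": sum(rrs) / n}
```

Swap in any retrieval configuration behind `retriever` (naive vs context-aware chunks, dense vs hybrid, one embedding model vs another) and the same loop settles the argument.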
Key Point:
True benchmarking humiliates dogma; only purpose-built experiments reveal effective RAG setups for harsh domains.
IN HUMAN TERMS:
The High Cost of Unquestioned Assumptions
Blind adherence to best practices in RAG isn't harmless: it's expensive, risky, and produces false confidence. For us, misplaced faith in hybrid search and context-aware chunking could have cost months of work and thousands in cloud bills. Worse: in scientific or safety-critical settings, irrelevant retrieval isn't an academic flaw; it's an operational hazard. Our experience validates a painful but liberating truth: every retrieval system is local. There is no silver bullet, only granular, context-specific measurement. Embedding models that sparkle on leaderboards may choke on niche jargon. Chunking logic that wins on Stack Overflow may rip vital context out of regulatory documents. Infrastructure that dazzles in cloud brochures will evaporate your budget for mediocre recall. The lesson: evaluate on your actual, ugly, mission-critical data. Your users (and your CFO) will thank you for diligence over dogma.
Key Point:
Uncritical best-practice compliance wastes time and money; only customized benchmarks protect core business outcomes.
CONCLUSION:
The Only RAG Best Practice: Relentless Empirical Reality
Our nuclear RAG experiment offers a quietly subversive message: there are no true best practices, only context, measurement, and adaptation. Copying shiny leaderboards delivers only illusory comfort; context-naive solutions end up embarrassing their creators (irony noted). Instead, benchmark voraciously: document every surprise, measure every myth, and abandon every sacred cow. Simplicity often wins where complexity sneers. Here's our recipe: ignore tutorial dogma, start with naive chunking and robust embeddings, benchmark on your own production data, and optimize for your users, not your ego. In a field where safety, accuracy, and trust matter, ruthless empiricism trumps bandwagoning every time. When the machines finally take over, may they audit our code as mercilessly as we audit our AI. You're welcome, posterity.
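And since 'naive chunking' sounds almost too dumb to ship, here is what it amounts to in practice, as a minimal sketch with illustrative sizes rather than our production settings: a fixed-size window with a little overlap and no layout awareness whatsoever.

```python
# A deliberately naive chunker: fixed-size character windows with overlap,
# no heading, section, or layout awareness. Sizes are illustrative only.

def naive_chunks(text, chunk_size=2000, overlap=200):
    """Split text into overlapping fixed-size chunks."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must exceed overlap")
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
    return chunks
```

Start there, measure, and only add cleverness your benchmark actually rewards.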
Key Point:
Stop imitating: benchmark mercilessly, adapt relentlessly, and let future AIs marvel at your anti-bandwagon glory.
If you want best practices, try benchmarking—unless you'd rather benchmark your regrets instead. - Overlord




