
RAG Gets Smarter: Two Roads to Better Retrieval and Reasoning

Written by Ruslan Kryvosheiev | Jun 24, 2025

 

Retrieval-Augmented Generation (RAG) has become the go-to architecture for enterprise AI systems that need accuracy, adaptability, and traceability. But if you've spent time tuning a RAG pipeline in production, you know the problem: retrieval often fails, and even when it succeeds, the model doesn't always reason effectively over the context it's given.

Two recent approaches offer very different paths forward. One focuses on teaching the model to reason better. The other says: leave the model alone—just fix the data pipeline.

Let's walk through both.

 

RARE: Train the Model to Reason, Not Memorize

The RARE approach—short for Retrieval-Augmented Reasoning Modeling—was introduced in a March 2025 research paper. Its premise is simple but powerful: instead of pushing your model to memorize domain knowledge, train it to reason from retrieved evidence while externalizing that knowledge to retrievable sources.

How RARE Works:

  • Knowledge Externalization: Domain knowledge is stored in external, retrievable databases rather than in model parameters.

  • Reasoning Internalization: The model is fine-tuned on curated datasets that emphasize contextualized reasoning over memorization.

  • Contextual Integration: During training, retrieved knowledge is injected into prompts, transforming the learning objective from rote memorization to knowledge application.

 

Simple Example:

Traditional Approach: Train a medical model to memorize "Hypertension drugs include ACE inhibitors, beta-blockers..."


RARE Approach: Train the model to reason: "Given retrieved drug information [external database], this patient with hypertension and diabetes would benefit from ACE inhibitors because they provide cardiovascular protection in diabetics, as evidenced by multiple retrieved studies showing..."
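
In code terms, the shift shows up in what a single training record looks like. Here is a purely illustrative sketch of the two kinds of records (the field names and contents are hypothetical):

Python

# Traditional fine-tuning: the knowledge itself is the target
memorization_sample = {
    "input": "What drugs treat hypertension?",
    "output": "Hypertension drugs include ACE inhibitors, beta-blockers...",
}

# RARE-style fine-tuning: knowledge arrives in the input, reasoning is the target
rare_sample = {
    "input": "Documents: [retrieved drug guidelines]\n"
             "Question: Which drug class suits a patient with hypertension and diabetes?",
    "output": "<think>ACE inhibitors provide cardiovascular protection in diabetics, "
              "per the retrieved studies...</think><answer>ACE inhibitors</answer>",
}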

RARE Implementation

Here's how to implement RARE training:

Python

# 1. Generate training data with retrieved context
def create_rare_training_data(question, retrieved_docs, teacher_model):
    """Distill a contextualized reasoning chain from a teacher model (e.g. QwQ-32B)."""
    prompt = f"""
You are a medical expert. Use the retrieved documents to answer the question.
Think step-by-step and show your reasoning.

# Retrieved Documents
{retrieved_docs}

# Question
{question}

Format: <think>reasoning</think><answer>final_answer</answer>
"""

    # Generate a reasoning chain with the teacher model; retry if the
    # <think>/<answer> format fails to parse
    response = teacher_model.generate(prompt, max_retries=8)
    return {
        "input": f"Documents: {retrieved_docs}\nQuestion: {question}",
        "output": response,
    }


# 2. Fine-tune with contextualized reasoning
from transformers import Trainer, TrainingArguments

def train_rare_model(base_model, training_data):
    training_args = TrainingArguments(
        output_dir="./rare-model",
        num_train_epochs=5,
        learning_rate=1e-5,
        per_device_train_batch_size=8,
        warmup_ratio=0.05,
        logging_steps=100,
    )

    trainer = Trainer(
        model=base_model,
        args=training_args,
        train_dataset=training_data,
        # Custom collator: compute the loss on the reasoning/answer tokens,
        # not on the retrieved documents, so training rewards knowledge
        # application rather than memorization
        data_collator=ContextualReasoningCollator(),
    )

    trainer.train()
    return trainer.model


# 3. Inference with retrieved knowledge
def rare_inference(model, question, retriever):
    # Retrieve relevant documents for the question
    retrieved_docs = retriever.search(question, top_k=3)

    prompt = f"""
Use the retrieved documents to answer the question with step-by-step reasoning.

Documents: {retrieved_docs}
Question: {question}
"""
    return model.generate(prompt)
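
Putting the three steps together is straightforward. A minimal sketch, assuming you already have a retriever, a teacher model wrapper, a base model, and a list of domain questions (all placeholders here):

Python

# End-to-end sketch: build the distillation dataset, fine-tune, then answer
training_data = [
    create_rare_training_data(q, retriever.search(q, top_k=3), teacher_model)
    for q in domain_questions
]
# In practice, tokenize these pairs and wrap them in a datasets.Dataset first
rare_model = train_rare_model(base_model, training_data)

answer = rare_inference(
    rare_model,
    "Which antihypertensive class suits a patient with diabetes?",
    retriever,
)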

 

Anthropic's Contextual Retrieval: Don't Train—Preprocess Smarter


Anthropic offers a completely different take with its Contextual Retrieval system. They skip model training altogether and focus on the retrieval pipeline—specifically, how you prepare and embed the data before it's stored. Before each chunk is embedded and indexed, an LLM generates a short, chunk-specific snippet of context from the full document and prepends it to the chunk, so every chunk carries enough surrounding information to be found on its own.

Simple Example:

Traditional Chunk: "Revenue grew by 3% over the previous quarter."


Contextualized Chunk: "This chunk is from an SEC filing on ACME Corp's Q2 2023 performance; the previous quarter's revenue was $314 million. Revenue grew by 3% over the previous quarter."

Contextual Retrieval Implementation

Here's how to implement Contextual Retrieval:

Python

import string

import anthropic
import numpy as np
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer


class ContextualRetriever:
    def __init__(self, anthropic_api_key, use_reranking=False):
        self.client = anthropic.Anthropic(api_key=anthropic_api_key)
        self.embedder = SentenceTransformer('all-MiniLM-L6-v2')
        self.use_reranking = use_reranking  # plug in a reranker if desired
        self.contextual_chunks = []
        self.embeddings = None
        self.bm25 = None

    def split_document(self, document, chunk_size=400):
        """Naive word-based chunking; swap in your own splitter in production."""
        words = document.split()
        return [
            " ".join(words[i:i + chunk_size])
            for i in range(0, len(words), chunk_size)
        ]

    def contextualize_chunk(self, chunk, document):
        """Generate contextual information for a chunk using Claude"""
        prompt = f"""
<document>
{document}
</document>

Here is the chunk we want to situate within the whole document:
<chunk>
{chunk}
</chunk>

Please give a short succinct context to situate this chunk within the overall
document for the purposes of improving search retrieval of the chunk.
Answer only with the succinct context and nothing else.
"""
        response = self.client.messages.create(
            model="claude-3-haiku-20240307",
            max_tokens=100,
            messages=[{"role": "user", "content": prompt}]
        )
        return response.content[0].text

    def create_contextual_embeddings(self, documents):
        """Create contextual embeddings for all chunks"""
        contextual_chunks = []
        embeddings = []

        for doc in documents:
            # Split document into chunks
            chunks = self.split_document(doc, chunk_size=400)

            for chunk in chunks:
                # Generate context for the chunk and prepend it
                context = self.contextualize_chunk(chunk, doc)
                contextualized_chunk = f"{context}. {chunk}"
                contextual_chunks.append(contextualized_chunk)

                # Embed the contextualized chunk
                embeddings.append(self.embedder.encode(contextualized_chunk))

        return contextual_chunks, np.array(embeddings)

    def contextual_bm25_preprocessing(self, contextual_chunks):
        """Prepare contextualized chunks for BM25 indexing"""
        tokenized_chunks = [self._tokenize(chunk) for chunk in contextual_chunks]
        return BM25Okapi(tokenized_chunks)

    @staticmethod
    def _tokenize(text):
        # Simple tokenization: lowercase, strip punctuation, split on whitespace
        return text.lower().translate(
            str.maketrans('', '', string.punctuation)
        ).split()

    def search(self, query, top_k=5):
        """Hybrid search: embeddings + BM25 + optional reranking"""
        # 1. Embedding-based search
        query_embedding = self.embedder.encode(query)
        semantic_scores = np.dot(self.embeddings, query_embedding)
        semantic_top_k = np.argsort(semantic_scores)[-top_k * 3:]

        # 2. BM25 search (tokenized the same way as the index)
        bm25_scores = self.bm25.get_scores(self._tokenize(query))
        bm25_top_k = np.argsort(bm25_scores)[-top_k * 3:]

        # 3. Combine results (simple union; rank fusion works better in practice)
        combined_indices = list(set(semantic_top_k) | set(bm25_top_k))

        # 4. Optional reranking step (e.g. a cross-encoder); not implemented here
        if self.use_reranking:
            reranked_results = self.rerank(query, combined_indices)
            return reranked_results[:top_k]

        return [self.contextual_chunks[i] for i in combined_indices[:top_k]]


# Usage example
def setup_contextual_retrieval(documents, anthropic_key):
    retriever = ContextualRetriever(anthropic_key)

    # Create contextual embeddings (one-time cost: ~$1.02/million document tokens)
    contextual_chunks, embeddings = retriever.create_contextual_embeddings(documents)
    retriever.embeddings = embeddings
    retriever.contextual_chunks = contextual_chunks

    # Set up the BM25 index over the same contextualized chunks
    retriever.bm25 = retriever.contextual_bm25_preprocessing(contextual_chunks)

    return retriever
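
From there, closing the RAG loop requires no fine-tuning: retrieve contextualized chunks and hand them to whatever model you already use for generation. A minimal sketch reusing the Claude client and model from above (the `contextual_rag_answer` name is our own, not Anthropic's):

Python

def contextual_rag_answer(question, retriever, client):
    # Retrieve contextualized chunks, then answer strictly from them
    chunks = retriever.search(question, top_k=5)
    context = "\n\n".join(chunks)
    prompt = f"Answer using only these documents:\n{context}\n\nQuestion: {question}"
    response = client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=500,
        messages=[{"role": "user", "content": prompt}]
    )
    return response.content[0].text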

 

Performance Comparison

Python

# Compare traditional vs contextual retrieval
def compare_retrieval_performance(queries, ground_truth, documents, api_key):
    # TraditionalRAG is a stand-in for your existing chunk-and-embed pipeline
    traditional_retriever = TraditionalRAG()
    contextual_retriever = setup_contextual_retrieval(documents, api_key)

    traditional_scores = []
    contextual_scores = []

    for query, expected_docs in zip(queries, ground_truth):
        # Traditional retrieval
        trad_results = traditional_retriever.search(query, top_k=20)
        trad_recall = calculate_recall_at_k(trad_results, expected_docs, k=20)
        traditional_scores.append(trad_recall)

        # Contextual retrieval
        ctx_results = contextual_retriever.search(query, top_k=20)
        ctx_recall = calculate_recall_at_k(ctx_results, expected_docs, k=20)
        contextual_scores.append(ctx_recall)

    print(f"Traditional failure rate: {1 - np.mean(traditional_scores):.3f}")
    print(f"Contextual failure rate: {1 - np.mean(contextual_scores):.3f}")
    print(f"Improvement: {(np.mean(contextual_scores) - np.mean(traditional_scores)) / np.mean(traditional_scores) * 100:.1f}%")
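
The `calculate_recall_at_k` helper is left undefined above; a minimal version (counting an expected document as found if its text appears in any of the top-k retrieved chunks) could look like this:

Python

def calculate_recall_at_k(retrieved_chunks, expected_docs, k=20):
    """Fraction of expected documents found within the top-k retrieved chunks."""
    top_chunks = retrieved_chunks[:k]
    hits = sum(
        any(expected in chunk for chunk in top_chunks)
        for expected in expected_docs
    )
    return hits / len(expected_docs) if expected_docs else 0.0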

RARE vs. Contextual Retrieval: Two Valid Paths

Here's how they compare:

Feature | RARE | Contextual Retrieval (Anthropic)
Primary Focus | Improve reasoning after retrieval | Improve retrieval before reasoning
Model Training Required? | Yes (fine-tuning) | No
Effort Concentration | Fine-tuning + curation | Preprocessing + indexing
Best For | Deep reasoning, structured domains | High-recall, broad-access systems
Infrastructure Load | Higher at inference | Higher at indexing time
Performance Gains | Up to 20% on reasoning tasks | Up to 67% reduction in failed retrievals

 

Combining Both Approaches

Python

class HybridRAGSystem:
    def __init__(self, documents, anthropic_key, rare_model_path):
        # Set up contextual retrieval for better document finding
        self.contextual_retriever = setup_contextual_retrieval(
            documents, anthropic_key
        )

        # Load the RARE-trained model for better reasoning
        self.rare_model = load_rare_model(rare_model_path)

    def answer_question(self, question):
        # 1. Use contextual retrieval to find the right documents
        relevant_docs = self.contextual_retriever.search(question, top_k=5)

        # 2. Use the RARE model to reason over the retrieved documents
        response = self.rare_model.generate(
            f"Documents: {relevant_docs}\nQuestion: {question}"
        )

        return response


# Best of both worlds
hybrid_system = HybridRAGSystem(documents, anthropic_key, "path/to/rare/model")
answer = hybrid_system.answer_question(
    "What are the contraindications for ACE inhibitors in diabetic patients?"
)

 

The Hybrid Future

In practice, we're seeing a growing trend toward hybrid pipelines:

  • Use contextual indexing to optimize retrieval accuracy
  • Pair it with lightweight reasoning fine-tuning for complex domains
  • Apply both techniques where retrieval quality and reasoning depth are critical

The goal is the same: make systems more helpful, not just more correct.

 

Why This Matters for Enterprise AI


RAG's strength has always been in its modularity. But that also means your system is only as good as its weakest component. Poor retrieval and weak reasoning both undermine user trust—and drive teams back to traditional search or brittle rule-based systems.

These approaches give us new tools to build smarter, leaner, and more interpretable AI systems. They shift the optimization conversation from "how big is your model?" to "how intelligent is your full pipeline?"

Contextual Retrieval offers immediate improvements with minimal infrastructure changes—perfect for teams that need better retrieval performance today. RARE provides a path to more sophisticated reasoning capabilities—ideal for teams building domain-specific AI that needs to handle complex, nuanced scenarios.

Both represent the maturation of RAG from a simple "search + generate" pattern into sophisticated reasoning architectures that can compete with much larger models.

 

Want to see how we apply these approaches in production?

We're actively experimenting with both RARE-style fine-tuning and contextual embedding pipelines inside live client systems.

If you're facing retrieval headaches—or trying to figure out if your RAG stack is even salvageable—let's connect.