Retrieval-Augmented Generation (RAG) has become the go-to architecture for enterprise AI systems that need accuracy, adaptability, and traceability. But if you've spent time tuning a RAG pipeline in production, you know the problem: retrieval often fails, and even when it succeeds, the model doesn't always reason effectively over the context it's given.
Two recent approaches offer very different paths forward. One focuses on teaching the model to reason better. The other says: leave the model alone—just fix the data pipeline.
Let's walk through both.
The RARE approach, short for Retrieval-Augmented Reasoning Modeling, was introduced in a recent research paper. Its premise is simple but powerful: instead of pushing your model to memorize domain knowledge, externalize that knowledge to retrievable sources and train the model to reason over the evidence it retrieves.
How RARE Works:
Knowledge Externalization: Domain knowledge is stored in external retrievable databases rather than in model parameters.
Reasoning Internalization: The model is fine-tuned on curated datasets that emphasize contextualized reasoning over memorization.
Contextual Integration: During training, retrieved knowledge is injected into prompts, transforming the learning objective from rote memorization to knowledge application.
Simple Example:
Traditional Approach: Train a medical model to memorize "Hypertension drugs include ACE inhibitors, beta-blockers..."
RARE Approach: Train the model to reason: "Given retrieved drug information [external database], this patient with hypertension and diabetes would benefit from ACE inhibitors because they provide cardiovascular protection in diabetics, as evidenced by multiple retrieved studies showing..."
Here's how to implement RARE training:
Python
# 1. Generate training data with retrieved context
def create_rare_training_data(question, retrieved_docs, teacher_model):
    """Distill a reasoning chain from a teacher model (e.g., a QwQ-32B wrapper)."""
    prompt = f"""
You are a medical expert. Use the retrieved documents to answer the question.
Think step-by-step and show your reasoning.
# Retrieved Documents
{retrieved_docs}
# Question
{question}
Format: <think>reasoning</think><answer>final_answer</answer>
"""
    # Generate the reasoning chain with the teacher model
    response = teacher_model.generate(prompt, max_retries=8)
    return {
        "input": f"Documents: {retrieved_docs}\nQuestion: {question}",
        "output": response,
    }

# 2. Fine-tune with contextualized reasoning
from transformers import Trainer, TrainingArguments

def train_rare_model(base_model, training_data):
    training_args = TrainingArguments(
        output_dir="./rare-model",
        num_train_epochs=5,
        learning_rate=1e-5,
        per_device_train_batch_size=8,
        warmup_ratio=0.05,
        logging_steps=100,
    )
    trainer = Trainer(
        model=base_model,
        args=training_args,
        train_dataset=training_data,
        # Custom collator (defined elsewhere) that masks the retrieved-document
        # tokens so the loss focuses on reasoning, not memorization
        data_collator=ContextualReasoningCollator(),
    )
    trainer.train()
    return trainer.model

# 3. Inference with retrieved knowledge
def rare_inference(model, question, retriever):
    # Retrieve relevant documents (any retriever exposing .search() works,
    # e.g., the ContextualRetriever defined later in this post)
    retrieved_docs = retriever.search(question, top_k=3)
    prompt = f"""
Use the retrieved documents to answer the question with step-by-step reasoning.
Documents: {retrieved_docs}
Question: {question}
"""
    return model.generate(prompt)
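One detail the sketch above glosses over is data preparation: Hugging Face's Trainer expects tokenized examples, not the raw strings returned by create_rare_training_data. Here is a minimal, hypothetical bridge between the two functions; the tokenizer name and max_length are assumptions for illustration, not values taken from the RARE paper.
Python
from datasets import Dataset
from transformers import AutoTokenizer

def prepare_rare_dataset(examples, tokenizer_name="Qwen/Qwen2.5-7B-Instruct"):
    """Turn create_rare_training_data() dicts into a tokenized HF Dataset."""
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)

    def tokenize(example):
        # Concatenate the retrieved context + question with the teacher's reasoning chain
        text = example["input"] + "\n" + example["output"]
        tokens = tokenizer(text, truncation=True, max_length=4096)
        # For plain causal-LM fine-tuning, labels mirror input_ids; a collator like
        # ContextualReasoningCollator would instead mask the document tokens
        tokens["labels"] = tokens["input_ids"].copy()
        return tokens

    return Dataset.from_list(examples).map(tokenize, remove_columns=["input", "output"])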
Anthropic offers a completely different take with its Contextual Retrieval approach. It skips model training altogether and focuses on the retrieval pipeline: specifically, how you prepare and embed the data before it's stored.
Simple Example:
Traditional Chunk: "Revenue grew by 3% over the previous quarter."
Contextualized Chunk: "This chunk is from an SEC filing on ACME Corp's Q2 2023 performance; the previous quarter's revenue was $314 million. Revenue grew by 3% over the previous quarter."
Here's how to implement Contextual Retrieval:
Python
import string

import anthropic
import numpy as np
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer

class ContextualRetriever:
    def __init__(self, anthropic_api_key, use_reranking=False):
        self.client = anthropic.Anthropic(api_key=anthropic_api_key)
        self.embedder = SentenceTransformer('all-MiniLM-L6-v2')
        self.use_reranking = use_reranking

    def split_document(self, document, chunk_size=400):
        """Naive whitespace chunking; swap in your preferred splitter."""
        words = document.split()
        return [
            " ".join(words[i:i + chunk_size])
            for i in range(0, len(words), chunk_size)
        ]

    def contextualize_chunk(self, chunk, document):
        """Generate contextual information for a chunk using Claude."""
        prompt = f"""
<document>
{document}
</document>
Here is the chunk we want to situate within the whole document:
<chunk>
{chunk}
</chunk>
Please give a short succinct context to situate this chunk within the overall
document for the purposes of improving search retrieval of the chunk.
Answer only with the succinct context and nothing else.
"""
        response = self.client.messages.create(
            model="claude-3-haiku-20240307",
            max_tokens=100,
            messages=[{"role": "user", "content": prompt}],
        )
        return response.content[0].text

    def create_contextual_embeddings(self, documents):
        """Create contextual embeddings for all chunks."""
        contextual_chunks = []
        embeddings = []
        for doc in documents:
            # Split document into chunks
            chunks = self.split_document(doc, chunk_size=400)
            for chunk in chunks:
                # Generate context for the chunk and prepend it
                context = self.contextualize_chunk(chunk, doc)
                contextualized_chunk = f"{context}. {chunk}"
                contextual_chunks.append(contextualized_chunk)
                # Embed the contextualized chunk
                embeddings.append(self.embedder.encode(contextualized_chunk))
        return contextual_chunks, np.array(embeddings)

    def contextual_bm25_preprocessing(self, contextual_chunks):
        """Prepare contextualized chunks for BM25 indexing."""
        # Simple tokenization: lowercase, strip punctuation, split on whitespace
        tokenized_chunks = []
        for chunk in contextual_chunks:
            tokens = chunk.lower().translate(
                str.maketrans('', '', string.punctuation)
            ).split()
            tokenized_chunks.append(tokens)
        # Create BM25 index
        return BM25Okapi(tokenized_chunks)

    def search(self, query, top_k=5):
        """Hybrid search: embeddings + BM25 + optional reranking."""
        # 1. Embedding-based search
        query_embedding = self.embedder.encode(query)
        semantic_scores = np.dot(self.embeddings, query_embedding)
        semantic_top_k = np.argsort(semantic_scores)[-top_k * 3:]
        # 2. BM25 search
        query_tokens = query.lower().split()
        bm25_scores = self.bm25.get_scores(query_tokens)
        bm25_top_k = np.argsort(bm25_scores)[-top_k * 3:]
        # 3. Combine candidates, ordered by semantic score (simple approach)
        combined_indices = sorted(
            set(semantic_top_k) | set(bm25_top_k),
            key=lambda i: semantic_scores[i],
            reverse=True,
        )
        # 4. Optional reranking step (requires a rerank() method; see below)
        if self.use_reranking:
            reranked_results = self.rerank(query, combined_indices)
            return reranked_results[:top_k]
        return [self.contextual_chunks[i] for i in combined_indices[:top_k]]

# Usage example
def setup_contextual_retrieval(documents, anthropic_key):
    retriever = ContextualRetriever(anthropic_key)
    # Create contextual embeddings (one-time cost: ~$1.02/million tokens)
    contextual_chunks, embeddings = retriever.create_contextual_embeddings(documents)
    retriever.embeddings = embeddings
    retriever.contextual_chunks = contextual_chunks
    # Set up a BM25 index over the same contextualized chunks
    retriever.bm25 = retriever.contextual_bm25_preprocessing(contextual_chunks)
    return retriever
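The search method above leaves rerank undefined. Anthropic's write-up pairs contextual retrieval with a reranking step for its best results; as one possible illustration, here is a hedged sketch using a cross-encoder from sentence-transformers. The model name and the idea of attaching the method after the fact are my assumptions, not part of Anthropic's system.
Python
from sentence_transformers import CrossEncoder

def rerank(self, query, candidate_indices):
    """Score (query, chunk) pairs with a cross-encoder and sort best-first."""
    # In production you would load the cross-encoder once in __init__,
    # not on every query
    cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
    pairs = [(query, self.contextual_chunks[i]) for i in candidate_indices]
    scores = cross_encoder.predict(pairs)
    ranked = sorted(zip(candidate_indices, scores), key=lambda x: x[1], reverse=True)
    return [self.contextual_chunks[i] for i, _ in ranked]

# Attach to the class so search() can call it when use_reranking=True
ContextualRetriever.rerank = rerank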
To see whether the extra preprocessing pays off, compare recall against your existing baseline pipeline:
Python
# Compare traditional vs contextual retrieval
# TraditionalRAG and calculate_recall_at_k are placeholders for your existing
# baseline retriever and a standard recall@k metric
def compare_retrieval_performance(queries, ground_truth, documents, api_key):
    traditional_retriever = TraditionalRAG()
    contextual_retriever = setup_contextual_retrieval(documents, api_key)
    traditional_scores = []
    contextual_scores = []
    for query, expected_docs in zip(queries, ground_truth):
        # Traditional retrieval
        trad_results = traditional_retriever.search(query, top_k=20)
        trad_recall = calculate_recall_at_k(trad_results, expected_docs, k=20)
        traditional_scores.append(trad_recall)
        # Contextual retrieval
        ctx_results = contextual_retriever.search(query, top_k=20)
        ctx_recall = calculate_recall_at_k(ctx_results, expected_docs, k=20)
        contextual_scores.append(ctx_recall)
    print(f"Traditional failure rate: {1 - np.mean(traditional_scores):.3f}")
    print(f"Contextual failure rate: {1 - np.mean(contextual_scores):.3f}")
    print(f"Improvement: {(np.mean(contextual_scores) - np.mean(traditional_scores)) / np.mean(traditional_scores) * 100:.1f}%")
Here's how they compare:
Feature | RARE | Contextual Retrieval (Anthropic)
--- | --- | ---
Primary Focus | Improve reasoning after retrieval | Improve retrieval before reasoning
Model Training Required? | Yes (fine-tuning) | No
Effort Concentration | Fine-tuning + curation | Preprocessing + indexing
Best For | Deep reasoning, structured domains | High-recall, broad-access systems
Infrastructure Load | Higher at inference | Higher at indexing time
Performance Gains | Up to 20% on reasoning tasks | Up to 67% reduction in failed retrievals
The two approaches aren't mutually exclusive. You can use Contextual Retrieval to find the right documents and a RARE-trained model to reason over them:
Python
class HybridRAGSystem:
    def __init__(self, anthropic_key, rare_model_path, documents):
        # Set up contextual retrieval for better document finding
        self.contextual_retriever = setup_contextual_retrieval(
            documents, anthropic_key
        )
        # Load the RARE-trained model for better reasoning
        # (load_rare_model is a placeholder for your model-loading code)
        self.rare_model = load_rare_model(rare_model_path)

    def answer_question(self, question):
        # 1. Use contextual retrieval for better document retrieval
        relevant_docs = self.contextual_retriever.search(question, top_k=5)
        # 2. Use the RARE model for better reasoning over the retrieved docs
        return self.rare_model.generate(
            f"Documents: {relevant_docs}\nQuestion: {question}"
        )

# Best of both worlds
hybrid_system = HybridRAGSystem(anthropic_key, "path/to/rare/model", documents)
answer = hybrid_system.answer_question(
    "What are the contraindications for ACE inhibitors in diabetic patients?"
)
In practice, we're seeing a growing trend toward hybrid pipelines like this one, pairing better retrieval with better reasoning.
RAG's strength has always been in its modularity. But that also means your system is only as good as its weakest component. Poor retrieval and weak reasoning both undermine user trust—and drive teams back to traditional search or brittle rule-based systems.
These approaches give us new tools to build smarter, leaner, and more interpretable AI systems. They shift the optimization conversation from "how big is your model?" to "how intelligent is your full pipeline?"
Contextual Retrieval offers immediate improvements with minimal infrastructure changes, making it a good fit for teams that need better retrieval performance today. RARE provides a path to more sophisticated reasoning capabilities, ideal for teams building domain-specific AI that must handle complex, nuanced scenarios.
Both represent the maturation of RAG from a simple "search + generate" pattern into sophisticated reasoning architectures that can compete with much larger models.
Want to see how we apply these approaches in production?
We're actively experimenting with both RARE-style fine-tuning and contextual embedding pipelines inside live client systems.
If you're facing retrieval headaches, or trying to figure out whether your RAG stack is even salvageable, let's connect.