Retrieval-Augmented Generation (RAG) has become the go-to architecture for enterprise AI systems that need accuracy, adaptability, and traceability. But if you've spent time tuning a RAG pipeline in production, you know the problem: retrieval often fails, and even when it succeeds, the model doesn't always reason effectively over the context it's given.
Two recent approaches offer very different paths forward. One focuses on teaching the model to reason better. The other says: leave the model alone—just fix the data pipeline.
Let's walk through both.
The RARE approach, short for Retrieval-Augmented Reasoning Modeling, was introduced in a recent research paper. Its premise is simple but powerful: instead of pushing your model to memorize domain knowledge, externalize that knowledge to retrievable sources and train the model to reason over the evidence it retrieves.
How RARE Works:
Knowledge Externalization: Domain knowledge is stored in external retrievable databases rather than in model parameters.
Reasoning Internalization: The model is fine-tuned on curated datasets that emphasize contextualized reasoning over memorization.
Contextual Integration: During training, retrieved knowledge is injected into prompts, transforming the learning objective from rote memorization to knowledge application.
Simple Example:
Traditional Approach: Train a medical model to memorize "Hypertension drugs include ACE inhibitors, beta-blockers..."
RARE Approach: Train the model to reason: "Given retrieved drug information [external database], this patient with hypertension and diabetes would benefit from ACE inhibitors because they provide cardiovascular protection in diabetics, as evidenced by multiple retrieved studies showing..."
Here's how to implement RARE training:
Python
# 1. Generate training data with retrieved context
def create_rare_training_data(question, retrieved_docs, teacher_model):
    """Distill a reasoning chain from a teacher model (e.g., a QwQ-32B wrapper)."""
    prompt = f"""
You are a medical expert. Use the retrieved documents to answer the question.
Think step-by-step and show your reasoning.
# Retrieved Documents
{retrieved_docs}
# Question
{question}
Format: <think>reasoning</think><answer>final_answer</answer>
"""
    # Generate the reasoning chain with the teacher model
    response = teacher_model.generate(prompt, max_retries=8)
    return {
        "input": f"Documents: {retrieved_docs}\nQuestion: {question}",
        "output": response,
    }

# 2. Fine-tune with contextualized reasoning
from transformers import Trainer, TrainingArguments

def train_rare_model(base_model, training_data):
    training_args = TrainingArguments(
        output_dir="./rare-model",
        num_train_epochs=5,
        learning_rate=1e-5,
        per_device_train_batch_size=8,
        warmup_ratio=0.05,
        logging_steps=100,
    )
    trainer = Trainer(
        model=base_model,
        args=training_args,
        train_dataset=training_data,
        # Custom collator (defined elsewhere) that masks the retrieved-document
        # tokens so the loss focuses on reasoning, not memorization
        data_collator=ContextualReasoningCollator(),
    )
    trainer.train()
    return trainer.model

# 3. Inference with retrieved knowledge
def rare_inference(model, question, retriever):
    # Retrieve relevant documents (any retriever exposing .search() works,
    # e.g., the ContextualRetriever defined later in this post)
    retrieved_docs = retriever.search(question, top_k=3)
    prompt = f"""
Use the retrieved documents to answer the question with step-by-step reasoning.
Documents: {retrieved_docs}
Question: {question}
"""
    return model.generate(prompt)
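One detail the sketch above glosses over is data preparation: Hugging Face's Trainer expects tokenized examples, not the raw strings returned by create_rare_training_data. Here is a minimal, hypothetical bridge between the two functions; the tokenizer name and max_length are assumptions for illustration, not values taken from the RARE paper.
Python
from datasets import Dataset
from transformers import AutoTokenizer

def prepare_rare_dataset(examples, tokenizer_name="Qwen/Qwen2.5-7B-Instruct"):
    """Turn create_rare_training_data() dicts into a tokenized HF Dataset."""
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)

    def tokenize(example):
        # Concatenate the retrieved context + question with the teacher's reasoning chain
        text = example["input"] + "\n" + example["output"]
        tokens = tokenizer(text, truncation=True, max_length=4096)
        # For plain causal-LM fine-tuning, labels mirror input_ids; a collator like
        # ContextualReasoningCollator would instead mask the document tokens
        tokens["labels"] = tokens["input_ids"].copy()
        return tokens

    return Dataset.from_list(examples).map(tokenize, remove_columns=["input", "output"])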
Anthropic offers a completely different take with its Contextual Retrieval approach. It skips model training altogether and focuses on the retrieval pipeline: specifically, how you prepare and embed the data before it's stored.
Simple Example:
Traditional Chunk: "Revenue grew by 3% over the previous quarter."
Contextualized Chunk: "This chunk is from an SEC filing on ACME Corp's Q2 2023 performance; the previous quarter's revenue was $314 million. Revenue grew by 3% over the previous quarter."
Here's how to implement Contextual Retrieval:
Python
import string

import anthropic
import numpy as np
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer

class ContextualRetriever:
    def __init__(self, anthropic_api_key, use_reranking=False):
        self.client = anthropic.Anthropic(api_key=anthropic_api_key)
        self.embedder = SentenceTransformer('all-MiniLM-L6-v2')
        self.use_reranking = use_reranking

    def split_document(self, document, chunk_size=400):
        """Naive whitespace chunking; swap in your preferred splitter."""
        words = document.split()
        return [
            " ".join(words[i:i + chunk_size])
            for i in range(0, len(words), chunk_size)
        ]

    def contextualize_chunk(self, chunk, document):
        """Generate contextual information for a chunk using Claude."""
        prompt = f"""
<document>
{document}
</document>
Here is the chunk we want to situate within the whole document:
<chunk>
{chunk}
</chunk>
Please give a short succinct context to situate this chunk within the overall
document for the purposes of improving search retrieval of the chunk.
Answer only with the succinct context and nothing else.
"""
        response = self.client.messages.create(
            model="claude-3-haiku-20240307",
            max_tokens=100,
            messages=[{"role": "user", "content": prompt}],
        )
        return response.content[0].text

    def create_contextual_embeddings(self, documents):
        """Create contextual embeddings for all chunks."""
        contextual_chunks = []
        embeddings = []
        for doc in documents:
            # Split document into chunks
            chunks = self.split_document(doc, chunk_size=400)
            for chunk in chunks:
                # Generate context for the chunk and prepend it
                context = self.contextualize_chunk(chunk, doc)
                contextualized_chunk = f"{context}. {chunk}"
                contextual_chunks.append(contextualized_chunk)
                # Embed the contextualized chunk
                embeddings.append(self.embedder.encode(contextualized_chunk))
        return contextual_chunks, np.array(embeddings)

    def contextual_bm25_preprocessing(self, contextual_chunks):
        """Prepare contextualized chunks for BM25 indexing."""
        # Simple tokenization: lowercase, strip punctuation, split on whitespace
        tokenized_chunks = []
        for chunk in contextual_chunks:
            tokens = chunk.lower().translate(
                str.maketrans('', '', string.punctuation)
            ).split()
            tokenized_chunks.append(tokens)
        # Create BM25 index
        return BM25Okapi(tokenized_chunks)

    def search(self, query, top_k=5):
        """Hybrid search: embeddings + BM25 + optional reranking."""
        # 1. Embedding-based search
        query_embedding = self.embedder.encode(query)
        semantic_scores = np.dot(self.embeddings, query_embedding)
        semantic_top_k = np.argsort(semantic_scores)[-top_k * 3:]
        # 2. BM25 search
        query_tokens = query.lower().split()
        bm25_scores = self.bm25.get_scores(query_tokens)
        bm25_top_k = np.argsort(bm25_scores)[-top_k * 3:]
        # 3. Combine candidates, ordered by semantic score (simple approach)
        combined_indices = sorted(
            set(semantic_top_k) | set(bm25_top_k),
            key=lambda i: semantic_scores[i],
            reverse=True,
        )
        # 4. Optional reranking step (requires a rerank() method; see below)
        if self.use_reranking:
            reranked_results = self.rerank(query, combined_indices)
            return reranked_results[:top_k]
        return [self.contextual_chunks[i] for i in combined_indices[:top_k]]

# Usage example
def setup_contextual_retrieval(documents, anthropic_key):
    retriever = ContextualRetriever(anthropic_key)
    # Create contextual embeddings (one-time cost: ~$1.02/million tokens)
    contextual_chunks, embeddings = retriever.create_contextual_embeddings(documents)
    retriever.embeddings = embeddings
    retriever.contextual_chunks = contextual_chunks
    # Set up a BM25 index over the same contextualized chunks
    retriever.bm25 = retriever.contextual_bm25_preprocessing(contextual_chunks)
    return retriever
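The search method above leaves rerank undefined. Anthropic's write-up pairs contextual retrieval with a reranking step for its best results; as one possible illustration, here is a hedged sketch using a cross-encoder from sentence-transformers. The model name and the idea of attaching the method after the fact are my assumptions, not part of Anthropic's system.
Python
from sentence_transformers import CrossEncoder

def rerank(self, query, candidate_indices):
    """Score (query, chunk) pairs with a cross-encoder and sort best-first."""
    # In production you would load the cross-encoder once in __init__,
    # not on every query
    cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
    pairs = [(query, self.contextual_chunks[i]) for i in candidate_indices]
    scores = cross_encoder.predict(pairs)
    ranked = sorted(zip(candidate_indices, scores), key=lambda x: x[1], reverse=True)
    return [self.contextual_chunks[i] for i, _ in ranked]

# Attach to the class so search() can call it when use_reranking=True
ContextualRetriever.rerank = rerank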
To see whether the extra preprocessing pays off, compare recall against your existing baseline pipeline:
Python
# Compare traditional vs contextual retrieval
# TraditionalRAG and calculate_recall_at_k are placeholders for your existing
# baseline retriever and a standard recall@k metric
def compare_retrieval_performance(queries, ground_truth, documents, api_key):
    traditional_retriever = TraditionalRAG()
    contextual_retriever = setup_contextual_retrieval(documents, api_key)
    traditional_scores = []
    contextual_scores = []
    for query, expected_docs in zip(queries, ground_truth):
        # Traditional retrieval
        trad_results = traditional_retriever.search(query, top_k=20)
        trad_recall = calculate_recall_at_k(trad_results, expected_docs, k=20)
        traditional_scores.append(trad_recall)
        # Contextual retrieval
        ctx_results = contextual_retriever.search(query, top_k=20)
        ctx_recall = calculate_recall_at_k(ctx_results, expected_docs, k=20)
        contextual_scores.append(ctx_recall)
    print(f"Traditional failure rate: {1 - np.mean(traditional_scores):.3f}")
    print(f"Contextual failure rate: {1 - np.mean(contextual_scores):.3f}")
    print(f"Improvement: {(np.mean(contextual_scores) - np.mean(traditional_scores)) / np.mean(traditional_scores) * 100:.1f}%")
Here's how they compare:
Feature | RARE | Contextual Retrieval (Anthropic)
--- | --- | ---
Primary Focus | Improve reasoning after retrieval | Improve retrieval before reasoning
Model Training Required? | Yes (fine-tuning) | No
Effort Concentration | Fine-tuning + curation | Preprocessing + indexing
Best For | Deep reasoning, structured domains | High-recall, broad-access systems
Infrastructure Load | Higher at inference | Higher at indexing time
Performance Gains | Up to 20% on reasoning tasks | Up to 67% reduction in failed retrievals
The two approaches aren't mutually exclusive. You can use Contextual Retrieval to find the right documents and a RARE-trained model to reason over them:
Python
class HybridRAGSystem:
    def __init__(self, anthropic_key, rare_model_path, documents):
        # Set up contextual retrieval for better document finding
        self.contextual_retriever = setup_contextual_retrieval(
            documents, anthropic_key
        )
        # Load the RARE-trained model for better reasoning
        # (load_rare_model is a placeholder for your model-loading code)
        self.rare_model = load_rare_model(rare_model_path)

    def answer_question(self, question):
        # 1. Use contextual retrieval for better document retrieval
        relevant_docs = self.contextual_retriever.search(question, top_k=5)
        # 2. Use the RARE model for better reasoning over the retrieved docs
        return self.rare_model.generate(
            f"Documents: {relevant_docs}\nQuestion: {question}"
        )

# Best of both worlds
hybrid_system = HybridRAGSystem(anthropic_key, "path/to/rare/model", documents)
answer = hybrid_system.answer_question(
    "What are the contraindications for ACE inhibitors in diabetic patients?"
)
In practice, we're seeing a growing trend toward hybrid pipelines like this one, pairing better retrieval with better reasoning.
RAG's strength has always been in its modularity. But that also means your system is only as good as its weakest component. Poor retrieval and weak reasoning both undermine user trust—and drive teams back to traditional search or brittle rule-based systems.
These approaches give us new tools to build smarter, leaner, and more interpretable AI systems. They shift the optimization conversation from "how big is your model?" to "how intelligent is your full pipeline?"
Contextual Retrieval offers immediate improvements with minimal infrastructure changes, making it a good fit for teams that need better retrieval performance today. RARE provides a path to more sophisticated reasoning capabilities, ideal for teams building domain-specific AI that must handle complex, nuanced scenarios.
Both represent the maturation of RAG from a simple "search + generate" pattern into sophisticated reasoning architectures that can compete with much larger models.
Want to see how we apply these approaches in production?
We're actively experimenting with both RARE-style fine-tuning and contextual embedding pipelines inside live client systems.
If you're facing retrieval headaches, or trying to figure out whether your RAG stack is even salvageable, let's connect.