Speculative RAG
Speculative RAG is a novel Retrieval-Augmented Generation (RAG) framework designed to improve both the accuracy and efficiency of large language models (LLMs) when answering knowledge-intensive questions. Traditional RAG approaches suffer from long input contexts and high latency, as they require incorporating all retrieved documents into the prompt.
Speculative RAG overcomes these challenges by introducing a two-step process:
- Drafting Phase: A smaller, specialized model (the RAG Drafter) generates multiple answer drafts, each based on a subset of retrieved documents.
- Verification Phase: A larger, generalist model (the RAG Verifier) evaluates the drafts and selects the best one, reducing computational overhead.
This method not only speeds up response generation but also ensures higher quality answers by incorporating diverse perspectives from different document subsets.
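The sketch below outlines this draft-then-verify control flow in Python. The `retriever`, `drafter`, and `verifier` callables and the `diverse_subsets` helper are illustrative placeholders, not the paper's code; each step is described in detail further down.

```python
# Draft-then-verify outline. All helper names here are illustrative
# placeholders, not the paper's actual API.

def speculative_rag(question, retriever, drafter, verifier, num_subsets=5):
    docs = retriever(question)                         # Step 1: retrieve evidence
    subsets = diverse_subsets(docs, num_subsets)       # Step 2: one document per cluster, per subset
    drafts = [drafter(question, s) for s in subsets]   # Step 3: drafts can be generated in parallel
    scores = [verifier(question, d) for d in drafts]   # Step 4: score every draft
    best = max(range(len(drafts)), key=lambda i: scores[i])
    return drafts[best]                                # the highest-scoring draft is the final answer
```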
How Does Speculative RAG Differ from Existing Techniques?
| Method | Strengths | Weaknesses |
| --- | --- | --- |
| Standard RAG | Uses all retrieved documents, maximizing context coverage | Slow and inefficient due to long input sequences |
| Self-Reflective RAG | Uses instruction-tuning to critique retrieved content | Requires additional fine-tuning of the LLM |
| Corrective RAG | Uses an external retrieval evaluator to filter documents | Lacks advanced reasoning capabilities |
| Speculative RAG | Uses a smaller RAG Drafter for drafting and a generalist LM for verification | More complex pipeline, but faster and more accurate |
Unlike Self-Reflective RAG, Speculative RAG does not require expensive instruction-tuning of the generalist LM. Unlike Corrective RAG, it enhances reasoning ability by integrating multiple perspectives in parallel.
How Does Speculative RAG Work?
Step 1: Retrieve Documents
For a given query Q, retrieve a set of documents D from an external database.
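As a minimal stand-in for this step, the sketch below ranks a small corpus by bag-of-words cosine similarity. A real deployment would use a learned embedding model and a vector index, but the interface is the same: a query goes in, the top-k documents come out.

```python
from collections import Counter
import math

def embed(text):
    # Toy bag-of-words "embedding", for illustration only.
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(count * b[token] for token, count in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query, corpus, top_k=10):
    # Return the top_k documents in `corpus` most similar to `query`.
    q = embed(query)
    return sorted(corpus, key=lambda doc: cosine(q, embed(doc)), reverse=True)[:top_k]
```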
Step 2: Multi-Perspective Sampling
- Cluster retrieved documents based on content similarity.
- Sample one document per cluster to form diverse subsets of evidence (a sampling sketch follows this list).
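A minimal sketch of this sampling step, assuming document embeddings from any sentence encoder are already available and using k-means purely for illustration:

```python
import random
import numpy as np
from sklearn.cluster import KMeans

def sample_diverse_subsets(docs, doc_embeddings, num_clusters=3, num_subsets=5, seed=0):
    # Cluster documents by embedding, then build each evidence subset by
    # drawing one document from every cluster.
    rng = random.Random(seed)
    labels = KMeans(n_clusters=num_clusters, random_state=seed, n_init=10).fit_predict(
        np.asarray(doc_embeddings)
    )
    clusters = [[doc for doc, label in zip(docs, labels) if label == c] for c in range(num_clusters)]
    # One document per cluster, so each draft sees one "perspective" per topic.
    return [[rng.choice(cluster) for cluster in clusters] for _ in range(num_subsets)]
```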
Step 3: Generate Draft Answers
- Pass each document subset to a smaller, specialized RAG Drafter (a drafting sketch follows this list).
- The RAG Drafter produces:
  - Answer Draft (α): A possible response.
  - Rationale (β): A justification for the answer.
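A sketch of a single drafting call is below. Here `generate` stands in for however the smaller drafter model is invoked, and the prompt template and output parsing are illustrative rather than the paper's exact format.

```python
DRAFT_PROMPT = """Answer the question using only the evidence below.
Respond in the form:
Answer: <your answer>
Rationale: <why the evidence supports the answer>

Question: {question}

Evidence:
{evidence}
"""

def draft_answer(generate, question, subset):
    # Ask the drafter for an answer draft plus its rationale for one evidence subset.
    prompt = DRAFT_PROMPT.format(question=question, evidence="\n\n".join(subset))
    output = generate(prompt)
    answer_part, _, rationale = output.partition("Rationale:")
    answer = answer_part.replace("Answer:", "").strip()
    return answer, rationale.strip()
```

In the paper, the drafter's own generation probability for the draft is also kept and reused as the draft confidence score during verification (Step 4).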
Step 4: Evaluate Drafts with a Generalist LM
- The RAG Verifier (a larger, generalist LM) evaluates each draft using:
  - Self-Consistency Score: Is the answer draft, together with its rationale, consistent with the question?
  - Self-Reflection Score: Does the rationale properly support the answer?
  - Draft Confidence Score: The drafter's confidence that the generated answer is correct.
- The highest-scoring draft is selected as the final answer (a scoring sketch follows this list).
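A rough sketch of this selection step is below. `sequence_logprob` and `yes_logprob` are placeholders for querying the generalist LM's token probabilities, and the prompts and score definitions only approximate those in the paper.

```python
import math

REFLECTION_STATEMENT = "Do you think the rationale supports the answer? Answer Yes or No."

def select_best_draft(question, drafts, sequence_logprob, yes_logprob):
    # `drafts` holds (answer, rationale, draft_confidence) triples, where
    # draft_confidence comes from the drafter itself.
    best_draft, best_score = None, float("-inf")
    for answer, rationale, draft_confidence in drafts:
        # Self-consistency: how likely the verifier finds the draft + rationale given the question.
        self_consistency = sequence_logprob(question, f"{answer} {rationale}")
        # Self-reflection: probability of answering "Yes" to the reflection statement.
        self_reflection = yes_logprob(question, answer, rationale, REFLECTION_STATEMENT)
        # Combine the three scores multiplicatively (sum in log space).
        score = math.log(draft_confidence) + self_consistency + self_reflection
        if score > best_score:
            best_draft, best_score = (answer, rationale), score
    return best_draft
```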
Main Benefits of Speculative RAG
- Higher accuracy: Ensures evidence-backed answers by leveraging diverse document subsets.
- Lower latency: Parallel processing reduces computation time by up to 51%.
- No instruction-tuning for the generalist LM: Avoids expensive retraining.
- Handles diverse perspectives: Reduces the "lost-in-the-middle" problem of long-context retrieval.
Conclusion
Speculative RAG revolutionizes retrieval-augmented generation by introducing a two-step, draft-then-verify approach. It achieves state-of-the-art accuracy while significantly reducing latency, making it ideal for real-world applications requiring fast and reliable knowledge retrieval.
Valeriia Kuka
Valeriia Kuka, Head of Content at Learn Prompting, is passionate about making AI and ML accessible. Valeriia previously grew a 60K+ follower AI-focused social media account, earning reposts from Stanford NLP, Amazon Research, Hugging Face, and AI researchers. She has also worked with AI/ML newsletters and global communities with 100K+ members and authored clear and concise explainers and historical articles.
Footnotes
- Wang, Z., Wang, Z., Le, L., Zheng, H. S., Mishra, S., Perot, V., Zhang, Y., Mattapalli, A., Taly, A., Shang, J., Lee, C.-Y., & Pfister, T. (2024). Speculative RAG: Enhancing Retrieval Augmented Generation through Drafting. https://arxiv.org/abs/2407.08223