Speculative RAG
Speculative RAG is a novel Retrieval-Augmented Generation (RAG) framework designed to improve both the accuracy and efficiency of large language models (LLMs) when answering knowledge-intensive questions. Traditional RAG approaches suffer from long input contexts and high latency, as they require incorporating all retrieved documents into the prompt.
Speculative RAG overcomes these challenges by introducing a two-step process:
- Drafting Phase: A smaller, specialized model (the RAG Drafter) generates multiple answer drafts, each based on a subset of retrieved documents.
- Verification Phase: A larger, generalist model (the RAG Verifier) evaluates the drafts and selects the best one, reducing computational overhead.
This method not only speeds up response generation but also ensures higher quality answers by incorporating diverse perspectives from different document subsets.
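The sketch below outlines this draft-then-verify control flow in Python. The `retriever`, `drafter`, and `verifier` callables and the `diverse_subsets` helper are illustrative placeholders, not the paper's code; each step is described in detail further down.

```python
# Draft-then-verify outline. All helper names here are illustrative
# placeholders, not the paper's actual API.

def speculative_rag(question, retriever, drafter, verifier, num_subsets=5):
    docs = retriever(question)                         # Step 1: retrieve evidence
    subsets = diverse_subsets(docs, num_subsets)       # Step 2: one document per cluster, per subset
    drafts = [drafter(question, s) for s in subsets]   # Step 3: drafts can be generated in parallel
    scores = [verifier(question, d) for d in drafts]   # Step 4: score every draft
    best = max(range(len(drafts)), key=lambda i: scores[i])
    return drafts[best]                                # the highest-scoring draft is the final answer
```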
How Does Speculative RAG Differ from Existing Techniques?
| Method | Strengths | Weaknesses |
| --- | --- | --- |
| Standard RAG | Uses all retrieved documents, maximizing context coverage | Slow and inefficient due to long input sequences |
| Self-Reflective RAG | Uses instruction-tuning to critique retrieved content | Requires additional fine-tuning of the LLM |
| Corrective RAG | Uses an external retrieval evaluator to filter documents | Lacks advanced reasoning capabilities |
| Speculative RAG | Uses a smaller RAG Drafter for drafting and a generalist LM for verification | More complex pipeline, but faster and more accurate |
Unlike Self-Reflective RAG, Speculative RAG does not require expensive instruction-tuning of the generalist LM. Unlike Corrective RAG, it enhances reasoning ability by integrating multiple perspectives in parallel.
How Does Speculative RAG Work?
Step 1: Retrieve Documents
For a given query Q, retrieve a set of documents D from an external database.
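As a minimal stand-in for this step, the sketch below ranks a small corpus by bag-of-words cosine similarity. A real deployment would use a learned embedding model and a vector index, but the interface is the same: a query goes in, the top-k documents come out.

```python
from collections import Counter
import math

def embed(text):
    # Toy bag-of-words "embedding", for illustration only.
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(count * b[token] for token, count in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query, corpus, top_k=10):
    # Return the top_k documents in `corpus` most similar to `query`.
    q = embed(query)
    return sorted(corpus, key=lambda doc: cosine(q, embed(doc)), reverse=True)[:top_k]
```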
Step 2: Multi-Perspective Sampling
- Cluster retrieved documents based on content similarity.
- Sample one document per cluster to form diverse subsets of evidence (a sampling sketch follows this list).
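A minimal sketch of this sampling step, assuming document embeddings from any sentence encoder are already available and using k-means purely for illustration:

```python
import random
import numpy as np
from sklearn.cluster import KMeans

def sample_diverse_subsets(docs, doc_embeddings, num_clusters=3, num_subsets=5, seed=0):
    # Cluster documents by embedding, then build each evidence subset by
    # drawing one document from every cluster.
    rng = random.Random(seed)
    labels = KMeans(n_clusters=num_clusters, random_state=seed, n_init=10).fit_predict(
        np.asarray(doc_embeddings)
    )
    clusters = [[doc for doc, label in zip(docs, labels) if label == c] for c in range(num_clusters)]
    # One document per cluster, so each draft sees one "perspective" per topic.
    return [[rng.choice(cluster) for cluster in clusters] for _ in range(num_subsets)]
```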
Step 3: Generate Draft Answers
- Pass each document subset to a smaller, specialized RAG Drafter (a drafting sketch follows this list).
- The RAG Drafter produces:
  - Answer Draft (α): A possible response.
  - Rationale (β): A justification for the answer.
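A sketch of a single drafting call is below. Here `generate` stands in for however the smaller drafter model is invoked, and the prompt template and output parsing are illustrative rather than the paper's exact format.

```python
DRAFT_PROMPT = """Answer the question using only the evidence below.
Respond in the form:
Answer: <your answer>
Rationale: <why the evidence supports the answer>

Question: {question}

Evidence:
{evidence}
"""

def draft_answer(generate, question, subset):
    # Ask the drafter for an answer draft plus its rationale for one evidence subset.
    prompt = DRAFT_PROMPT.format(question=question, evidence="\n\n".join(subset))
    output = generate(prompt)
    answer_part, _, rationale = output.partition("Rationale:")
    answer = answer_part.replace("Answer:", "").strip()
    return answer, rationale.strip()
```

In the paper, the drafter's own generation probability for the draft is also kept and reused as the draft confidence score during verification (Step 4).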
Step 4: Evaluate Drafts with a Generalist LM
- The RAG Verifier (a larger, generalist LM) evaluates each draft using:
  - Self-Consistency Score: Is the answer draft, together with its rationale, consistent with the question?
  - Self-Reflection Score: Does the rationale properly support the answer?
  - Draft Confidence Score: The drafter's confidence that the generated answer is correct.
- The highest-scoring draft is selected as the final answer (a scoring sketch follows this list).
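A rough sketch of this selection step is below. `sequence_logprob` and `yes_logprob` are placeholders for querying the generalist LM's token probabilities, and the prompts and score definitions only approximate those in the paper.

```python
import math

REFLECTION_STATEMENT = "Do you think the rationale supports the answer? Answer Yes or No."

def select_best_draft(question, drafts, sequence_logprob, yes_logprob):
    # `drafts` holds (answer, rationale, draft_confidence) triples, where
    # draft_confidence comes from the drafter itself.
    best_draft, best_score = None, float("-inf")
    for answer, rationale, draft_confidence in drafts:
        # Self-consistency: how likely the verifier finds the draft + rationale given the question.
        self_consistency = sequence_logprob(question, f"{answer} {rationale}")
        # Self-reflection: probability of answering "Yes" to the reflection statement.
        self_reflection = yes_logprob(question, answer, rationale, REFLECTION_STATEMENT)
        # Combine the three scores multiplicatively (sum in log space).
        score = math.log(draft_confidence) + self_consistency + self_reflection
        if score > best_score:
            best_draft, best_score = (answer, rationale), score
    return best_draft
```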
Main Benefits of Speculative RAG
- Higher accuracy: Ensures evidence-backed answers by leveraging diverse document subsets.
- Lower latency: Parallel processing reduces computation time by up to 51%.
- No instruction-tuning for the generalist LM: Avoids expensive retraining.
- Handles diverse perspectives: Reduces the "lost-in-the-middle" problem of long-context retrieval.
Conclusion
Speculative RAG revolutionizes retrieval-augmented generation by introducing a two-step, draft-then-verify approach. It achieves state-of-the-art accuracy while significantly reducing latency, making it ideal for real-world applications requiring fast and reliable knowledge retrieval.
Valeriia Kuka
Valeriia Kuka, Head of Content at Learn Prompting, is passionate about making AI and ML accessible. Valeriia previously grew a 60K+ follower AI-focused social media account, earning reposts from Stanford NLP, Amazon Research, Hugging Face, and AI researchers. She has also worked with AI/ML newsletters and global communities with 100K+ members and authored clear and concise explainers and historical articles.
Footnotes
- Wang, Z., Wang, Z., Le, L., Zheng, H. S., Mishra, S., Perot, V., Zhang, Y., Mattapalli, A., Taly, A., Shang, J., Lee, C.-Y., & Pfister, T. (2024). Speculative RAG: Enhancing Retrieval Augmented Generation through Drafting. https://arxiv.org/abs/2407.08223