FLARE / Active RAG
FLARE (Forward-Looking Active Retrieval-Augmented Generation), introduced by Jiang et al. (2023), is a technique that enhances Large Language Models (LLMs) by actively retrieving information during the generation process, rather than relying on a single retrieval step before generating an answer.
Most traditional RAG models retrieve documents once before generating a response, which works well for short-form answers but fails in long-form content generation where new information may be required along the way.
FLARE solves this by iteratively predicting future content, checking for low-confidence areas, retrieving relevant information, and refining the response dynamically.
Key Features of FLARE
- Retrieves information multiple times during generation
- Only retrieves when necessary (avoids irrelevant or excessive retrieval)
- Uses forward-looking generation to anticipate missing information
- Refines responses dynamically based on retrieved knowledge
Example of FLARE in Action
Task: Generate a summary of Joe Biden.
1. Initial retrieval: The model fetches some background on Biden.
2. Generation begins: The model generates a sentence:
   - "Joe Biden attended the University of Pennsylvania, where he earned a law degree."
3. Confidence check: The model detects low confidence in the statement.
4. New retrieval triggered: The system fetches more information on Biden's education.
5. Sentence regeneration: The model corrects the sentence to:
   - "Joe Biden graduated from the University of Delaware in 1965 with a Bachelor of Arts in history and political science."
Key Difference: Unlike regular RAG models that retrieve once at the start, FLARE refines its knowledge while generating, just like a human would when writing a research paper.
How FLARE Differs from Existing Techniques
| Feature | Traditional RAG | Multi-Hop RAG | FLARE |
|---|---|---|---|
| Retrieval timing | Only before generation | Fixed intervals (e.g., every few tokens) | On-demand, based on need |
| Decision to retrieve | Always retrieves | Passive retrieval at fixed points | Active retrieval only when needed |
| Query generation | Uses input query | Uses previous context | Looks ahead to predict missing info |
| Flexibility | One-shot retrieval | Limited adaptability | Fully dynamic, adjusts retrieval needs |
Unlike fixed-interval retrieval methods, which retrieve every n tokens regardless of need, FLARE retrieves only when necessary. This also helps it avoid pulling in irrelevant content, a failure mode of other RAG methods.
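The difference is easy to see in code. Here is a minimal sketch of the two retrieval triggers (the helper names, the interval `n`, and the threshold `theta` are illustrative, not taken from the official implementation):

```python
# Fixed-interval retrieval: fire every n generation steps,
# whether or not the model actually needs new information.
def fixed_interval_should_retrieve(step: int, n: int = 16) -> bool:
    return step % n == 0

# FLARE-style active retrieval: fire only when the model's confidence
# in its tentative next sentence drops below a threshold theta.
def flare_should_retrieve(token_probs: list[float], theta: float = 0.8) -> bool:
    return min(token_probs) < theta
```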
How FLARE Works
- Step 1: Predict the Next Sentence: FLARE generates a temporary next sentence based on the input.
Example: "Joe Biden attended the University of Pennsylvania."
- Step 2: Check for Low-Confidence Tokens: The model checks its confidence level for each generated token. If confidence is low, the model is unsure and might need more information (this check appears in the sketch after this list).
- Step 3: Retrieve Additional Information (if needed): If the model detects knowledge gaps, it retrieves relevant documents.
- Step 4: Formulate the Retrieval Query: FLARE builds the query with one of two strategies:
  - Implicit Query: Masks uncertain tokens (e.g., "Joe Biden attended [MASK]")
  - Explicit Query: Generates a natural language question ("Which university did Joe Biden attend?")
- Step 5: Regenerate the Sentence with Retrieved Information: The model updates the sentence with accurate knowledge.
Example: "Joe Biden graduated from the University of Delaware in 1965."
- Step 6: Repeat Until Completion: The model iteratively repeats this process, ensuring accuracy throughout long-form text generation.
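Putting the steps together, the control flow can be sketched as follows. This is a simplified illustration rather than the official implementation: `generate_sentence` and `retrieve` are stand-ins for an LLM call that exposes per-token probabilities and for a document retriever, and `THETA` plays the role of the paper's confidence threshold.

```python
import random

THETA = 0.8  # confidence threshold below which retrieval is triggered

def retrieve(query: str) -> str:
    """Placeholder retriever: a real system would query a search index."""
    return f"<documents relevant to: {query}>"

def generate_sentence(question, context, so_far):
    """Placeholder LLM call returning (tokens, per-token probabilities).
    A real implementation would condition on the question, the retrieved
    context, and the sentences generated so far."""
    if len(so_far) >= 3:  # pretend generation finishes after three sentences
        return None, None
    tokens = ["Joe", "Biden", "graduated", "from", "Delaware"]
    probs = [round(random.uniform(0.4, 1.0), 2) for _ in tokens]
    return tokens, probs

def implicit_query(tokens, probs, theta=THETA):
    """Step 4 (implicit): mask low-confidence tokens so the retriever
    matches on the parts of the sentence the model is confident about."""
    return " ".join("[MASK]" if p < theta else t for t, p in zip(tokens, probs))

def flare_generate(question: str, max_sentences: int = 20) -> str:
    """Steps 1-6: generate, check confidence, retrieve on demand, regenerate."""
    context = retrieve(question)  # initial retrieval on the input
    answer = []
    for _ in range(max_sentences):
        tokens, probs = generate_sentence(question, context, answer)  # Step 1
        if tokens is None:  # generation finished
            break
        if min(probs) < THETA:  # Step 2: confidence check
            query = implicit_query(tokens, probs)  # Step 4: build the query
            context = retrieve(query)  # Step 3: retrieve documents
            tokens, probs = generate_sentence(question, context, answer)  # Step 5
        answer.append(" ".join(tokens))  # Step 6: accept the sentence and continue
    return " ".join(answer)

print(flare_generate("Generate a summary of Joe Biden."))
```

The explicit-query variant would replace `implicit_query` with a prompt asking the LLM to turn the uncertain span into a natural-language question (e.g., "Which university did Joe Biden attend?") and retrieve with that question instead.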
The FLARE code and datasets are available on GitHub.
Results of FLARE
- FLARE outperforms baseline retrieval-augmented approaches across four long-form, knowledge-intensive generation tasks: multihop QA, commonsense reasoning, long-form QA, and open-domain summarization.
- It delivers clear gains in factual accuracy over single-retrieval RAG methods.
- It retrieves knowledge on demand, avoiding unnecessary or irrelevant retrieval.
Conclusion
FLARE is a major step forward for Retrieval-Augmented Generation (RAG), making AI-generated text more accurate, reliable, and context-aware. By retrieving information only when needed and actively refining responses, FLARE achieves better factual accuracy and adaptability than previous methods.
Valeriia Kuka
Valeriia Kuka, Head of Content at Learn Prompting, is passionate about making AI and ML accessible. Valeriia previously grew a 60K+ follower AI-focused social media account, earning reposts from Stanford NLP, Amazon Research, Hugging Face, and AI researchers. She has also worked with AI/ML newsletters and global communities with 100K+ members and authored clear and concise explainers and historical articles.
Footnotes
1. Jiang, Z., Xu, F. F., Gao, L., Sun, Z., Liu, Q., Dwivedi-Yu, J., Yang, Y., Callan, J., & Neubig, G. (2023). Active Retrieval Augmented Generation. https://arxiv.org/abs/2305.06983