FLARE / Active RAG
FLARE (Forward-Looking Active Retrieval-Augmented Generation), introduced by Jiang et al. (2023), is a technique that enhances Large Language Models (LLMs) by actively retrieving information during the generation process, rather than relying on a single retrieval step before generating an answer.
Most traditional RAG models retrieve documents once before generating a response, which works well for short-form answers but fails in long-form content generation where new information may be required along the way.
FLARE solves this by iteratively predicting future content, checking for low-confidence areas, retrieving relevant information, and refining the response dynamically.
Key Features of FLARE
- Retrieves information multiple times during generation
- Only retrieves when necessary (avoids irrelevant or excessive retrieval)
- Uses forward-looking generation to anticipate missing information
- Refines responses dynamically based on retrieved knowledge
Example of FLARE in Action
Task: Generate a summary of Joe Biden.
1. Initial retrieval: The model fetches some background on Biden.
2. Generation begins: The model generates a sentence:
   - "Joe Biden attended the University of Pennsylvania, where he earned a law degree."
3. Confidence check: The model detects low confidence in the statement.
4. New retrieval triggered: The system fetches more information on Biden's education.
5. Sentence regeneration: The model corrects the sentence to:
   - "Joe Biden graduated from the University of Delaware in 1965 with a Bachelor of Arts in history and political science."
Key Difference: Unlike regular RAG models that retrieve once at the start, FLARE refines its knowledge while generating, just like a human would when writing a research paper.
How FLARE Differs from Existing Techniques
| Feature | Traditional RAG | Multi-Hop RAG | FLARE |
|---|---|---|---|
| Retrieval timing | Only before generation | Fixed intervals (e.g., every few tokens) | On-demand, based on need |
| Decision to retrieve | Always retrieves | Passive retrieval at fixed points | Active retrieval only when needed |
| Query generation | Uses input query | Uses previous context | Looks ahead to predict missing info |
| Flexibility | One-shot retrieval | Limited adaptability | Fully dynamic, adjusts retrieval needs |
Unlike fixed-interval retrieval methods, which retrieve every n tokens regardless of need, FLARE retrieves only when necessary. This also helps it avoid pulling in irrelevant content, a failure mode of other RAG methods.
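The difference is easy to see in code. Here is a minimal sketch of the two retrieval triggers (the helper names, the interval `n`, and the threshold `theta` are illustrative, not taken from the official implementation):

```python
# Fixed-interval retrieval: fire every n generation steps,
# whether or not the model actually needs new information.
def fixed_interval_should_retrieve(step: int, n: int = 16) -> bool:
    return step % n == 0

# FLARE-style active retrieval: fire only when the model's confidence
# in its tentative next sentence drops below a threshold theta.
def flare_should_retrieve(token_probs: list[float], theta: float = 0.8) -> bool:
    return min(token_probs) < theta
```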
How FLARE Works
- Step 1: Predict the Next Sentence: FLARE generates a temporary next sentence based on the input.
Example: "Joe Biden attended the University of Pennsylvania."
- Step 2: Check for Low-Confidence Tokens: The model checks its confidence level for each generated token. If confidence is low, the model is unsure and might need more information (this check appears in the sketch after this list).
- Step 3: Retrieve Additional Information (if needed): If the model detects knowledge gaps, it retrieves relevant documents.
- Step 4: Formulate the Retrieval Query: FLARE builds the query with one of two strategies:
  - Implicit Query: Masks uncertain tokens (e.g., "Joe Biden attended [MASK]")
  - Explicit Query: Generates a natural language question ("Which university did Joe Biden attend?")
- Step 5: Regenerate the Sentence with Retrieved Information: The model updates the sentence with accurate knowledge.
Example: "Joe Biden graduated from the University of Delaware in 1965."
- Step 6: Repeat Until Completion: The model iteratively repeats this process, ensuring accuracy throughout long-form text generation.
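Putting the steps together, the control flow can be sketched as follows. This is a simplified illustration rather than the official implementation: `generate_sentence` and `retrieve` are stand-ins for an LLM call that exposes per-token probabilities and for a document retriever, and `THETA` plays the role of the paper's confidence threshold.

```python
import random

THETA = 0.8  # confidence threshold below which retrieval is triggered

def retrieve(query: str) -> str:
    """Placeholder retriever: a real system would query a search index."""
    return f"<documents relevant to: {query}>"

def generate_sentence(question, context, so_far):
    """Placeholder LLM call returning (tokens, per-token probabilities).
    A real implementation would condition on the question, the retrieved
    context, and the sentences generated so far."""
    if len(so_far) >= 3:  # pretend generation finishes after three sentences
        return None, None
    tokens = ["Joe", "Biden", "graduated", "from", "Delaware"]
    probs = [round(random.uniform(0.4, 1.0), 2) for _ in tokens]
    return tokens, probs

def implicit_query(tokens, probs, theta=THETA):
    """Step 4 (implicit): mask low-confidence tokens so the retriever
    matches on the parts of the sentence the model is confident about."""
    return " ".join("[MASK]" if p < theta else t for t, p in zip(tokens, probs))

def flare_generate(question: str, max_sentences: int = 20) -> str:
    """Steps 1-6: generate, check confidence, retrieve on demand, regenerate."""
    context = retrieve(question)  # initial retrieval on the input
    answer = []
    for _ in range(max_sentences):
        tokens, probs = generate_sentence(question, context, answer)  # Step 1
        if tokens is None:  # generation finished
            break
        if min(probs) < THETA:  # Step 2: confidence check
            query = implicit_query(tokens, probs)  # Step 4: build the query
            context = retrieve(query)  # Step 3: retrieve documents
            tokens, probs = generate_sentence(question, context, answer)  # Step 5
        answer.append(" ".join(tokens))  # Step 6: accept the sentence and continue
    return " ".join(answer)

print(flare_generate("Generate a summary of Joe Biden."))
```

The explicit-query variant would replace `implicit_query` with a prompt asking the LLM to turn the uncertain span into a natural-language question (e.g., "Which university did Joe Biden attend?") and retrieve with that question instead.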
The FLARE code and datasets are available on GitHub.
Results of FLARE
- FLARE outperforms baseline retrieval-augmented approaches across four long-form, knowledge-intensive generation tasks: multihop QA, commonsense reasoning, long-form QA, and open-domain summarization.
- It delivers clear gains in factual accuracy over single-retrieval RAG methods.
- It retrieves knowledge on demand, avoiding unnecessary or irrelevant retrieval.
Conclusion
FLARE is a major step forward for Retrieval-Augmented Generation (RAG), making AI-generated text more accurate, reliable, and context-aware. By retrieving information only when needed and actively refining responses, FLARE achieves better factual accuracy and adaptability than previous methods.
Valeriia Kuka
Valeriia Kuka, Head of Content at Learn Prompting, is passionate about making AI and ML accessible. Valeriia previously grew a 60K+ follower AI-focused social media account, earning reposts from Stanford NLP, Amazon Research, Hugging Face, and AI researchers. She has also worked with AI/ML newsletters and global communities with 100K+ members and authored clear and concise explainers and historical articles.
Footnotes
1. Jiang, Z., Xu, F. F., Gao, L., Sun, Z., Liu, Q., Dwivedi-Yu, J., Yang, Y., Callan, J., & Neubig, G. (2023). Active Retrieval Augmented Generation. https://arxiv.org/abs/2305.06983