
FLARE / Active RAG

🟦 This article is rated medium
Reading Time: 3 minutes
Last updated on March 2, 2025

Valeriia Kuka

FLARE (Forward-Looking Active Retrieval-Augmented Generation) is a technique that enhances Large Language Models (LLMs) by actively retrieving information during the generation process, rather than relying on a single retrieval step before generating an answer.

Most traditional RAG models retrieve documents once before generating a response, which works well for short-form answers but fails in long-form content generation where new information may be required along the way.

FLARE solves this by iteratively predicting future content, checking for low-confidence areas, retrieving relevant information, and refining the response dynamically.

Key Features of FLARE

  • Retrieves information multiple times during generation
  • Only retrieves when necessary (avoids irrelevant or excessive retrieval)
  • Uses forward-looking generation to anticipate missing information
  • Refines responses dynamically based on retrieved knowledge

Example of FLARE in Action

Task: Generate a summary of Joe Biden.

1. Initial retrieval: The model fetches some background on Biden.
2. Generation begins: The model generates a sentence: "Joe Biden attended the University of Pennsylvania, where he earned a law degree."
3. Confidence check: The model detects low confidence in the statement.
4. New retrieval triggered: The system fetches more information on Biden's education.
5. Sentence regeneration: The model corrects the sentence to: "Joe Biden graduated from the University of Delaware in 1965 with a Bachelor of Arts in history and political science."

Key Difference: Unlike regular RAG models that retrieve once at the start, FLARE refines its knowledge while generating, just like a human would when writing a research paper.

How FLARE Differs from Existing Techniques

| Feature | Traditional RAG | Multi-Hop RAG | FLARE |
| --- | --- | --- | --- |
| Retrieval timing | Only before generation | Fixed intervals (e.g., every few tokens) | On-demand, based on need |
| Decision to retrieve | Always retrieves | Passive retrieval at fixed points | Active retrieval only when needed |
| Query generation | Uses input query | Uses previous context | Looks ahead to predict missing info |
| Flexibility | One-shot retrieval | Limited adaptability | Fully dynamic, adjusts retrieval needs |

Unlike fixed-interval retrieval methods (which retrieve every *n* tokens), FLARE retrieves only when necessary. This also helps it avoid pulling in irrelevant content, a common failure mode of other RAG methods.
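The contrast between fixed-interval and confidence-triggered retrieval can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the threshold value and the token-probability interface are assumptions.

```python
def should_retrieve_fixed(step: int, interval: int = 16) -> bool:
    """Fixed-interval retrieval: fires every `interval` tokens, regardless of need."""
    return step > 0 and step % interval == 0


def should_retrieve_flare(token_probs: list[float], threshold: float = 0.5) -> bool:
    """FLARE-style active retrieval: fires only when some token in the
    tentative next sentence falls below the confidence threshold."""
    return any(p < threshold for p in token_probs)


# A confident sentence triggers no retrieval; a single uncertain token does.
print(should_retrieve_flare([0.95, 0.90, 0.88]))  # False
print(should_retrieve_flare([0.95, 0.30, 0.88]))  # True
```

The fixed-interval trigger fires blindly on a schedule, while the FLARE-style trigger only fires when the model signals uncertainty, which is what keeps retrieval calls to a minimum.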

How FLARE Works

  • Step 1: Predict the Next Sentence: FLARE generates a temporary next sentence based on the input.

Example: "Joe Biden attended the University of Pennsylvania."

  • Step 2: Check for Low-Confidence Tokens: The model checks its confidence level for each generated token. If confidence is low, it means the model is unsure and might need more information.

  • Step 3: Retrieve Additional Information (if needed): If the model detects knowledge gaps, it retrieves relevant documents.

  • Step 4: Formulate the Retrieval Query: FLARE supports two query-formulation strategies:

    • Implicit Query: Masks uncertain tokens (e.g., "Joe Biden attended [MASK]")
    • Explicit Query: Generates a natural language question ("Which university did Joe Biden attend?")
  • Step 5: Regenerate the Sentence with Retrieved Information: The model updates the sentence with accurate knowledge.

Example: "Joe Biden graduated from the University of Delaware in 1965."

  • Step 6: Repeat Until Completion: The model iteratively repeats this process, ensuring accuracy throughout long-form text generation.
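The six steps above can be put together into a single loop, sketched roughly below. The `lm` and `retriever` objects, the `[MASK]` token, and the confidence threshold are placeholders for illustration; they are not FLARE's actual interfaces.

```python
MASK = "[MASK]"
THRESHOLD = 0.5  # confidence cutoff; in practice this would be tuned per task


def make_implicit_query(tokens, probs, threshold=THRESHOLD):
    """Implicit query formulation: mask out low-confidence tokens."""
    return " ".join(MASK if p < threshold else t for t, p in zip(tokens, probs))


def flare_generate(lm, retriever, prompt, max_sentences=10):
    """Sketch of the FLARE loop: predict a tentative sentence, check its
    confidence, retrieve on low confidence, and regenerate with evidence."""
    answer = []
    context = prompt
    for _ in range(max_sentences):
        tokens, probs = lm.generate_sentence(context)        # Step 1: tentative sentence
        if not tokens:
            break                                            # generation finished
        if any(p < THRESHOLD for p in probs):                # Steps 2-3: confidence check
            query = make_implicit_query(tokens, probs)       # Step 4: implicit query
            docs = retriever.search(query)                   # fetch supporting documents
            tokens, probs = lm.generate_sentence(context, evidence=docs)  # Step 5: regenerate
        answer.append(" ".join(tokens))
        context = prompt + " " + " ".join(answer)            # Step 6: repeat until done
    return " ".join(answer)
```

For the running example, masking the low-confidence span of "Joe Biden attended Penn" yields the implicit query "Joe Biden attended [MASK]", which the retriever can use to fetch documents about Biden's education before the sentence is regenerated.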

Note

The FLARE code and datasets are available on GitHub.

Results of FLARE

  • FLARE outperforms existing RAG models across multiple tasks.
  • Significant improvements in factual accuracy over standard RAG methods.
  • Dynamically retrieves knowledge, avoiding unnecessary retrieval.

Conclusion

FLARE is a major step forward for Retrieval-Augmented Generation (RAG), making AI-generated text more accurate, reliable, and context-aware. By retrieving information only when needed and actively refining responses, FLARE achieves better factual accuracy and adaptability than previous methods.

Valeriia Kuka

Valeriia Kuka, Head of Content at Learn Prompting, is passionate about making AI and ML accessible. Valeriia previously grew a 60K+ follower AI-focused social media account, earning reposts from Stanford NLP, Amazon Research, Hugging Face, and AI researchers. She has also worked with AI/ML newsletters and global communities with 100K+ members and authored clear and concise explainers and historical articles.

Footnotes

  1. Jiang, Z., Xu, F. F., Gao, L., Sun, Z., Liu, Q., Dwivedi-Yu, J., Yang, Y., Callan, J., & Neubig, G. (2023). Active Retrieval Augmented Generation. https://arxiv.org/abs/2305.06983 ↩