InFO-RAG
InFO-RAG is an unsupervised training method designed to enhance large language models (LLMs) in retrieval-augmented generation (RAG) by improving their ability to process and refine retrieved information. Instead of treating retrieved texts as direct references, InFO-RAG enables LLMs to refine, correct, and complete retrieved knowledge, making the final generated outputs more concise, accurate, and complete.
Key Concept: LLMs as "Information Refiners"
InFO-RAG introduces a new perspective where LLMs are considered information refiners rather than simple generators. This means an LLM should not simply pass retrieved texts through to its output, but should:
- Extract and refine useful information.
- Correct misinformation in retrieved texts.
- Complete missing knowledge using their internal understanding.
Why is InFO-RAG needed?
Despite the benefits of RAG, existing LLMs struggle with:
- Extracting relevant information from long and complex retrieved texts.
- Handling misinformation: LLMs are often misled by incorrect or incomplete retrieved texts.
- Answering when retrieval fails: when no relevant content is retrieved, LLMs struggle to fall back on their internal knowledge.
InFO-RAG addresses these problems by explicitly training LLMs to refine information, ensuring they can integrate retrieved texts with their internal knowledge for improved generation.
How InFO-RAG Differs from Existing Techniques
Compared to Standard RAG
| Feature | Standard RAG | InFO-RAG |
|---|---|---|
| Uses retrieved texts directly | Yes | No |
| Refines retrieved knowledge | No | Yes |
| Handles incorrect/missing info | No | Yes |
| Works in zero-shot settings | Limited | Yes |
| Requires supervised fine-tuning | Often | No |
InFO-RAG does not rely on supervised fine-tuning; instead, it trains LLMs in an unsupervised fashion, making it cost-effective and applicable across a wide range of tasks.
Compared to Prompt-Based RAG
Prompt-based methods try to improve RAG without updating model parameters, making them less effective in handling low-quality retrievals. InFO-RAG trains the model itself, meaning LLMs learn to refine retrieved texts inherently, leading to better generalization across tasks.
How InFO-RAG Works
1. Understanding the Three RAG Scenarios
InFO-RAG classifies retrieval scenarios into three types and defines specific learning objectives for each.
| Scenario | Retrieved Texts Contain | InFO-RAG's Goal |
|---|---|---|
| 1 | Complete and correct answers | Extract relevant knowledge and remove irrelevant text. |
| 2 | Incomplete or incorrect information | Correct errors and fill in missing knowledge. |
| 3 | No answer, only related context | Generate answers from internal knowledge. |
2. Unsupervised Training Strategy
InFO-RAG uses Wikipedia to simulate real-world retrieval scenarios and trains LLMs to refine information using three custom training tasks:
Task 1: Select and Copy (Scenario 1)
- Purpose: Teaches LLMs to extract concise and correct knowledge.
- How it works: Given a long retrieved text containing the correct answer, the LLM learns to generate a shorter, direct response (see the sketch below).
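To make this concrete, here is a minimal sketch of how a Select-and-Copy training pair might be constructed. The function name, prompt template, and example text are illustrative assumptions, not the paper's exact pipeline.

```python
# Illustrative sketch only; the InFO-RAG data pipeline differs in detail.

def make_select_and_copy_pair(passage: str, answer_sentence: str) -> dict:
    """Scenario 1: the retrieved passage already contains the correct
    answer; the target teaches the model to output just the concise fact."""
    prompt = (
        f"Retrieved: {passage}\n"
        "Refine the retrieved text into a concise, correct statement:"
    )
    # During training, the loss is computed only on the target tokens.
    return {"input": prompt, "target": answer_sentence}


example = make_select_and_copy_pair(
    passage=(
        "Marie Curie, born in Warsaw in 1867, conducted pioneering research "
        "on radioactivity. She won the Nobel Prize in Physics in 1903 and "
        "the Nobel Prize in Chemistry in 1911."
    ),
    answer_sentence="Marie Curie won Nobel Prizes in Physics (1903) and Chemistry (1911).",
)
```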
Task 2: Correct and Complete (Scenario 2)
- Purpose: Trains LLMs to identify misinformation and complete missing facts.
- How it works: The retrieved text is corrupted (words replaced, masked, or altered), and the LLM learns to correct and complete the knowledge (a corruption sketch follows).
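A hedged sketch of the corruption step: randomly masking or swapping tokens is one simple way to simulate incorrect or incomplete retrieval. The helper below and its `[MASK]` placeholder are assumptions for illustration; the paper's perturbations differ in detail.

```python
import random

def corrupt(text: str, distractors: list[str], rate: float = 0.15) -> str:
    """Simulate Scenario 2 by randomly masking or swapping tokens so the
    retrieved text contains incorrect or missing information."""
    tokens = text.split()
    for i in range(len(tokens)):
        if random.random() < rate:
            # Either hide the token or replace it with a plausible distractor.
            tokens[i] = random.choice(["[MASK]"] + distractors)
    return " ".join(tokens)


clean = "The Eiffel Tower was completed in 1889 in Paris."
noisy = corrupt(clean, distractors=["1920", "London", "museum"])
# Training pair: input = noisy retrieval, target = the clean, correct text.
pair = {
    "input": f"Retrieved: {noisy}\nCorrect and complete the information:",
    "target": clean,
}
```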
Task 3: Contextual Stimulation (Scenario 3)
- Purpose: Teaches LLMs to generate correct answers even when retrieval fails.
- How it works: Retrieved texts lack direct answers but contain related context, prompting LLMs to use internal knowledge to generate responses (see the sketch below).
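The sketch below illustrates one way such a pair could be built: the prompt holds only related context, and the target sentence is deliberately left out so it must come from the model's parametric knowledge. The function name and prompt wording are assumptions.

```python
# Illustrative sketch: build a Scenario-3 pair where the retrieved context
# is topically related but deliberately omits the answer sentence.

def make_contextual_stimulation_pair(context_sentences: list[str],
                                     answer_sentence: str) -> dict:
    """The prompt contains only related context, so the model must produce
    the target fact from its internal (parametric) knowledge."""
    prompt = (
        "Retrieved: " + " ".join(context_sentences) + "\n"
        "State the relevant fact, even if it is not in the retrieved text:"
    )
    return {"input": prompt, "target": answer_sentence}


pair = make_contextual_stimulation_pair(
    context_sentences=[
        "Python is a high-level programming language.",
        "It emphasizes readability and has a large standard library.",
    ],
    answer_sentence="Python was created by Guido van Rossum and first released in 1991.",
)
```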
3. Training Process
- Uses prefix-based language modeling (similar to standard LLM training).
- Trains models without labeled data, keeping costs low.
- Uses LoRA fine-tuning, allowing InFO-RAG to be plug-and-play with existing LLMs (a minimal training sketch follows).
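As a rough illustration of this setup (not the authors' exact code), the sketch below attaches LoRA adapters with Hugging Face PEFT and implements prefix-style language modeling by masking prompt tokens out of the loss. The model name, target modules, and hyperparameters are placeholders.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_name = "meta-llama/Llama-2-7b-hf"  # placeholder; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Attach lightweight LoRA adapters; only these weights are trained.
lora = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05,
                  target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
model = get_peft_model(model, lora)

def build_batch(prompt: str, target: str):
    """Prefix language modeling: compute loss only on the target tokens."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    target_ids = tokenizer(target, return_tensors="pt",
                           add_special_tokens=False).input_ids
    input_ids = torch.cat([prompt_ids, target_ids], dim=1)
    labels = input_ids.clone()
    labels[:, : prompt_ids.shape[1]] = -100  # ignore prompt positions in the loss
    return input_ids, labels

input_ids, labels = build_batch(
    "Retrieved: ...\nRefine the retrieved text:", " The concise answer."
)
loss = model(input_ids=input_ids, labels=labels).loss
loss.backward()  # one unsupervised training step over a constructed pair
```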
You can find the InFO-RAG code on GitHub.
Main Benefits of InFO-RAG
- Significant accuracy gains: +9.39% on average across tasks.
- Better in-context learning: InFO-RAG enhances prompt-based retrieval strategies.
- More robust to misinformation: LLMs trained with InFO-RAG resist misleading retrieved content.
- Generalization across tasks: Works well across QA, language modeling, and code generation.
- Unsupervised & low-cost: Requires no manual labeling and uses existing data sources.
- Avoids catastrophic forgetting: Does not degrade LLM performance in non-RAG tasks.
Conclusion
InFO-RAG provides a paradigm shift in how we approach retrieval-augmented generation. Instead of treating retrieved texts as direct sources, it trains LLMs to refine, correct, and complete information. This results in more accurate, concise, and useful outputs, making LLMs far more effective and reliable in real-world applications.
Footnotes
1. Xu, S., Pang, L., Yu, M., Meng, F., Shen, H., Cheng, X., & Zhou, J. (2024). Unsupervised Information Refinement Training of Large Language Models for Retrieval-Augmented Generation. https://arxiv.org/abs/2402.18150