InFO-RAG
InFO-RAG is an unsupervised training method designed to enhance large language models (LLMs) in retrieval-augmented generation (RAG) by improving their ability to process and refine retrieved information. Instead of treating retrieved texts as direct references, InFO-RAG enables LLMs to refine, correct, and complete retrieved knowledge, making the final generated outputs more concise, accurate, and complete.
Key Concept: LLMs as "Information Refiners"
InFO-RAG introduces a new perspective where LLMs are considered information refiners rather than simple generators. This means an LLM should not simply pass retrieved texts through to its output, but should:
- Extract and refine useful information.
- Correct misinformation in retrieved texts.
- Complete missing knowledge using their internal understanding.
Why is InFO-RAG needed?
Despite the benefits of RAG, existing LLMs struggle with:
- Extracting relevant information from long and complex retrieved texts.
- Handling misinformation: LLMs are often misled by incorrect or incomplete retrieved texts.
- Answering when retrieval fails: when no relevant content is retrieved, LLMs struggle to fall back on their internal knowledge.
InFO-RAG addresses these problems by explicitly training LLMs to refine information, ensuring they can integrate retrieved texts with their internal knowledge for improved generation.
How InFO-RAG Differs from Existing Techniques
Compared to Standard RAG
| Feature | Standard RAG | InFO-RAG |
|---|---|---|
| Uses retrieved texts directly | Yes | No |
| Refines retrieved knowledge | No | Yes |
| Handles incorrect/missing info | No | Yes |
| Works in zero-shot settings | Limited | Yes |
| Requires supervised fine-tuning | Often | No |
InFO-RAG does not rely on supervised fine-tuning; instead, it trains LLMs in an unsupervised fashion, making it cost-effective and applicable across a wide range of tasks.
Compared to Prompt-Based RAG
Prompt-based methods try to improve RAG without updating model parameters, making them less effective in handling low-quality retrievals. InFO-RAG trains the model itself, meaning LLMs learn to refine retrieved texts inherently, leading to better generalization across tasks.
How InFO-RAG Works
1. Understanding the Three RAG Scenarios
InFO-RAG classifies retrieval scenarios into three types and defines specific learning objectives for each.
| Scenario | Retrieved Texts Contain | InFO-RAG's Goal |
|---|---|---|
| 1 | Complete and correct answers | Extract relevant knowledge and remove irrelevant text. |
| 2 | Incomplete or incorrect information | Correct errors and fill in missing knowledge. |
| 3 | No answer, only related context | Generate answers from internal knowledge. |
2. Unsupervised Training Strategy
InFO-RAG uses Wikipedia to simulate real-world retrieval scenarios and trains LLMs to refine information using three custom training tasks:
Task 1: Select and Copy (Scenario 1)
- Purpose: Teaches LLMs to extract concise and correct knowledge.
- How it works: Given a long retrieved text containing the correct answer, the LLM learns to generate a shorter, direct response (see the sketch below).
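To make this concrete, here is a minimal sketch of how a Select-and-Copy training pair might be constructed. The function name, prompt template, and example text are illustrative assumptions, not the paper's exact pipeline.

```python
# Illustrative sketch only; the InFO-RAG data pipeline differs in detail.

def make_select_and_copy_pair(passage: str, answer_sentence: str) -> dict:
    """Scenario 1: the retrieved passage already contains the correct
    answer; the target teaches the model to output just the concise fact."""
    prompt = (
        f"Retrieved: {passage}\n"
        "Refine the retrieved text into a concise, correct statement:"
    )
    # During training, the loss is computed only on the target tokens.
    return {"input": prompt, "target": answer_sentence}


example = make_select_and_copy_pair(
    passage=(
        "Marie Curie, born in Warsaw in 1867, conducted pioneering research "
        "on radioactivity. She won the Nobel Prize in Physics in 1903 and "
        "the Nobel Prize in Chemistry in 1911."
    ),
    answer_sentence="Marie Curie won Nobel Prizes in Physics (1903) and Chemistry (1911).",
)
```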
Task 2: Correct and Complete (Scenario 2)
- Purpose: Trains LLMs to identify misinformation and complete missing facts.
- How it works: The retrieved text is corrupted (words replaced, masked, or altered), and the LLM learns to correct and complete the knowledge (a corruption sketch follows).
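A hedged sketch of the corruption step: randomly masking or swapping tokens is one simple way to simulate incorrect or incomplete retrieval. The helper below and its `[MASK]` placeholder are assumptions for illustration; the paper's perturbations differ in detail.

```python
import random

def corrupt(text: str, distractors: list[str], rate: float = 0.15) -> str:
    """Simulate Scenario 2 by randomly masking or swapping tokens so the
    retrieved text contains incorrect or missing information."""
    tokens = text.split()
    for i in range(len(tokens)):
        if random.random() < rate:
            # Either hide the token or replace it with a plausible distractor.
            tokens[i] = random.choice(["[MASK]"] + distractors)
    return " ".join(tokens)


clean = "The Eiffel Tower was completed in 1889 in Paris."
noisy = corrupt(clean, distractors=["1920", "London", "museum"])
# Training pair: input = noisy retrieval, target = the clean, correct text.
pair = {
    "input": f"Retrieved: {noisy}\nCorrect and complete the information:",
    "target": clean,
}
```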
Task 3: Contextual Stimulation (Scenario 3)
- Purpose: Teaches LLMs to generate correct answers even when retrieval fails.
- How it works: Retrieved texts lack direct answers but contain related context, prompting LLMs to use internal knowledge to generate responses (see the sketch below).
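The sketch below illustrates one way such a pair could be built: the prompt holds only related context, and the target sentence is deliberately left out so it must come from the model's parametric knowledge. The function name and prompt wording are assumptions.

```python
# Illustrative sketch: build a Scenario-3 pair where the retrieved context
# is topically related but deliberately omits the answer sentence.

def make_contextual_stimulation_pair(context_sentences: list[str],
                                     answer_sentence: str) -> dict:
    """The prompt contains only related context, so the model must produce
    the target fact from its internal (parametric) knowledge."""
    prompt = (
        "Retrieved: " + " ".join(context_sentences) + "\n"
        "State the relevant fact, even if it is not in the retrieved text:"
    )
    return {"input": prompt, "target": answer_sentence}


pair = make_contextual_stimulation_pair(
    context_sentences=[
        "Python is a high-level programming language.",
        "It emphasizes readability and has a large standard library.",
    ],
    answer_sentence="Python was created by Guido van Rossum and first released in 1991.",
)
```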
3. Training Process
- Uses prefix-based language modeling (similar to standard LLM training).
- Trains models without labeled data, keeping costs low.
- Uses LoRA fine-tuning, allowing InFO-RAG to be plug-and-play with existing LLMs (a minimal training sketch follows).
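As a rough illustration of this setup (not the authors' exact code), the sketch below attaches LoRA adapters with Hugging Face PEFT and implements prefix-style language modeling by masking prompt tokens out of the loss. The model name, target modules, and hyperparameters are placeholders.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_name = "meta-llama/Llama-2-7b-hf"  # placeholder; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Attach lightweight LoRA adapters; only these weights are trained.
lora = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05,
                  target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
model = get_peft_model(model, lora)

def build_batch(prompt: str, target: str):
    """Prefix language modeling: compute loss only on the target tokens."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    target_ids = tokenizer(target, return_tensors="pt",
                           add_special_tokens=False).input_ids
    input_ids = torch.cat([prompt_ids, target_ids], dim=1)
    labels = input_ids.clone()
    labels[:, : prompt_ids.shape[1]] = -100  # ignore prompt positions in the loss
    return input_ids, labels

input_ids, labels = build_batch(
    "Retrieved: ...\nRefine the retrieved text:", " The concise answer."
)
loss = model(input_ids=input_ids, labels=labels).loss
loss.backward()  # one unsupervised training step over a constructed pair
```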
You can find the InFO-RAG code on GitHub.
Main Benefits of InFO-RAG
- Significant accuracy gains: +9.39% on average across tasks.
- Better in-context learning: InFO-RAG enhances prompt-based retrieval strategies.
- More robust to misinformation: LLMs trained with InFO-RAG resist misleading retrieved content.
- Generalization across tasks: Works well across QA, language modeling, and code generation.
- Unsupervised & low-cost: Requires no manual labeling and uses existing data sources.
- Avoids catastrophic forgetting: Does not degrade LLM performance in non-RAG tasks.
Conclusion
InFO-RAG provides a paradigm shift in how we approach retrieval-augmented generation. Instead of treating retrieved texts as direct sources, it trains LLMs to refine, correct, and complete information. This results in more accurate, concise, and useful outputs, making LLMs far more effective and reliable in real-world applications.
Footnotes
1. Xu, S., Pang, L., Yu, M., Meng, F., Shen, H., Cheng, X., & Zhou, J. (2024). Unsupervised Information Refinement Training of Large Language Models for Retrieval-Augmented Generation. https://arxiv.org/abs/2402.18150