Attention Prompting on Image is a novel technique designed to enhance the performance of Vision-Language Models (VLMs) on vision-language tasks. It works by overlaying a text-query-guided attention heatmap on the original image, which guides the model's focus to the regions most relevant to the specific question or task at hand.
Attention Prompting on Image can be integrated into any VLM pipeline that processes both image and text. A simplified workflow:

1. Take the input image and the text query.
2. Use an auxiliary vision-language model to score how relevant each image region is to the query, producing an attention heatmap.
3. Overlay the heatmap on the original image, for example by dimming low-relevance regions.
4. Feed the modified image, together with the original text query, to the VLM.
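The overlay step can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's exact implementation: the query-guided heatmap is passed in as a precomputed array (in practice it would come from an auxiliary model), and the function name `overlay_heatmap` and the alpha-blending scheme are illustrative assumptions.

```python
import numpy as np

def overlay_heatmap(image: np.ndarray, heatmap: np.ndarray, alpha: float = 0.5) -> np.ndarray:
    """Blend a query-guided attention heatmap onto an RGB image.

    image   : (H, W, 3) uint8 array
    heatmap : (H, W) float array in [0, 1]; higher values mean more relevant
    alpha   : maximum dimming applied to low-relevance regions
    """
    # Low-relevance regions are darkened, visually steering the VLM's
    # attention toward the highlighted areas of the image.
    weights = (1.0 - alpha) + alpha * heatmap            # (H, W), in [1 - alpha, 1]
    blended = image.astype(np.float32) * weights[..., None]
    return blended.clip(0, 255).astype(np.uint8)

# Toy example: a 2x2 white image where only the top-left pixel is relevant.
img = np.full((2, 2, 3), 255, dtype=np.uint8)
heat = np.array([[1.0, 0.0], [0.0, 0.0]])
out = overlay_heatmap(img, heat, alpha=0.5)
# The top-left pixel keeps full brightness; the rest are dimmed by half.
```

The resulting array can be saved or passed to the VLM in place of the original image, alongside the unchanged text query.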
For implementation, you can check the original open-source code on GitHub.
Attention Prompting on Image significantly improves performance on several benchmarks by helping the VLM focus on relevant parts of the image based on the text query.
| Model | Dataset | No Prompt (%) | Attention Prompting on Image (%) | Improvement (%) |
|---|---|---|---|---|
| LLaVA-1.5 | MM-Vet | 32.8 | 36.6 | +3.8 |
| LLaVA-1.5 | LLaVA-Bench | 71.9 | 74.8 | +2.9 |
| GPT-4V | VisWiz | 59.4 | 71.0 | +11.6 |
Attention Prompting on Image enhances the performance of LVLMs by providing query-specific attention heatmaps that help the models better interpret and respond to vision-language tasks. It improves accuracy across a range of benchmarks, reduces hallucination, and enhances self-reflection. The technique opens new avenues for using visual signals to improve vision-language models without requiring complex training or fine-tuning.