Conditional Prompt Learning (CoCoOp) for Vision-Language Models

Last updated on October 1, 2024 by Valeriia Kuka

What is Conditional Prompt Learning (CoCoOp)?

Conditional Prompt Learning (CoCoOp) [1] is a technique that adapts pre-trained vision-language models such as CLIP to new tasks. Vision-language models map images and text into a shared embedding space, enabling open-world visual recognition. However, adapting these models to specific downstream tasks is challenging.

A prior approach, Context Optimization (CoOp), introduced prompt learning by converting the static words of a prompt into learnable vectors. This helps pre-trained models adapt to new datasets, but it has a key weakness: the learned prompts overfit to the training classes. CoOp's static prompts struggle to generalize to unseen categories, making them less useful when the class distribution shifts.

CoCoOp improves this by generating conditional prompts based on each input image. It adds a lightweight neural network (Meta-Net) that creates a dynamic token tailored to each image. These dynamic prompts are more flexible and can adapt better to unseen classes, enhancing generalization across tasks.
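The idea can be sketched in a few lines of NumPy. Everything below (the 512-dimensional features, the 4 context vectors, the two-layer bottleneck Meta-Net) is an illustrative stand-in for the paper's CLIP-based setup, not the reference implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

DIM = 512    # feature dimension (illustrative; CLIP ViT-B/16 uses 512)
N_CTX = 4    # number of learnable context vectors (illustrative)

# Learnable context vectors, shared across all images (as in CoOp)
context = rng.normal(size=(N_CTX, DIM))

# Meta-Net: a tiny Linear-ReLU-Linear bottleneck that maps an image
# feature to a single conditioning token (weights random here)
W1 = rng.normal(size=(DIM, DIM // 16))
W2 = rng.normal(size=(DIM // 16, DIM))

def meta_net(image_feature):
    """Produce an instance-specific token pi = MetaNet(x)."""
    hidden = np.maximum(image_feature @ W1, 0.0)  # ReLU
    return hidden @ W2

def conditional_prompt(image_feature):
    """CoCoOp-style prompt: each context vector is shifted by the
    image-conditioned token, v_i(x) = v_i + pi."""
    pi = meta_net(image_feature)
    return context + pi  # broadcasts pi across all N_CTX vectors

image_feature = rng.normal(size=DIM)
prompt = conditional_prompt(image_feature)
print(prompt.shape)  # (4, 512): one conditioned vector per context position
```

In the full model, these conditioned vectors are concatenated with the class-name embedding and fed through CLIP's text encoder; only the context vectors and the Meta-Net are trained.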

How does CoCoOp differ from existing techniques?

CoCoOp builds on CoOp by addressing the generalization problem. Here’s a comparison:

| Technique | Approach | Key Feature | Generalization |
|---|---|---|---|
| CoOp | Static prompts with learnable vectors | Trained with specific base classes | Overfits to base classes |
| CoCoOp | Conditional prompts (dynamic) | Uses input-specific tokens | Better for unseen/new classes |

Key Differences:

  • CoOp uses static context vectors optimized during training. Once trained, these prompts don’t change, making them sensitive to class shifts.
  • CoCoOp generates instance-specific tokens, meaning the prompt adapts to each image, improving performance on unseen data. This reduces overfitting and allows for better generalization, even beyond the dataset it was trained on.
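This contrast can be made concrete with a toy sketch (the shapes and the tiny Meta-Net are hypothetical placeholders): CoOp builds one prompt and reuses it for every image, while CoCoOp's prompt moves with the input.

```python
import numpy as np

rng = np.random.default_rng(1)
DIM, N_CTX = 512, 4  # illustrative dimensions

# CoOp: one static set of context vectors, frozen after training
static_context = rng.normal(size=(N_CTX, DIM))

# CoCoOp: the same context vectors plus a Meta-Net token per image
W1 = rng.normal(size=(DIM, 32))
W2 = rng.normal(size=(32, DIM))
meta_net = lambda x: np.maximum(x @ W1, 0.0) @ W2

img_a, img_b = rng.normal(size=DIM), rng.normal(size=DIM)

# CoOp produces the identical prompt for every image...
coop_a, coop_b = static_context, static_context
print(np.array_equal(coop_a, coop_b))

# ...while CoCoOp's prompt shifts with each input
cocoop_a = static_context + meta_net(img_a)
cocoop_b = static_context + meta_net(img_b)
print(np.allclose(cocoop_a, cocoop_b))
```

The first check prints `True` and the second `False`: the instance-specific shift is exactly what lets CoCoOp adapt to inputs it never saw during training.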

How to use CoCoOp

CoCoOp is useful for tasks like image classification where data distributions change or where you need to apply the model to unseen categories. The technique can be applied to various visual recognition tasks, such as:

  • Fine-grained classification: Differentiating between similar categories (e.g., dog breeds).
  • Scene recognition: Recognizing different types of scenes (e.g., wind farm, train railway).
  • Cross-dataset transfer: Applying a model trained on one dataset (e.g., ImageNet) to a completely different dataset (e.g., OxfordPets).
Tip

For open-source code, see the official implementation released with the paper.

Results of CoCoOp

CoCoOp significantly improves generalization across multiple benchmarks compared to CoOp, with better performance on unseen classes and in domain-generalization scenarios. Here’s a summary of key results:

Base-to-New Class Generalization:

| Method | Base Accuracy | New Accuracy | Harmonic Mean (H) |
|---|---|---|---|
| CLIP | 69.34% | 74.22% | 71.70% |
| CoOp | 82.69% | 63.22% | 71.66% |
| CoCoOp | 80.47% | 71.69% | 75.83% |

  • Harmonic Mean highlights the balance between generalization on base and unseen classes. CoCoOp consistently outperforms CoOp, with better accuracy on unseen classes.
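The harmonic mean column is computed as H = 2 · Base · New / (Base + New). Because it penalizes a large gap between the two accuracies, CoOp's strong base accuracy cannot compensate for its weak new-class accuracy. A quick check reproduces the table:

```python
def harmonic_mean(base, new):
    """H = 2 * base * new / (base + new), in percentage points."""
    return 2 * base * new / (base + new)

# Reproduce the H column from the table above
print(f"{harmonic_mean(69.34, 74.22):.2f}")  # 71.70 (CLIP)
print(f"{harmonic_mean(82.69, 63.22):.2f}")  # 71.66 (CoOp)
print(f"{harmonic_mean(80.47, 71.69):.2f}")  # 75.83 (CoCoOp)
```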

Cross-Dataset Transfer (From ImageNet to 10 other datasets):

| Method | Average Transfer Accuracy |
|---|---|
| CoOp | 63.88% |
| CoCoOp | 65.74% |

CoCoOp demonstrates better transferability, particularly when the datasets have different categories or domain shifts.

Domain Generalization:

| Method | ImageNet-Sketch | ImageNet-A | ImageNet-R |
|---|---|---|---|
| CLIP | 46.15% | 47.77% | 73.96% |
| CoOp | 47.99% | 49.71% | 75.21% |
| CoCoOp | 48.75% | 50.63% | 76.18% |

  • CoCoOp improves robustness against domain shifts, outperforming both CLIP and CoOp on challenging benchmarks.

Conclusion

CoCoOp is a powerful extension of prompt learning for vision-language models, designed to tackle overfitting and enhance generalization. Its dynamic, input-conditional prompts adapt to unseen categories and tasks, offering more flexible and robust performance.

Footnotes

  1. Zhou, K., Yang, J., Loy, C. C., & Liu, Z. (2022). Conditional Prompt Learning for Vision-Language Models. https://arxiv.org/abs/2203.05557

Copyright © 2024 Learn Prompting.