
Context Optimization (CoOp) for Vision-Language Models

Last updated on October 1, 2024 by Valeriia Kuka

What is Context Optimization (CoOp)?

Context Optimization (CoOp)¹ is one of the first methods in the field of prompt learning: it automates the design of prompts for vision-language models. Instead of manually tuning words, CoOp models a prompt's context with learnable vectors, making it easier and faster to adapt a model like CLIP to different image classification tasks.

How CoOp Works:

  1. Unified Context: CoOp generates prompts that share the same context across all image classes.
  2. Class-Specific Context: It can also create prompts tailored to individual classes, which is useful for fine-grained tasks.

By keeping the pre-trained model parameters fixed and only learning the prompt context, CoOp requires very few labeled images (shots) to outperform manually crafted prompts.
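The practical consequence of freezing the backbone is that a training step only updates the context vectors. Below is a minimal PyTorch-style sketch of one such few-shot update, assuming a frozen CLIP-like model with an `encode_image` method, a frozen text encoder that maps prompt token embeddings to text features, and a `prompt_learner` module holding the learnable context. These names and shapes are illustrative assumptions, not the authors' reference implementation.

```python
import torch
import torch.nn.functional as F

def coop_training_step(images, labels, clip_model, text_encoder,
                       prompt_learner, optimizer, logit_scale):
    # One few-shot update: only the learnable context vectors inside
    # `prompt_learner` receive gradients; the CLIP encoders stay frozen.
    with torch.no_grad():                                   # frozen image encoder
        img_feats = clip_model.encode_image(images)         # [B, D]

    prompt_embeds = prompt_learner()                         # [K, M+L, D_token]
    # The text encoder's weights are frozen (requires_grad=False), but gradients
    # still flow through it back to the context vectors in prompt_embeds.
    txt_feats = text_encoder(prompt_embeds)                  # [K, D]

    img_feats = F.normalize(img_feats, dim=-1)
    txt_feats = F.normalize(txt_feats, dim=-1)
    logits = logit_scale * img_feats @ txt_feats.t()         # [B, K] cosine similarities

    loss = F.cross_entropy(logits, labels)                   # standard classification loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```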

How This Technique Differs from Existing Techniques

  • Traditional Prompt Engineering: Relies on manually crafted prompts, which require expertise and extensive tuning.
  • Zero-Shot Models: Use fixed hand-crafted prompts and generalize to new tasks without additional training, but lack flexibility.
  • CoOp: Automates the prompt design process and enhances performance even with a small number of training examples (one or two shots), making it more efficient and adaptable to various image recognition tasks.

Unlike zero-shot methods, CoOp learns continuous context vectors from data, yet it still maintains strong domain generalization, meaning it can handle tasks across different datasets without significant loss of performance.

How to Use CoOp

CoOp can be applied to vision-language models like CLIP for various downstream image classification tasks. Here is an example of how it can be used with the two prompt types:

Example with Unified Context:

Prompt

[V]_1 [V]_2 ... [V]_M [CLASS]

Where [V]_1 ... [V]_M are the learnable context vectors, M is a hyperparameter specifying the number of context tokens, and [CLASS] is the target class name (e.g., "cat").
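Concretely, unified context amounts to a single learnable tensor of M vectors that is shared by all K classes and prepended to each class name's frozen token embeddings. The sketch below is a hypothetical illustration of that construction; the module name, shapes, and initialization are assumptions rather than the paper's code.

```python
import torch
import torch.nn as nn

class UnifiedContext(nn.Module):
    # One shared set of M context vectors [V]_1 ... [V]_M, prepended to every
    # class's (frozen) token embeddings to form the prompt [V]_1 ... [V]_M [CLASS].
    def __init__(self, num_ctx: int, ctx_dim: int, class_token_embeds: torch.Tensor):
        super().__init__()
        self.ctx = nn.Parameter(torch.empty(num_ctx, ctx_dim).normal_(std=0.02))  # [M, D]
        self.register_buffer("cls_embeds", class_token_embeds)                    # [K, L, D], frozen

    def forward(self) -> torch.Tensor:
        k = self.cls_embeds.size(0)
        ctx = self.ctx.unsqueeze(0).expand(k, -1, -1)    # broadcast shared context to K classes
        return torch.cat([ctx, self.cls_embeds], dim=1)  # [K, M+L, D]
```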

Example with Class-Specific Context:

Prompt

[V_class1]_1 [V_class1]_2 ... [V_class1]_M "cat"

Here, each class (e.g., "cat" and "dog") has its own set of context vectors optimized for that class.
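In code, the only change from the unified-context sketch above is that the learnable tensor gains a class dimension, so each class optimizes its own context vectors. Again, this is a hedged illustration under the same assumed shapes, not the official implementation.

```python
import torch
import torch.nn as nn

class ClassSpecificContext(nn.Module):
    # Each of the K classes gets its own M context vectors: [V_class_i]_1 ... [V_class_i]_M.
    def __init__(self, num_classes: int, num_ctx: int, ctx_dim: int,
                 class_token_embeds: torch.Tensor):
        super().__init__()
        self.ctx = nn.Parameter(
            torch.empty(num_classes, num_ctx, ctx_dim).normal_(std=0.02))  # [K, M, D]
        self.register_buffer("cls_embeds", class_token_embeds)             # [K, L, D], frozen

    def forward(self) -> torch.Tensor:
        # One tailored prompt per class: its own context followed by its class tokens.
        return torch.cat([self.ctx, self.cls_embeds], dim=1)               # [K, M+L, D]
```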

Tip

For open-source code, check this link.

Results of CoOp

CoOp has been tested on 11 datasets, covering a wide range of visual tasks including object recognition, fine-grained classification (e.g., flowers, pets), and more specialized tasks like texture and satellite image recognition.

Comparison with Other Fine-Tuning Methods

CoOp significantly outperforms other fine-tuning methods on ImageNet when using 16 training examples per class. In particular, it improves on zero-shot CLIP's 58.18% accuracy by 4.77 percentage points. Other methods, such as fine-tuning the image encoder or optimizing a text transformation layer, show smaller improvements or even large performance drops.

| Method                          | ImageNet Accuracy (%) | ∆ with Zero-shot |
|---------------------------------|-----------------------|------------------|
| Zero-shot CLIP                  | 58.18                 | –                |
| Linear probe                    | 55.87                 | −2.31            |
| Fine-tuning image encoder       | 18.28                 | −39.90           |
| Optimizing transformation layer | 58.86                 | +0.68            |
| Optimizing bias (text)          | 60.93                 | +2.75            |
| CoOp                            | 62.95                 | +4.77            |

Key Takeaways:

  • CoOp provides significant improvements when fine-tuning vision-language models, especially in scenarios involving domain shift.
  • It effectively bridges the performance gap between zero-shot CLIP and supervised models on datasets such as DOSCO-2k and ImageNet.
  • The flexibility of learned context tokens allows CoOp to generalize across varied tasks, making it a powerful tool for domain generalization and few-shot learning.

Footnotes

  1. Zhou, K., Yang, J., Loy, C. C., & Liu, Z. (2022). Learning to Prompt for Vision-Language Models. International Journal of Computer Vision, 130(9), 2337–2348. https://doi.org/10.1007/s11263-022-01653-1
