
Context Optimization (CoOp) for Vision-Language Models

Last updated on October 1, 2024 by Valeriia Kuka

What is Context Optimization (CoOp)?

Context Optimization (CoOp)¹ is one of the first methods in the field of prompt learning: it automates the design of prompts for vision-language models. Instead of manually tuning words, CoOp models a prompt's context with learnable vectors, making it easier and faster to adapt a model like CLIP to different image classification tasks.

How CoOp Works:

  1. Unified Context: CoOp generates prompts that share the same context across all image classes.
  2. Class-Specific Context: It can also create prompts tailored to individual classes, which is useful for fine-grained tasks.

By keeping the pre-trained model parameters fixed and only learning the prompt context, CoOp requires very few labeled images (shots) to outperform manually crafted prompts.
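The practical consequence of freezing the backbone is that a training step only updates the context vectors. Below is a minimal PyTorch-style sketch of one such few-shot update, assuming a frozen CLIP-like model with an `encode_image` method, a frozen text encoder that maps prompt token embeddings to text features, and a `prompt_learner` module holding the learnable context. These names and shapes are illustrative assumptions, not the authors' reference implementation.

```python
import torch
import torch.nn.functional as F

def coop_training_step(images, labels, clip_model, text_encoder,
                       prompt_learner, optimizer, logit_scale):
    # One few-shot update: only the learnable context vectors inside
    # `prompt_learner` receive gradients; the CLIP encoders stay frozen.
    with torch.no_grad():                                   # frozen image encoder
        img_feats = clip_model.encode_image(images)         # [B, D]

    prompt_embeds = prompt_learner()                         # [K, M+L, D_token]
    # The text encoder's weights are frozen (requires_grad=False), but gradients
    # still flow through it back to the context vectors in prompt_embeds.
    txt_feats = text_encoder(prompt_embeds)                  # [K, D]

    img_feats = F.normalize(img_feats, dim=-1)
    txt_feats = F.normalize(txt_feats, dim=-1)
    logits = logit_scale * img_feats @ txt_feats.t()         # [B, K] cosine similarities

    loss = F.cross_entropy(logits, labels)                   # standard classification loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```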

How This Technique Differs from Existing Techniques

  • Traditional Prompt Engineering: Relies on manually crafted prompts, which require expertise and extensive tuning.
  • Zero-Shot Models: Use fixed hand-crafted prompts and generalize to new tasks without additional training, but lack flexibility.
  • CoOp: Automates the prompt design process and enhances performance even with a small number of training examples (one or two shots), making it more efficient and adaptable to various image recognition tasks.

Unlike zero-shot methods, CoOp learns continuous context vectors from data, yet it still maintains strong domain generalization, meaning it can handle tasks across different datasets without significant loss of performance.

How to Use CoOp

CoOp can be applied to vision-language models like CLIP for various downstream image classification tasks. Here is an example of how it can be used with the two prompt types:

Example with Unified Context:

Prompt

[V]_1 [V]_2 ... [V]_M [CLASS]

Where [V]_1 ... [V]_M are the learnable context vectors, M is a hyperparameter specifying the number of context tokens, and [CLASS] is the target class name (e.g., "cat").
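Concretely, unified context amounts to a single learnable tensor of M vectors that is shared by all K classes and prepended to each class name's frozen token embeddings. The sketch below is a hypothetical illustration of that construction; the module name, shapes, and initialization are assumptions rather than the paper's code.

```python
import torch
import torch.nn as nn

class UnifiedContext(nn.Module):
    # One shared set of M context vectors [V]_1 ... [V]_M, prepended to every
    # class's (frozen) token embeddings to form the prompt [V]_1 ... [V]_M [CLASS].
    def __init__(self, num_ctx: int, ctx_dim: int, class_token_embeds: torch.Tensor):
        super().__init__()
        self.ctx = nn.Parameter(torch.empty(num_ctx, ctx_dim).normal_(std=0.02))  # [M, D]
        self.register_buffer("cls_embeds", class_token_embeds)                    # [K, L, D], frozen

    def forward(self) -> torch.Tensor:
        k = self.cls_embeds.size(0)
        ctx = self.ctx.unsqueeze(0).expand(k, -1, -1)    # broadcast shared context to K classes
        return torch.cat([ctx, self.cls_embeds], dim=1)  # [K, M+L, D]
```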

Example with Class-Specific Context:

Prompt

[V_class1]_1 [V_class1]_2 ... [V_class1]_M "cat"

Here, each class (e.g., "cat" and "dog") has its own set of context vectors optimized for that class.
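In code, the only change from the unified-context sketch above is that the learnable tensor gains a class dimension, so each class optimizes its own context vectors. Again, this is a hedged illustration under the same assumed shapes, not the official implementation.

```python
import torch
import torch.nn as nn

class ClassSpecificContext(nn.Module):
    # Each of the K classes gets its own M context vectors: [V_class_i]_1 ... [V_class_i]_M.
    def __init__(self, num_classes: int, num_ctx: int, ctx_dim: int,
                 class_token_embeds: torch.Tensor):
        super().__init__()
        self.ctx = nn.Parameter(
            torch.empty(num_classes, num_ctx, ctx_dim).normal_(std=0.02))  # [K, M, D]
        self.register_buffer("cls_embeds", class_token_embeds)             # [K, L, D], frozen

    def forward(self) -> torch.Tensor:
        # One tailored prompt per class: its own context followed by its class tokens.
        return torch.cat([self.ctx, self.cls_embeds], dim=1)               # [K, M+L, D]
```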

Tip

For open-source code, check this link.

Results of CoOp

CoOp has been tested on 11 datasets, covering a wide range of visual tasks including object recognition, fine-grained classification (e.g., flowers, pets), and more specialized tasks like texture and satellite image recognition.

Comparison with Other Fine-Tuning Methods

CoOp significantly outperforms other fine-tuning methods on ImageNet when using 16 training examples per class. In particular, it improves on zero-shot CLIP's 58.18% accuracy by 4.77 percentage points. Other methods, such as fine-tuning the image encoder or optimizing a text transformation layer, show smaller improvements or even large performance drops.

| Method                          | ImageNet Accuracy (%) | ∆ with Zero-shot |
|---------------------------------|-----------------------|------------------|
| Zero-shot CLIP                  | 58.18                 | –                |
| Linear probe                    | 55.87                 | −2.31            |
| Fine-tuning image encoder       | 18.28                 | −39.90           |
| Optimizing transformation layer | 58.86                 | +0.68            |
| Optimizing bias (text)          | 60.93                 | +2.75            |
| CoOp                            | 62.95                 | +4.77            |

Key Takeaways:

  • CoOp provides significant improvements when fine-tuning vision-language models, especially in scenarios involving domain shift.
  • It effectively bridges the performance gap between zero-shot CLIP and supervised models on datasets such as DOSCO-2k and ImageNet.
  • The flexibility of learned context tokens allows CoOp to generalize across varied tasks, making it a powerful tool for domain generalization and few-shot learning.

Footnotes

  1. Zhou, K., Yang, J., Loy, C. C., & Liu, Z. (2022). Learning to Prompt for Vision-Language Models. International Journal of Computer Vision, 130(9), 2337–2348. https://doi.org/10.1007/s11263-022-01653-1
