Conditional Context Optimization (CoCoOp) is a conditional prompt learning technique that adapts pre-trained vision-language models such as CLIP to new tasks. Vision-language models map images and text into a shared embedding space, which enables open-world visual recognition. However, adapting these models to a specific downstream task is not straightforward.
A prior approach, Context Optimization (CoOp), introduced prompt learning by replacing the fixed words in a prompt with learnable vectors. This helps pre-trained models adapt to new datasets, but it has a weakness: it overfits to the classes seen during training. CoOp’s static prompts struggle to generalize to unseen categories, which makes them less useful when the set of classes changes.
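The static prompt can be written compactly. The formulation below follows the notation of the cited paper (Zhou et al., 2022); the symbols are not defined elsewhere in this article, so treat it as a summary sketch. Here $v_1, \dots, v_M$ are the learnable context vectors shared by all images, $c_i$ is the word embedding of the $i$-th class name, $g(\cdot)$ is the text encoder, $x$ is the image feature, $\tau$ is a temperature, and $K$ is the number of classes.

$$
t_i = \{v_1, v_2, \dots, v_M, c_i\}, \qquad
p(y \mid x) = \frac{\exp\big(\mathrm{sim}(x, g(t_y)) / \tau\big)}{\sum_{i=1}^{K} \exp\big(\mathrm{sim}(x, g(t_i)) / \tau\big)}
$$

Because the context vectors $v_m$ do not depend on $x$, the same prompt is used for every image. This is the sense in which CoOp's prompts are static.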
CoCoOp addresses this by generating prompts conditioned on each input image. It adds a lightweight neural network (Meta-Net) that maps the image's features to an input-conditional token, which is then combined with the learnable context vectors. Because the prompt shifts with each image instead of being fixed to the training classes, it adapts better to unseen classes and generalizes better across tasks.
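A minimal sketch of the conditional-prompt idea, written in plain Python with no deep-learning library. The names `MetaNet` and `conditional_context`, the toy dimensions, and the random initialization are illustrative assumptions and not the authors' code. The two-layer bottleneck (Linear, ReLU, Linear) and the additive shift of every context vector follow the description in the cited paper. A real implementation would take image features from CLIP's image encoder and train the context vectors and Meta-Net with backpropagation.

```python
import random


def linear(x, weight, bias):
    """Affine map y = W x + b, with W stored as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def relu(x):
    return [max(0.0, xi) for xi in x]


def init_linear(rng, in_dim, out_dim, scale=0.1):
    """Small random weights and zero bias (illustrative initialization)."""
    weight = [[rng.uniform(-scale, scale) for _ in range(in_dim)]
              for _ in range(out_dim)]
    return weight, [0.0] * out_dim


class MetaNet:
    """Hypothetical bottleneck network: image feature -> one token pi."""

    def __init__(self, rng, feat_dim, ctx_dim, reduction=4):
        hidden = max(1, feat_dim // reduction)
        self.w1, self.b1 = init_linear(rng, feat_dim, hidden)
        self.w2, self.b2 = init_linear(rng, hidden, ctx_dim)

    def __call__(self, image_feature):
        h = relu(linear(image_feature, self.w1, self.b1))
        return linear(h, self.w2, self.b2)


def conditional_context(context, meta_token):
    """Shift every shared context vector by the same image-specific token."""
    return [[v + p for v, p in zip(vec, meta_token)] for vec in context]


rng = random.Random(0)
FEAT_DIM, CTX_DIM, N_CTX = 8, 8, 4  # toy sizes, chosen for readability

# Shared context vectors (these are what CoOp learns and keeps static).
context = [[rng.uniform(-0.02, 0.02) for _ in range(CTX_DIM)]
           for _ in range(N_CTX)]
meta_net = MetaNet(rng, FEAT_DIM, CTX_DIM)

# Stand-ins for image features; a real pipeline would use an image encoder.
image_a = [rng.gauss(0, 1) for _ in range(FEAT_DIM)]
image_b = [rng.gauss(0, 1) for _ in range(FEAT_DIM)]

ctx_a = conditional_context(context, meta_net(image_a))
ctx_b = conditional_context(context, meta_net(image_b))
print(len(ctx_a), "context vectors of dimension", len(ctx_a[0]))
```

In this sketch, `ctx_a` and `ctx_b` would each be followed by the class-name embedding and passed to the text encoder, so every image gets its own prompt while the shared `context` stays unchanged.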
CoCoOp builds on CoOp by addressing the generalization problem. Here’s a comparison:
Technique | Approach | Key Feature | Generalization |
---|---|---|---|
CoOp | Static prompts with learnable vectors | Trained with specific base classes | Overfits to base classes |
CoCoOp | Conditional prompts (dynamic) | Uses input-specific tokens | Better for unseen/new classes |
CoCoOp is useful for tasks such as image classification where the data distribution changes or where the model must be applied to unseen categories. The technique can be applied to a range of visual recognition tasks.
For open-source code, check this link.
CoCoOp improves generalization over CoOp across several benchmarks: it performs better on unseen classes, in cross-dataset transfer, and under domain shift. The table below reports base-to-new generalization accuracy, averaged over 11 datasets:
Method | Base Accuracy | New Accuracy | Harmonic Mean (H) |
---|---|---|---|
CLIP | 69.34% | 74.22% | 71.70% |
CoOp | 82.69% | 63.22% | 71.66% |
CoCoOp | 80.47% | 71.69% | 75.83% |
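
The harmonic mean column combines base and new accuracy into a single score that penalizes a large gap between the two. A short sketch that recomputes it from the base and new accuracies in the table above (the function name `harmonic_mean` and the `results` dictionary are illustrative):

```python
def harmonic_mean(base_acc, new_acc):
    """Harmonic mean of base-class and new-class accuracy (in percent)."""
    if base_acc + new_acc == 0:
        return 0.0
    return 2 * base_acc * new_acc / (base_acc + new_acc)


# (base accuracy, new accuracy) pairs copied from the table above.
results = {
    "CLIP": (69.34, 74.22),
    "CoOp": (82.69, 63.22),
    "CoCoOp": (80.47, 71.69),
}

for method, (base, new) in results.items():
    print(f"{method}: H = {harmonic_mean(base, new):.2f}%")
# CLIP: H = 71.70%
# CoOp: H = 71.66%
# CoCoOp: H = 75.83%
```

CoOp's high base accuracy does not lift its score above CLIP's, because the harmonic mean is pulled toward the lower of the two numbers.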

Cross-dataset transfer (prompts learned on ImageNet, evaluated on 10 other datasets):

Method | Average Transfer Accuracy |
---|---|
CoOp | 63.88% |
CoCoOp | 65.74% |
CoCoOp demonstrates better transferability, particularly when the datasets have different categories or domain shifts.

Domain generalization (prompts learned on ImageNet, evaluated on shifted variants of it):

Method | ImageNet-Sketch | ImageNet-A | ImageNet-R |
---|---|---|---|
CLIP | 46.15% | 47.77% | 73.96% |
CoOp | 47.99% | 49.71% | 75.21% |
CoCoOp | 48.75% | 50.63% | 76.18% |
CoCoOp extends prompt learning for vision-language models to reduce overfitting and improve generalization. Its input-conditional prompts adapt to unseen categories and tasks: in the results above, it gives up a little base-class accuracy compared with CoOp (80.47% vs. 82.69%) in exchange for a much higher new-class accuracy (71.69% vs. 63.22%).
Valeriia Kuka, Head of Content at Learn Prompting, is passionate about making AI and ML accessible. Valeriia previously grew a 60K+ follower AI-focused social media account, earning reposts from Stanford NLP, Amazon Research, Hugging Face, and AI researchers. She has also worked with AI/ML newsletters and global communities with 100K+ members and authored clear and concise explainers and historical articles.
Zhou, K., Yang, J., Loy, C. C., & Liu, Z. (2022). Conditional Prompt Learning for Vision-Language Models. https://arxiv.org/abs/2203.05557