Mixture of Prompt Learning (MoCoOp) is a technique for improving the performance of pre-trained vision-language models (VLMs) such as CLIP on downstream tasks. It builds on prompt learning, which tunes the text prompts (inputs) fed to a VLM, steering the frozen model toward better results on tasks like image classification.
Traditionally, prompts are either hard prompts (manually designed, static templates such as "a photo of a {class}") or soft prompts (learnable vectors optimized during training). Soft prompts are more adaptable but face two key challenges:

- A single soft prompt struggles to capture the diverse styles and patterns present in a dataset.
- Learned prompts tend to overfit to the training classes, losing the generalizable knowledge carried by hard prompts.
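To make the distinction concrete, here is a minimal PyTorch sketch contrasting the two prompt types. The context length `n_ctx` and the variable names are illustrative assumptions, not values taken from the paper (the 512-dim embedding size matches CLIP's ViT-B text encoder):

```python
import torch
import torch.nn as nn

# Hard prompt: a fixed, hand-written template filled with a class name.
hard_prompt = "a photo of a {}.".format("golden retriever")

# Soft prompt: learnable context vectors that stand in for template words.
# n_ctx is illustrative; CLIP's ViT-B text encoder uses 512-dim token embeddings.
n_ctx, embed_dim = 16, 512
soft_prompt = nn.Parameter(torch.randn(n_ctx, embed_dim) * 0.02)

# At training time, the soft prompt is concatenated with the class-name token
# embeddings and optimized by gradient descent while CLIP stays frozen.
class_tokens = torch.randn(4, embed_dim)  # stand-in for embedded class-name tokens
text_input = torch.cat([soft_prompt, class_tokens], dim=0)  # shape: (20, 512)
```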
MoCoOp tackles both issues by maintaining multiple soft prompts, each representing a different style, and using a routing module to dynamically select the most appropriate prompt for each image. This is combined with a gating mechanism that keeps the selected prompts aligned with the knowledge from hard prompts, which helps reduce overfitting.
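As a rough sketch of the idea (not the paper's exact implementation), the routing module can be a small network that scores each soft prompt in a bank against the image feature and picks the best match. The class name `PromptRouter`, the bank size, and the hard top-1 selection are all assumptions for illustration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PromptRouter(nn.Module):
    """Bank of soft prompts plus a router that scores each prompt
    against an image feature and selects the best match (illustrative)."""

    def __init__(self, num_prompts: int = 4, n_ctx: int = 16, embed_dim: int = 512):
        super().__init__()
        # One independent soft prompt per "style" in the dataset.
        self.prompt_bank = nn.Parameter(
            torch.randn(num_prompts, n_ctx, embed_dim) * 0.02
        )
        # Linear router: image feature -> one score per prompt.
        self.router = nn.Linear(embed_dim, num_prompts)

    def forward(self, image_feature: torch.Tensor):
        # image_feature: (batch, embed_dim) from the frozen CLIP image encoder.
        weights = F.softmax(self.router(image_feature), dim=-1)  # (batch, num_prompts)
        idx = weights.argmax(dim=-1)           # hard top-1 selection per image
        return self.prompt_bank[idx], weights  # (batch, n_ctx, embed_dim)

router = PromptRouter()
prompts, weights = router(torch.randn(8, 512))  # 8 images -> 8 selected prompts
```

A hard argmax is non-differentiable for the selection itself, so a practical version would train through the soft `weights` (or use a straight-through/Gumbel-softmax trick). The gating described above could then be approximated with an extra penalty, such as the cosine distance between each selected soft prompt's encoded text feature and the encoded hard prompt for the same class, which is what preserves the generalizable knowledge.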
MoCoOp stands out by dynamically selecting soft prompts to better match the diversity of the dataset, while also incorporating hard prompts to maintain generalizable knowledge. Here’s a comparison with existing approaches:
| Technique | Approach | Key Features | Generalization |
|---|---|---|---|
| CoOp | Static soft prompts | Learnable prompts for better task adaptation | Prone to overfitting on new classes |
| CoCoOp | Dynamic soft prompts | Instance-specific prompt generation | Improved unseen-class performance |
| MoCoOp | Mixture of soft prompts | Multiple prompts with routing for better match | Strongest, especially on complex datasets |
MoCoOp is particularly useful for datasets that span a wide variety of styles or patterns, and for applications where the model must generalize to new or unseen classes. Typical applications include few-shot image classification, base-to-new class generalization, and domain generalization.
The authors' open-source code is linked from the paper referenced below.
MoCoOp has shown significant improvements in few-shot learning, base-to-new class generalization, and domain generalization across multiple benchmarks. For base-to-new generalization, the reported results are:
| Method | Base Accuracy | New Accuracy | Harmonic Mean (H) |
|---|---|---|---|
| CLIP | 70.25% | 74.22% | 71.57% |
| CoOp | 82.64% | 68.00% | 74.02% |
| MoCoOp | 83.32% | 77.34% | 80.17% |
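The harmonic mean rewards methods that do well on both columns at once, so a method that trades new-class accuracy for base accuracy (as CoOp does here) is penalized. Note that benchmarks of this kind typically compute H per dataset and then average, so applying the formula directly to the averaged columns above reproduces the H column only approximately:

```python
def harmonic_mean(base: float, new: float) -> float:
    """Harmonic mean of base- and new-class accuracy (in percent)."""
    return 2 * base * new / (base + new)

print(harmonic_mean(83.32, 77.34))  # ~80.22, close to the reported 80.17
```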
MoCoOp also consistently outperforms methods such as CoOp, LASP, and CoCoOp in few-shot learning, with notable gains on datasets including Caltech101, Flowers102, and Stanford Cars.
MoCoOp effectively addresses key challenges in prompt learning for vision-language models by introducing a mixture of soft prompts and dynamic selection via a routing module. This approach significantly improves performance in few-shot learning, base-to-new class generalization, and domain generalization, outperforming existing methods across various datasets.
Valeriia Kuka, Head of Content at Learn Prompting, is passionate about making AI and ML accessible. Valeriia previously grew a 60K+ follower AI-focused social media account, earning reposts from Stanford NLP, Amazon Research, Hugging Face, and AI researchers. She has also worked with AI/ML newsletters and global communities with 100K+ members and authored clear and concise explainers and historical articles.
Du, Y., Niu, T., & Zhao, R. (2024). Mixture of Prompt Learning for Vision Language Models. arXiv preprint arXiv:2409.12011. https://arxiv.org/abs/2409.12011