In vision-language models like CLIP, learning to prompt, or prompt learning, is a method for improving how models handle visual recognition tasks by optimizing how they are "prompted" to process images and text. In other words, it's prompt engineering tailored to vision-language models. Typically, vision-language models align images and text in a shared feature space, allowing them to classify new images by comparing them with text descriptions rather than relying on pre-defined categories.
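To make this concrete, here is a minimal sketch of that zero-shot classification setup using the Hugging Face `transformers` library and the public `openai/clip-vit-base-patch32` checkpoint; the class names and image path are illustrative placeholders, not part of the original article.

```python
# Minimal sketch: CLIP-style zero-shot classification.
# Image and text embeddings share one feature space, so classification
# is just ranking candidate text descriptions against the image.
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

classes = ["cat", "dog", "car"]                   # illustrative categories
prompts = [f"a photo of a {c}" for c in classes]  # hand-crafted text prompts
image = Image.open("example.jpg")                 # hypothetical input image

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds image-text similarity scores; softmax turns them
# into a distribution over the candidate descriptions.
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(classes, probs[0].tolist())))
```

Note that no classifier head is trained here: swapping in new class names and prompts is enough to recognize new categories.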
A major challenge with these models is prompt engineering: finding the right words to describe image classes. This process is time-consuming and requires expertise, because even small changes in wording can significantly affect performance.
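To illustrate this sensitivity, the hedged sketch below (not the paper's method, just an assumed experiment) scores the same image under several hand-written templates; the templates, class names, and image path are illustrative.

```python
# Sketch: the same image, scored under different prompt wordings.
# Small template changes can noticeably shift the predicted distribution.
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

classes = ["abyssinian", "persian", "sphynx"]  # illustrative fine-grained classes
templates = ["a {}.", "a photo of a {}.", "a photo of a {}, a type of cat."]
image = Image.open("cat.jpg")                  # hypothetical input image

for template in templates:
    prompts = [template.format(c) for c in classes]
    inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        probs = model(**inputs).logits_per_image.softmax(dim=-1)[0]
    print(f"{template!r}: " +
          ", ".join(f"{c}={p:.2f}" for c, p in zip(classes, probs.tolist())))
```

Prompt learning sidesteps this manual trial and error by optimizing the prompt itself from data.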
Zhou, K., Yang, J., Loy, C. C., & Liu, Z. (2022). Learning to Prompt for Vision-Language Models. International Journal of Computer Vision, 130(9), 2337–2348. https://doi.org/10.1007/s11263-022-01653-1