Multitask Prompt Tuning
Multitask Prompt Tuning (MPT) is a parameter-efficient transfer learning approach designed to adapt large language models (LLMs) to a wide range of downstream tasks using shared soft prompts.
In contrast to traditional fine-tuning, MPT leverages prompt tuning, where only a small set of task-specific continuous vectors (soft prompts) is optimized.
Many tasks share overlapping knowledge and common patterns. Learning individual task-specific prompts from scratch for every new task can be redundant and inefficient. This is where Multitask Prompt Tuning (MPT) makes a significant contribution.
MPT combines multitask learning with low-rank adaptation to achieve exceptional efficiency, using as little as 0.035% of model parameters per task.
Core Ideas of MPT
- Multitask Learning for Prompts: MPT first distills information from multiple source tasks instead of training separate soft prompts for each task. By decomposing each prompt into shared and task-specific components, MPT extracts a single, transferable prompt that encapsulates general knowledge across tasks. This shared prompt can then be adapted to individual target tasks using efficient, low-rank multiplicative updates.
- Low-Rank Adaptation: The soft prompt matrix often contains redundant information. MPT applies low-rank factorization, representing the prompt matrix as the product of two smaller matrices. For example, rather than learning a full prompt matrix of size l × d (prompt length l, hidden dimension d), MPT factorizes it into A (size l × r) and B (size r × d), where the rank r is much smaller than l and d. This reduces the number of trainable parameters from l × d to r × (l + d) without compromising performance (see the sketch after this list).
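The low-rank update is applied multiplicatively, i.e., the shared prompt is modulated element-wise by the low-rank matrix A·B. Below is a minimal PyTorch sketch of that idea; the prompt length (100), hidden dimension (768), rank (1), and the class name TaskAdaptedPrompt are illustrative assumptions, not values taken from the paper.

```python
import torch
import torch.nn as nn

class TaskAdaptedPrompt(nn.Module):
    """Toy sketch: a shared soft prompt modulated by a low-rank,
    task-specific multiplicative update."""

    def __init__(self, prompt_len: int = 100, hidden_dim: int = 768, rank: int = 1):
        super().__init__()
        # Shared prompt learned across source tasks: shape (l, d).
        self.shared_prompt = nn.Parameter(torch.randn(prompt_len, hidden_dim) * 0.02)
        # Task-specific low-rank factors: A is (l, r), B is (r, d),
        # initialized so A @ B starts as all-ones (a neutral multiplicative update).
        self.task_A = nn.Parameter(torch.ones(prompt_len, rank))
        self.task_B = nn.Parameter(torch.full((rank, hidden_dim), 1.0 / rank))

    def forward(self) -> torch.Tensor:
        task_update = self.task_A @ self.task_B   # (l, d) matrix of rank r
        return self.shared_prompt * task_update   # element-wise task-adapted prompt

prompt = TaskAdaptedPrompt()
full_size = prompt.shared_prompt.numel()                    # l * d = 76,800
per_task = prompt.task_A.numel() + prompt.task_B.numel()    # r * (l + d) = 868
print(f"full prompt: {full_size}, per-task factors: {per_task}")
```

With these example shapes, the per-task factors amount to roughly 1% of the full prompt matrix, which is itself a tiny fraction of the backbone model's parameters.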
How MPT Works: A Two-Stage Process
The process consists of two main stages:
- Knowledge Distillation: MPT trains soft prompts on various source tasks (such as MNLI, QNLI, SST-2) using standard prompt tuning methods. Each task-specific prompt is decomposed into shared and unique components. The system then distills knowledge from these individual prompts into a single, transferable shared prompt that captures common patterns across all tasks. This step reduces redundancy and creates a robust foundation for prompt adaptation.
- Task-Specific Adaptation: The shared prompt, containing distilled general-purpose knowledge, is transferred to new tasks. For each target task, MPT applies lightweight, low-rank multiplicative updates to this shared prompt. This adaptation requires minimal additional parameters (about 0.035% per task) while effectively customizing the prompt for each specific task. A schematic sketch of both stages follows this list.
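To make the two stages concrete, here is a schematic PyTorch sketch of the workflow. The shapes, task names, and the choice to freeze the shared prompt in stage 2 are simplifying assumptions, and the distillation losses and training loops from the paper are omitted.

```python
import torch

l, d, r = 100, 768, 1   # assumed prompt length, hidden dimension, and rank

# Stage 1: source-task training (schematic).
# The shared prompt is learned jointly across source tasks, while each source
# task keeps its own low-rank factors. A distillation loss (omitted here)
# pushes the decomposed prompts toward separately trained teacher prompts.
source_tasks = ["mnli", "qnli", "sst2"]
shared_prompt = torch.randn(l, d, requires_grad=True)
source_factors = {
    task: (torch.ones(l, r, requires_grad=True),            # A_k
           torch.full((r, d), 1.0 / r, requires_grad=True))  # B_k
    for task in source_tasks
}

def compose_prompt(shared: torch.Tensor, A: torch.Tensor, B: torch.Tensor) -> torch.Tensor:
    """Task prompt = shared knowledge, modulated element-wise by a low-rank update."""
    return shared * (A @ B)

# ... multitask training loop over source_tasks would go here ...

# Stage 2: target-task adaptation (schematic).
# The distilled shared prompt is reused; only a fresh, tiny pair of factors is
# trained on the new task (the shared prompt is frozen here for simplicity).
shared_prompt = shared_prompt.detach()
A_new = torch.ones(l, r, requires_grad=True)
B_new = torch.full((r, d), 1.0 / r, requires_grad=True)
target_prompt = compose_prompt(shared_prompt, A_new, B_new)  # fed to the frozen LLM
```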
Advantages Over Other Techniques
MPT achieves an optimal balance between efficiency and performance. Here's how it compares to other techniques:
| Technique | Trainable Parameters | Strengths | Weaknesses |
|---|---|---|---|
| Full Fine-Tuning | 100% of model parameters | High accuracy | Computationally expensive, storage-heavy |
| Adapters | Small added layers | Efficient, good performance | Requires additional computation |
| BitFit | Only updates biases | Very lightweight | Lower performance on some tasks |
| Vanilla Prompt Tuning | ~0.01% of model parameters | Efficient, scalable | Needs careful initialization, slower convergence |
| SPoT | Single best prompt | Efficient reuse | Suboptimal for unseen tasks |
| ATTEMPT | Attention-based transfer | Strong performance | Uses more parameters than MPT |
| MPT | 0.035% per task | Best efficiency-performance tradeoff; effective knowledge sharing | Requires initial multitask training |
Key Benefits of MPT
- Parameter Efficiency: Uses only 0.035% of parameters per task, significantly reducing storage and computation costs
- Strong Performance: Achieves high accuracy across diverse benchmarks through effective knowledge transfer
- Few-Shot Learning: Performs well with limited training examples due to the shared prompt's generalized knowledge
- Deployment Flexibility: Enables serving multiple tasks by swapping task-specific updates while maintaining a single base model, as illustrated in the sketch below
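The sketch below shows one way such swapping could look in serving code, assuming the shared prompt and the per-task factor pairs are stored as small checkpoints. The file names and the helper function are hypothetical, not part of any MPT release.

```python
import torch

# Hypothetical serving sketch: one frozen backbone plus one shared prompt,
# with only the tiny per-task factor pairs swapped in as needed.
shared_prompt = torch.load("shared_prompt.pt")            # shape (l, d), loaded once

def prompt_for_task(task_name: str) -> torch.Tensor:
    """Compose the prompt for `task_name` from its stored low-rank factors."""
    factors = torch.load(f"tasks/{task_name}_factors.pt")  # {"A": (l, r), "B": (r, d)}
    return shared_prompt * (factors["A"] @ factors["B"])

# Switching tasks swaps a few hundred parameters, never the model weights.
sentiment_prompt = prompt_for_task("sst2")
nli_prompt = prompt_for_task("mnli")
```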
When to Use MPT
Consider using MPT in these scenarios:
- You need one model to efficiently handle multiple tasks
- You have limited computational resources and storage
- Your training data is limited
- You want to transfer knowledge between classification and generation tasks
Conclusion
Multitask Prompt Tuning (MPT) provides an efficient solution for adapting large language models to multiple tasks. Through shared knowledge distillation and low-rank updates, MPT reduces per-task parameters to 0.035% while maintaining strong performance. This approach makes MPT particularly valuable for resource-constrained environments and multi-task applications.
Valeriia Kuka
Valeriia Kuka, Head of Content at Learn Prompting, is passionate about making AI and ML accessible. Valeriia previously grew a 60K+ follower AI-focused social media account, earning reposts from Stanford NLP, Amazon Research, Hugging Face, and AI researchers. She has also worked with AI/ML newsletters and global communities with 100K+ members and authored clear and concise explainers and historical articles.
Footnotes
- Wang, Z., Panda, R., Karlinsky, L., Feris, R., Sun, H., & Kim, Y. (2023). Multitask Prompt Tuning Enables Parameter-Efficient Transfer Learning. https://arxiv.org/abs/2303.02861