Multitask Prompt Tuning

🟦 This article is rated medium
Reading Time: 3 minutes
Last updated on March 2, 2025

Valeriia Kuka

Multitask Prompt Tuning (MPT) is a parameter-efficient transfer learning approach designed to adapt large language models (LLMs) to a wide range of downstream tasks using shared soft prompts.

In contrast to traditional fine-tuning, which updates all model weights, MPT builds on prompt tuning, in which only a small set of task-specific continuous vectors (soft prompts) is optimized while the base model stays frozen.
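
Concretely, prompt tuning keeps every model weight frozen and prepends a small matrix of learnable "virtual token" embeddings to each input. Below is a minimal PyTorch sketch of that idea, assuming a Hugging Face-style model that exposes get_input_embeddings() and accepts inputs_embeds; the class name and defaults are illustrative, not reference code.

```python
import torch
import torch.nn as nn

class SoftPromptedModel(nn.Module):
    """Freeze a language model and prepend n learnable soft-prompt vectors."""

    def __init__(self, base_model, n_prompt_tokens: int = 20):
        super().__init__()
        self.base_model = base_model
        for p in self.base_model.parameters():      # freeze every model weight
            p.requires_grad = False
        d = base_model.config.hidden_size           # embedding dimension of the LM
        # The only trainable parameters: an (n x d) soft-prompt matrix.
        self.soft_prompt = nn.Parameter(torch.randn(n_prompt_tokens, d) * 0.02)

    def forward(self, input_ids, attention_mask):
        embeds = self.base_model.get_input_embeddings()(input_ids)
        batch = embeds.size(0)
        prompt = self.soft_prompt.unsqueeze(0).expand(batch, -1, -1)
        embeds = torch.cat([prompt, embeds], dim=1)  # prepend the soft prompt
        prompt_mask = torch.ones(batch, self.soft_prompt.size(0),
                                 dtype=attention_mask.dtype,
                                 device=attention_mask.device)
        attention_mask = torch.cat([prompt_mask, attention_mask], dim=1)
        return self.base_model(inputs_embeds=embeds, attention_mask=attention_mask)
```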

Many tasks share overlapping knowledge and common patterns. Learning individual task-specific prompts from scratch for every new task can be redundant and inefficient. This is where Multitask Prompt Tuning (MPT) makes a significant contribution.

MPT combines multitask learning with low-rank adaptation to achieve exceptional efficiency, using as little as 0.035% of model parameters per task.

Core Ideas of MPT

  • Multitask Learning for Prompts: MPT first distills information from multiple source tasks instead of training separate soft prompts for each task. By decomposing each prompt into shared and task-specific components, MPT extracts a single, transferable prompt that encapsulates general knowledge across tasks. This shared prompt can then be adapted to individual target tasks using efficient, low-rank multiplicative updates.

  • Low-Rank Adaptation: The soft prompt matrix often contains redundant information. MPT applies low-rank factorization to the prompt matrix, representing it as the product of two smaller matrices. For example, rather than learning a full prompt matrix XX of size nΓ—dn \times d, MPT factorizes it into UU (size nΓ—rn \times r) and VV (size rΓ—dr \times d), where rr is much smaller than dd. This reduces the number of trainable parameters from nΓ—dn \times d to r(n+d)r(n + d), without compromising performance.

How MPT Works: A Two-Stage Process

The process consists of two main stages:

  1. Knowledge Distillation: MPT trains soft prompts on various source tasks (such as MNLI, QNLI, SST-2) using standard prompt tuning methods. Each task-specific prompt is decomposed into shared and unique components. The system then distills knowledge from these individual prompts into a single, transferable shared prompt that captures common patterns across all tasks. This step reduces redundancy and creates a robust foundation for prompt adaptation (a generic distillation-loss sketch follows this list).

  2. Task-Specific Adaptation: The shared prompt, containing distilled general-purpose knowledge, is transferred to new tasks. For each target task, MPT applies lightweight, low-rank multiplicative updates to this shared prompt. This adaptation process requires minimal additional parameters (about 0.035% per task) while effectively customizing the prompt for each specific task.
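
As a rough illustration of the distillation step, the snippet below computes a standard temperature-scaled KL divergence between logits produced with a per-task teacher prompt and logits produced with the shared student prompt. The random tensors stand in for real model outputs, and the paper's full training objective may combine such a term with additional losses; this is only a sketch of the general idea.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature: float = 2.0):
    """Temperature-scaled KL divergence between student and teacher outputs."""
    s = F.log_softmax(student_logits / temperature, dim=-1)
    t = F.softmax(teacher_logits / temperature, dim=-1)
    return F.kl_div(s, t, reduction="batchmean") * temperature ** 2

# Toy check: a batch of 4 examples over 10 classes, with random logits.
student = torch.randn(4, 10, requires_grad=True)   # produced with the shared prompt
teacher = torch.randn(4, 10)                       # produced with a per-task prompt
loss = distillation_loss(student, teacher)
loss.backward()  # gradients flow only into the student (shared-prompt) side
print(round(loss.item(), 4))
```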

Advantages Over Other Techniques

MPT achieves an optimal balance between efficiency and performance. Here's how it compares to other techniques:

| Technique | Trainable Parameters | Strengths | Weaknesses |
|---|---|---|---|
| Full Fine-Tuning | 100% of model parameters | High accuracy | Computationally expensive, storage-heavy |
| Adapters | Small added layers | Efficient, good performance | Requires additional computation |
| BitFit | Only updates biases | Very lightweight | Lower performance on some tasks |
| Vanilla Prompt Tuning | ~0.01% of model parameters | Efficient, scalable | Needs careful initialization, slower convergence |
| SPoT | Single best prompt | Efficient reuse | Suboptimal for unseen tasks |
| ATTEMPT | Attention-based transfer | Strong performance | Uses more parameters than MPT |
| MPT | 0.035% per task | Best efficiency-performance tradeoff; effective knowledge sharing | Requires initial multitask training |

Key Benefits of MPT

  1. Parameter Efficiency: Uses only 0.035% of parameters per task, significantly reducing storage and computation costs
  2. Strong Performance: Achieves high accuracy across diverse benchmarks through effective knowledge transfer
  3. Few-Shot Learning: Performs well with limited training examples due to the shared prompt's generalized knowledge
  4. Deployment Flexibility: Enables serving multiple tasks by swapping task-specific updates while maintaining a single base model (sketched below)
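
To make point 4 concrete, here is a small sketch of that serving pattern: the base model and the shared prompt are stored once, and switching tasks only swaps a tiny pair of low-rank factors. The task names and sizes are made up for illustration.

```python
import torch

# One shared prompt (n x d), distilled once and stored once,
# plus a tiny (U, V) pair per task.
n, d, r = 20, 768, 1
shared_prompt = torch.randn(n, d)
task_updates = {
    "sentiment": (torch.ones(n, r), torch.full((r, d), 1.0 / r)),
    "nli":       (torch.ones(n, r), torch.full((r, d), 1.0 / r)),
}

def prompt_for(task: str) -> torch.Tensor:
    """Rebuild a task-specific soft prompt by applying its low-rank
    multiplicative update to the single shared prompt."""
    U, V = task_updates[task]
    return shared_prompt * (U @ V)

# Switching tasks swaps only r * (n + d) numbers; the base LM never changes.
print(prompt_for("sentiment").shape)  # torch.Size([20, 768])
```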

When to Use MPT

Consider using MPT in these scenarios:

  • You need one model to efficiently handle multiple tasks
  • You have limited computational resources and storage
  • Your training data is limited
  • You want to transfer knowledge between classification and generation tasks

Conclusion

Multitask Prompt Tuning (MPT) provides an efficient solution for adapting large language models to multiple tasks. Through shared knowledge distillation and low-rank updates, MPT reduces per-task parameters to 0.035% while maintaining strong performance. This approach makes MPT particularly valuable for resource-constrained environments and multi-task applications.

Valeriia Kuka

Valeriia Kuka, Head of Content at Learn Prompting, is passionate about making AI and ML accessible. Valeriia previously grew a 60K+ follower AI-focused social media account, earning reposts from Stanford NLP, Amazon Research, Hugging Face, and AI researchers. She has also worked with AI/ML newsletters and global communities with 100K+ members and authored clear and concise explainers and historical articles.

Footnotes

  1. Wang, Z., Panda, R., Karlinsky, L., Feris, R., Sun, H., & Kim, Y. (2023). Multitask Prompt Tuning Enables Parameter-Efficient Transfer Learning. https://arxiv.org/abs/2303.02861