Multitask Prompt Tuning

🟦 This article is rated medium
Reading Time: 3 minutes
Last updated on March 2, 2025

Valeriia Kuka

Multitask Prompt Tuning (MPT) is a parameter-efficient transfer learning approach designed to adapt large language models (LLMs) to a wide range of downstream tasks using shared soft prompts.

In contrast to traditional fine-tuning, which updates all model weights, MPT builds on prompt tuning, in which only a small set of task-specific continuous vectors (soft prompts) is optimized while the base model stays frozen.
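
Concretely, prompt tuning keeps every model weight frozen and prepends a small matrix of learnable "virtual token" embeddings to each input. Below is a minimal PyTorch sketch of that idea, assuming a Hugging Face-style model that exposes get_input_embeddings() and accepts inputs_embeds; the class name and defaults are illustrative, not reference code.

```python
import torch
import torch.nn as nn

class SoftPromptedModel(nn.Module):
    """Freeze a language model and prepend n learnable soft-prompt vectors."""

    def __init__(self, base_model, n_prompt_tokens: int = 20):
        super().__init__()
        self.base_model = base_model
        for p in self.base_model.parameters():      # freeze every model weight
            p.requires_grad = False
        d = base_model.config.hidden_size           # embedding dimension of the LM
        # The only trainable parameters: an (n x d) soft-prompt matrix.
        self.soft_prompt = nn.Parameter(torch.randn(n_prompt_tokens, d) * 0.02)

    def forward(self, input_ids, attention_mask):
        embeds = self.base_model.get_input_embeddings()(input_ids)
        batch = embeds.size(0)
        prompt = self.soft_prompt.unsqueeze(0).expand(batch, -1, -1)
        embeds = torch.cat([prompt, embeds], dim=1)  # prepend the soft prompt
        prompt_mask = torch.ones(batch, self.soft_prompt.size(0),
                                 dtype=attention_mask.dtype,
                                 device=attention_mask.device)
        attention_mask = torch.cat([prompt_mask, attention_mask], dim=1)
        return self.base_model(inputs_embeds=embeds, attention_mask=attention_mask)
```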

Many tasks share overlapping knowledge and common patterns. Learning individual task-specific prompts from scratch for every new task can be redundant and inefficient. This is where Multitask Prompt Tuning (MPT) makes a significant contribution.

MPT combines multitask learning with low-rank adaptation to achieve exceptional efficiency, using as little as 0.035% of model parameters per task.

Core Ideas of MPT

  • Multitask Learning for Prompts: MPT first distills information from multiple source tasks instead of training separate soft prompts for each task. By decomposing each prompt into shared and task-specific components, MPT extracts a single, transferable prompt that encapsulates general knowledge across tasks. This shared prompt can then be adapted to individual target tasks using efficient, low-rank multiplicative updates.

  • Low-Rank Adaptation: The soft prompt matrix often contains redundant information. MPT applies low-rank factorization to the prompt matrix, representing it as the product of two smaller matrices. For example, rather than learning a full prompt matrix XX of size nΓ—dn \times d, MPT factorizes it into UU (size nΓ—rn \times r) and VV (size rΓ—dr \times d), where rr is much smaller than dd. This reduces the number of trainable parameters from nΓ—dn \times d to r(n+d)r(n + d), without compromising performance.

How MPT Works: A Two-Stage Process

The process consists of two main stages:

  1. Knowledge Distillation: MPT trains soft prompts on various source tasks (such as MNLI, QNLI, SST-2) using standard prompt tuning methods. Each task-specific prompt is decomposed into shared and unique components. The system then distills knowledge from these individual prompts into a single, transferable shared prompt that captures common patterns across all tasks. This step reduces redundancy and creates a robust foundation for prompt adaptation (a generic distillation-loss sketch follows this list).

  2. Task-Specific Adaptation: The shared prompt, containing distilled general-purpose knowledge, is transferred to new tasks. For each target task, MPT applies lightweight, low-rank multiplicative updates to this shared prompt. This adaptation process requires minimal additional parameters (about 0.035% per task) while effectively customizing the prompt for each specific task.
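
As a rough illustration of the distillation step, the snippet below computes a standard temperature-scaled KL divergence between logits produced with a per-task teacher prompt and logits produced with the shared student prompt. The random tensors stand in for real model outputs, and the paper's full training objective may combine such a term with additional losses; this is only a sketch of the general idea.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature: float = 2.0):
    """Temperature-scaled KL divergence between student and teacher outputs."""
    s = F.log_softmax(student_logits / temperature, dim=-1)
    t = F.softmax(teacher_logits / temperature, dim=-1)
    return F.kl_div(s, t, reduction="batchmean") * temperature ** 2

# Toy check: a batch of 4 examples over 10 classes, with random logits.
student = torch.randn(4, 10, requires_grad=True)   # produced with the shared prompt
teacher = torch.randn(4, 10)                       # produced with a per-task prompt
loss = distillation_loss(student, teacher)
loss.backward()  # gradients flow only into the student (shared-prompt) side
print(round(loss.item(), 4))
```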

Advantages Over Other Techniques

MPT achieves an optimal balance between efficiency and performance. Here's how it compares to other techniques:

| Technique | Trainable Parameters | Strengths | Weaknesses |
|---|---|---|---|
| Full Fine-Tuning | 100% of model parameters | High accuracy | Computationally expensive, storage-heavy |
| Adapters | Small added layers | Efficient, good performance | Requires additional computation |
| BitFit | Only updates biases | Very lightweight | Lower performance on some tasks |
| Vanilla Prompt Tuning | ~0.01% of model parameters | Efficient, scalable | Needs careful initialization, slower convergence |
| SPoT | Single best prompt | Efficient reuse | Suboptimal for unseen tasks |
| ATTEMPT | Attention-based transfer | Strong performance | Uses more parameters than MPT |
| MPT | 0.035% per task | Best efficiency-performance tradeoff; effective knowledge sharing | Requires initial multitask training |

Key Benefits of MPT

  1. Parameter Efficiency: Uses only 0.035% of parameters per task, significantly reducing storage and computation costs
  2. Strong Performance: Achieves high accuracy across diverse benchmarks through effective knowledge transfer
  3. Few-Shot Learning: Performs well with limited training examples due to the shared prompt's generalized knowledge
  4. Deployment Flexibility: Enables serving multiple tasks by swapping task-specific updates while maintaining a single base model (sketched below)
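
To make point 4 concrete, here is a small sketch of that serving pattern: the base model and the shared prompt are stored once, and switching tasks only swaps a tiny pair of low-rank factors. The task names and sizes are made up for illustration.

```python
import torch

# One shared prompt (n x d), distilled once and stored once,
# plus a tiny (U, V) pair per task.
n, d, r = 20, 768, 1
shared_prompt = torch.randn(n, d)
task_updates = {
    "sentiment": (torch.ones(n, r), torch.full((r, d), 1.0 / r)),
    "nli":       (torch.ones(n, r), torch.full((r, d), 1.0 / r)),
}

def prompt_for(task: str) -> torch.Tensor:
    """Rebuild a task-specific soft prompt by applying its low-rank
    multiplicative update to the single shared prompt."""
    U, V = task_updates[task]
    return shared_prompt * (U @ V)

# Switching tasks swaps only r * (n + d) numbers; the base LM never changes.
print(prompt_for("sentiment").shape)  # torch.Size([20, 768])
```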

When to Use MPT

Consider using MPT in these scenarios:

  • You need one model to efficiently handle multiple tasks
  • You have limited computational resources and storage
  • Your training data is limited
  • You want to transfer knowledge between classification and generation tasks

Conclusion

Multitask Prompt Tuning (MPT) provides an efficient solution for adapting large language models to multiple tasks. Through shared knowledge distillation and low-rank updates, MPT reduces per-task parameters to 0.035% while maintaining strong performance. This approach makes MPT particularly valuable for resource-constrained environments and multi-task applications.

Valeriia Kuka

Valeriia Kuka, Head of Content at Learn Prompting, is passionate about making AI and ML accessible. Valeriia previously grew a 60K+ follower AI-focused social media account, earning reposts from Stanford NLP, Amazon Research, Hugging Face, and AI researchers. She has also worked with AI/ML newsletters and global communities with 100K+ members and authored clear and concise explainers and historical articles.

Footnotes

  1. Wang, Z., Panda, R., Karlinsky, L., Feris, R., Sun, H., & Kim, Y. (2023). Multitask Prompt Tuning Enables Parameter-Efficient Transfer Learning. https://arxiv.org/abs/2303.02861