P-Tuning v2: Prompt Tuning Can Be Comparable to Fine-tuning Universally Across Scales and Tasks

Published 14 Oct 2021 in cs.CL | (2110.07602v3)

Abstract: Prompt tuning, which only tunes continuous prompts with a frozen language model, substantially reduces per-task storage and memory usage at training. However, in the context of NLU, prior work reveals that prompt tuning does not perform well for normal-sized pretrained models. We also find that existing methods of prompt tuning cannot handle hard sequence labeling tasks, indicating a lack of universality. We present a novel empirical finding that properly optimized prompt tuning can be universally effective across a wide range of model scales and NLU tasks. It matches the performance of finetuning while having only 0.1%-3% tuned parameters. Our method P-Tuning v2 is an implementation of Deep Prompt Tuning (Li and Liang, 2021; Qin and Eisner, 2021) optimized and adapted for NLU. Given the universality and simplicity of P-Tuning v2, we believe it can serve as an alternative to finetuning and a strong baseline for future research. Our code and data are released at https://github.com/THUDM/P-tuning-v2.




Summary

The paper demonstrates that deep prompt tuning with continuous prompts across multiple layers can rival full fine-tuning in varied NLU tasks.
It uses reparameterization and multi-task learning to optimize performance on models ranging from 300M to 10B parameters.
The approach reduces per-task storage and training memory by tuning only 0.1%-3% of the model's parameters per task while keeping accuracy comparable to fine-tuning.

An Analysis of P-Tuning v2: Prompt Tuning for Efficient Natural Language Understanding

Introduction

The paper "P-Tuning v2: Prompt Tuning Can Be Comparable to Fine-tuning Universally Across Scales and Tasks" by Xiao Liu et al. studies how to optimize and apply prompt tuning for Natural Language Understanding (NLU). It responds to the cost of full fine-tuning and to the limitations of earlier prompt tuning methods, and proposes an approach intended to work across model scales and NLU tasks.

Background and Motivation

For pretrained language models (PLMs) such as BERT, RoBERTa, and GPT, fine-tuning the entire set of parameters has been the dominant way to adapt a model to a specific task. However, fine-tuning is computationally heavy and requires storing a full copy of the model for every task, so storage grows with the number of tasks. Prompt tuning offers an alternative: the language model's parameters stay frozen, and only a small number of task-specific parameters, the continuous prompts, are tuned. Despite its promise, earlier prompt tuning methods fall short on normal-sized models (those with fewer parameters) and on hard sequence labeling tasks.
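As a rough illustration of input-layer prompt tuning as described above, the sketch below prepends trainable prompt vectors to frozen token embeddings. It uses NumPy with toy sizes. The names `frozen_embedding`, `prompt`, and `build_inputs`, and all dimensions, are hypothetical and not taken from the paper or its code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes chosen for illustration only.
VOCAB_SIZE, HIDDEN, PROMPT_LEN = 100, 16, 4

# Stands in for the pretrained embedding table; it is never updated.
frozen_embedding = rng.normal(size=(VOCAB_SIZE, HIDDEN))

# The only task-specific parameters: PROMPT_LEN continuous prompt vectors.
prompt = rng.normal(scale=0.02, size=(PROMPT_LEN, HIDDEN))


def build_inputs(token_ids, prompt):
    """Prepend the continuous prompt to the frozen token embeddings."""
    token_vectors = frozen_embedding[np.asarray(token_ids)]
    return np.concatenate([prompt, token_vectors], axis=0)


inputs = build_inputs([5, 17, 42], prompt)
print(inputs.shape)  # (PROMPT_LEN + 3, HIDDEN)
print(prompt.size, "trainable values vs", frozen_embedding.size, "frozen embedding values")
```

In a real setup, gradients would flow only into `prompt`, so each additional task costs one small prompt matrix rather than a full model copy.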

P-Tuning v2: Core Methodology

P-Tuning v2 advances the concept of prompt tuning by integrating several key improvements:

Deep Prompt Tuning: The approach leverages the idea of adding continuous prompts at multiple layers of the pretrained model rather than just the input layer. This enhancement allows a greater number of tunable parameters and more direct impact on the model's predictions.
Optimization and Implementation:
- Reparameterization: The paper examines reparameterization encoders (e.g., an MLP) that transform the trainable embeddings. Its benefit is not uniform: it helps on some tasks and datasets and not on others.
- Prompt Length: The optimal length for prompts is empirically found to vary across tasks, with simpler classification tasks benefiting from shorter prompts and more complex sequence labeling tasks preferring longer ones.
- Multi-task Learning: Jointly optimizing multiple tasks with shared continuous prompts, before adapting to individual tasks, provides better initialization and can improve performance.
- Classification Head: In contrast to methods that use a language modeling head with verbalizers, P-Tuning v2 employs a randomly initialized classification head, which is simpler and also applies to sequence labeling.
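The components listed above can be sketched in a few lines of NumPy. This is a minimal toy, not the authors' implementation. It assumes that per-layer prompts are realized as trainable key and value vectors prepended inside each attention layer (the prefix-style layout), and it uses a single-head attention stack as a stand-in for a pretrained model. All names and sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes for illustration; not the paper's settings.
NUM_LAYERS, HIDDEN, PROMPT_LEN, NUM_CLASSES = 2, 8, 4, 3


def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


# Frozen stand-in for a pretrained model: one single-head attention per layer.
frozen_layers = [
    {name: rng.normal(scale=0.3, size=(HIDDEN, HIDDEN)) for name in ("wq", "wk", "wv")}
    for _ in range(NUM_LAYERS)
]

# Trainable pieces: raw prompt embeddings, a reparameterization MLP, and a
# randomly initialized classification head.
raw_prompt = rng.normal(scale=0.02, size=(PROMPT_LEN, HIDDEN))
mlp_w1 = rng.normal(scale=0.3, size=(HIDDEN, HIDDEN))
mlp_w2 = rng.normal(scale=0.3, size=(HIDDEN, NUM_LAYERS * 2 * HIDDEN))
head = rng.normal(scale=0.02, size=(HIDDEN, NUM_CLASSES))


def reparameterize(raw, w1, w2):
    """Map raw prompts to per-layer keys and values, each (NUM_LAYERS, PROMPT_LEN, HIDDEN)."""
    out = (np.tanh(raw @ w1) @ w2).reshape(PROMPT_LEN, NUM_LAYERS, 2, HIDDEN)
    return out[:, :, 0].transpose(1, 0, 2), out[:, :, 1].transpose(1, 0, 2)


def forward(x, prefix_k, prefix_v, head):
    """x: (seq_len, HIDDEN) token states. Returns (NUM_CLASSES,) logits."""
    for i, layer in enumerate(frozen_layers):
        q = x @ layer["wq"]
        # Each layer attends over its own prompt vectors plus the real tokens.
        k = np.concatenate([prefix_k[i], x @ layer["wk"]], axis=0)
        v = np.concatenate([prefix_v[i], x @ layer["wv"]], axis=0)
        attn = softmax(q @ k.T / np.sqrt(HIDDEN))
        x = x + attn @ v  # residual update; sequence length is unchanged
    return x[0] @ head  # classify from the first token's final state


prefix_k, prefix_v = reparameterize(raw_prompt, mlp_w1, mlp_w2)
tokens = rng.normal(size=(5, HIDDEN))
logits = forward(tokens, prefix_k, prefix_v, head)
print(logits.shape)
```

In training, only `raw_prompt`, the MLP weights, and `head` would receive gradients, while `frozen_layers` stay fixed. Skipping `reparameterize` and learning `prefix_k` and `prefix_v` directly corresponds to the variant without reparameterization. For sequence labeling, the same head could be applied to every token state (`x @ head`) instead of only the first.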

Experimental Results

The empirical evaluation covers a broad spectrum of model sizes and tasks:

Model Scales: Experiments on models ranging from 300M to 10B parameters (e.g., BERT-large, RoBERTa-large, GLM-xlarge/xxlarge) demonstrate that P-Tuning v2 consistently matches or rivals the performance of full fine-tuning, regardless of model scale.
Task Diversity: The paper evaluates P-Tuning v2 on benchmarks such as GLUE and SuperGLUE, covering simple classification tasks, multiple-choice tasks, and hard sequence labeling tasks (NER, extractive QA, SRL). Across these tasks and datasets, P-Tuning v2 performs on par with or better than fine-tuning.

Key Findings and Implications

The robust performance of P-Tuning v2 across different model scales and NLU tasks indicates several important implications:

Efficiency: By tuning only 0.1%-3% of the parameters that full fine-tuning would update for each task, P-Tuning v2 substantially reduces per-task storage and memory use during training.
Scalability: P-Tuning v2’s capability to handle models from 300M to 10B parameters equally well offers valuable flexibility for deploying models under various resource constraints without sacrificing performance.
Versatility: Its applicability to both simple classification tasks and hard sequence labeling tasks positions P-Tuning v2 as a strong baseline for future NLU research.
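To make the 0.1%-3% figure concrete, the arithmetic below counts prompt parameters for a BERT-large-sized backbone (24 layers, hidden size 1024, roughly 335M parameters). It assumes each layer stores one key vector and one value vector per prompt position, and it ignores the small classification head. The prompt lengths and the task count are hypothetical examples, not values reported in the paper.

```python
def prefix_param_count(num_layers, hidden_size, prompt_len):
    """Prompt parameters, assuming one key and one value vector per layer and position."""
    return 2 * num_layers * prompt_len * hidden_size


BACKBONE_PARAMS = 335_000_000       # approximate BERT-large size
NUM_LAYERS, HIDDEN_SIZE = 24, 1024  # BERT-large configuration
PROMPT_LENGTHS = (8, 20, 100)       # hypothetical prompt lengths

fractions = {}
for prompt_len in PROMPT_LENGTHS:
    tuned = prefix_param_count(NUM_LAYERS, HIDDEN_SIZE, prompt_len)
    fractions[prompt_len] = tuned / BACKBONE_PARAMS
    print(f"prompt_len={prompt_len:>3}: {tuned:>9,} tuned ({fractions[prompt_len]:.2%} of backbone)")

# Storage for several tasks: fine-tuning keeps one full copy per task, while
# prompt tuning keeps one shared backbone plus one small prompt per task.
NUM_TASKS = 10
per_task_prompt = prefix_param_count(NUM_LAYERS, HIDDEN_SIZE, 20)
fine_tuning_storage = NUM_TASKS * BACKBONE_PARAMS
prompt_tuning_storage = BACKBONE_PARAMS + NUM_TASKS * per_task_prompt
print(f"fine-tuning:   {fine_tuning_storage:,} parameters stored")
print(f"prompt tuning: {prompt_tuning_storage:,} parameters stored")
```

Under these assumptions, every listed prompt length falls inside the 0.1%-3% range, and the storage gap widens as the number of tasks grows.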

Future Directions

P-Tuning v2 leaves several directions open for future research:

Extending to Other Domains: Applying P-Tuning v2’s methodology beyond NLU to areas like natural language generation (NLG) or multimodal tasks involving text and vision.
Exploring Prompt Structures: Investigating more sophisticated prompt structures and reparameterization techniques to enhance the adaptability and performance of prompt tuning for even more complex tasks.
Optimizing Multi-task Learning: Further refining multi-task learning strategies to maximize the efficiency and performance gains from shared continuous prompts.

Conclusion

P-Tuning v2 shows that properly optimized deep prompt tuning can match fine-tuning on NLU tasks across model scales and task types while tuning only a small fraction of the parameters. This makes it a practical alternative to traditional fine-tuning and a strong baseline for future work.

For more details, the code and data are available at https://github.com/THUDM/P-tuning-v2.
