# The Power of Scale for Parameter-Efficient Prompt Tuning

Brian Lester, Rami Al-Rfou, Noah Constant - Google Research - EMNLP 2021

## Abstract

*Prompt tuning* is a simple yet effective mechanism for learning "soft prompts" that condition frozen language models to perform specific downstream tasks. Soft prompts are learned through backpropagation and can be tuned to incorporate signals from any number of labeled examples. This "soft prompt" is trained end-to-end and can condense the signal from a full labeled dataset. We [release the code and model checkpoints](https://github.com/google-research/prompt-tuning) to reproduce our experiments.

Fine-tuning vs. (**soft**) prompt tuning vs. (**hard**) prompt engineering/optimization:

![](Tuning%20Techniques%20Comparison.png)

Key contributions are:

1. Proposing prompt tuning and showing its competitiveness with model tuning in the regime of large language models.
2. Ablating many design choices, and showing that quality and robustness improve with scale.
3. Showing that prompt tuning outperforms model tuning on domain-shift problems.
4. Proposing "prompt ensembling" and showing its effectiveness.

## Method

We freeze the entire pre-trained model and only allow an additional $k$ tunable tokens per downstream task to be prepended to the input text. We consider three possible ways to initialize the prompt representations:

1. The simplest is to train from scratch, using random initialization.
2. A more sophisticated option is to initialize each prompt token to an embedding drawn from the model's vocabulary.
3. For classification tasks, a third option is to initialize the prompt with embeddings that enumerate the output classes. Initializing the prompt with the embeddings of the valid target tokens should prime the model to restrict its output to the legal output classes.

Normally, prompting is done by prepending a series of tokens, $P$, to the input $X$, such that the model maximizes the likelihood of the correct $Y$, $\Pr_\theta(Y \mid [P; X])$, while keeping the model parameters, $\theta$, fixed. Finding an optimal prompt thus requires the selection of prompt tokens, through either manual search or non-differentiable search methods.

Our models are likewise trained to maximize the probability of $Y$, but only the prompt parameters are updated; the model parameters $\theta$ remain frozen. Our soft prompts are represented as a parameter $P_e \in \mathbb{R}^{p \times e}$, where $p$ is the length of the prompt and $e$ is the dimension of the embedding space. Prompt tuning removes the restriction that the prompt $P$ be parameterized by the model's token embedding table, which is part of $\theta$; instead, the prompt has its own dedicated parameters, $\theta_P$, that can be updated. Our new conditional generation is $\Pr_{\theta; \theta_P}(Y \mid [P; X])$ and can be trained by maximizing the likelihood of $Y$ via backpropagation, while applying gradient updates only to $\theta_P$.
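The mechanics above are straightforward to sketch. Below is a minimal, PyTorch-style illustration (the released implementation is JAX/T5X-based; `frozen_model`, its `embed` method, and the `inputs_embeds` argument are assumed interfaces, not the authors' API). It stores $P_e \in \mathbb{R}^{p \times e}$ as the only trainable parameter, supports the three initialization options, and concatenates the prompt with the input embeddings before calling the frozen model.

```python
import torch
import torch.nn as nn


class PromptTuningWrapper(nn.Module):
    """Sketch: prepend p learned prompt embeddings (theta_P) to the input
    embeddings of a frozen model (theta). `frozen_model` is hypothetical and
    assumed to expose `embed(input_ids)` and accept `inputs_embeds`."""

    def __init__(self, frozen_model, prompt_length, embed_dim, init_embeddings=None):
        super().__init__()
        self.model = frozen_model
        for param in self.model.parameters():
            param.requires_grad = False  # theta stays fixed

        if init_embeddings is not None:
            # Options 2/3: copy embeddings of sampled vocabulary items
            # or of the class-label tokens.
            prompt = init_embeddings[:prompt_length].detach().clone()
        else:
            # Option 1: random initialization.
            prompt = torch.randn(prompt_length, embed_dim) * 0.5
        # P_e in R^{p x e}: the only trainable parameters.
        self.prompt = nn.Parameter(prompt)

    def forward(self, input_ids):
        x = self.model.embed(input_ids)                          # (batch, n, e)
        p = self.prompt.unsqueeze(0).expand(x.size(0), -1, -1)   # (batch, p, e)
        # Condition the frozen model on [P; X].
        return self.model(inputs_embeds=torch.cat([p, x], dim=1))
```

Only `wrapper.prompt` would be handed to the optimizer, e.g. `torch.optim.Adam([wrapper.prompt])`, so gradients flow into $\theta_P$ alone.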
## Evaluation

We are the first to show that prompt tuning alone (with no intermediate-layer prefixes or task-specific output layers) is sufficient to be competitive with fine-tuning (here: "model tuning").

![](Prompt%20tuning%20is%20competitive%20with%20fine-tuning%20on%20large%20models.png)

Impact of ablations of various hyperparameters on prompt tuning performance: in our "default" configuration (green X), quality improves stably with model size. Across all ablations, the largest (XXL) model is the most robust to hyperparameter choice. The shorter the prompt, the fewer new parameters must be tuned; we find that XXL performs well even with a single-token prompt.

![](Prompt%20Tuning%20Ablation%20Study.png)

To test the interpretability of our learned soft prompts, we compute the nearest neighbors to each prompt token from the frozen model's vocabulary, using cosine distance between the vocabulary embedding vector and the prompt token representation as the similarity metric. We observe that for a given learned prompt token, the top-5 nearest neighbors form tight semantic clusters. For example, we see lexically similar clusters such as *{ Technology / technology / Technologies / technological / technologies }*, as well as more diverse but still strongly related clusters such as *{ entirely / completely / totally / altogether / 100% }*. The nature of these clusters suggests that the prompts are in fact learning "word-like" representations. (A sketch of this nearest-neighbor probe appears after the Conclusions.)

## Related Work

Li and Liang (2021) propose "prefix tuning" and show strong results on generative tasks. This method freezes the model parameters and backpropagates the error during tuning to prefix activations prepended to each layer in the encoder stack, including the input layer.

- Xiang Lisa Li and Percy Liang. 2021. Prefix-Tuning: Optimizing Continuous Prompts for Generation. In *Proceedings of the 59th Annual Meeting of the ACL and the 11th International Joint Conference on NLP*, pages 4582–4597.

Hambardzumyan et al. (2021) simplify this recipe by restricting the trainable parameters to the input and output subnetworks of a masked language model, and show reasonable results on classification tasks.

- Karen Hambardzumyan, Hrant Khachatrian, and Jonathan May. 2021. WARP: Word-level Adversarial ReProgramming. In *Proceedings of the 59th Annual Meeting of the ACL and the 11th International Joint Conference on NLP*, pages 4921–4933.

## Conclusions

We retain the efficient serving benefits of frozen models. On zero-shot domain transfer, we found that prompt tuning leads to improved generalization. By capturing the task definition in the prompt while keeping the generalist parameters fixed, we are able to achieve better resilience to domain shifts. Model tuning may be over-parameterized and more prone to overfitting the training task, to the detriment of similar tasks in different domains.
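As an appendix-style companion to the nearest-neighbor probe described under Evaluation, here is a minimal sketch. It assumes a trained `prompt` matrix ($P_e$), the frozen model's `vocab_embeddings` table, and a `tokenizer` with a `convert_ids_to_tokens` method; all of these names are assumptions for illustration, not the authors' code.

```python
import torch
import torch.nn.functional as F


def nearest_vocab_neighbors(prompt, vocab_embeddings, tokenizer, k=5):
    """For each learned prompt token, return the k vocabulary items whose
    embeddings are closest under cosine similarity."""
    p = F.normalize(prompt, dim=-1)            # (prompt_len, e)
    v = F.normalize(vocab_embeddings, dim=-1)  # (vocab_size, e)
    sims = p @ v.t()                           # cosine similarity matrix
    topk = sims.topk(k, dim=-1).indices        # (prompt_len, k)
    return [tokenizer.convert_ids_to_tokens(row.tolist()) for row in topk]
```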