RLPrompt: Optimizing Discrete Text Prompts with Reinforcement Learning

Published 25 May 2022 in cs.CL and cs.LG | (2205.12548v3)

Abstract: Prompting has shown impressive success in enabling large pretrained LMs to perform diverse NLP tasks, especially when only few downstream data are available. Automatically finding the optimal prompt for each task, however, is challenging. Most existing work resorts to tuning soft prompt (e.g., embeddings) which falls short of interpretability, reusability across LMs, and applicability when gradients are not accessible. Discrete prompt, on the other hand, is difficult to optimize, and is often created by "enumeration (e.g., paraphrasing)-then-selection" heuristics that do not explore the prompt space systematically. This paper proposes RLPrompt, an efficient discrete prompt optimization approach with reinforcement learning (RL). RLPrompt formulates a parameter-efficient policy network that generates the desired discrete prompt after training with reward. To overcome the complexity and stochasticity of reward signals by the large LM environment, we incorporate effective reward stabilization that substantially enhances the training efficiency. RLPrompt is flexibly applicable to different types of LMs, such as masked (e.g., BERT) and left-to-right models (e.g., GPTs), for both classification and generation tasks. Experiments on few-shot classification and unsupervised text style transfer show superior performance over a wide range of existing finetuning or prompting methods. Interestingly, the resulting optimized prompts are often ungrammatical gibberish text; and surprisingly, those gibberish prompts are transferrable between different LMs to retain significant performance, indicating LM prompting may not follow human language patterns.

Summary

The paper proposes a parameter-efficient policy network to systematically optimize discrete text prompts using reward-driven reinforcement learning.
The approach needs no gradients from the LM and improves performance on tasks such as few-shot classification and unsupervised text style transfer.
Reward stabilization improves training efficiency, and the learned prompts transfer between LMs, which points toward scalable, reusable prompt optimization for NLP applications.

Discrete Text Prompt Optimization via Reinforcement Learning

The paper presents an approach to optimizing discrete text prompts with reinforcement learning (RL). This shifts away from prompt-tuning methods that optimize soft prompts and toward an RL framework built for discrete prompts. The approach targets the limitations of soft prompts in interpretability, reusability across LMs, and applicability when gradients are inaccessible.

Prompting has emerged as a powerful technique for applying large pretrained LMs such as GPT and BERT to diverse NLP tasks, especially when little task-specific data is available. Finding the optimal prompt for each task, however, remains difficult. Soft prompt tuning uses gradient-based optimization, but the resulting embeddings are hard to interpret, cannot be reused across LMs, and cannot be learned when the LM's gradients are inaccessible (e.g., when using inference-only APIs). Discrete prompts avoid these problems but are difficult to optimize. Prior work typically creates them with "enumeration (e.g., paraphrasing)-then-selection" heuristics that do not explore the prompt space systematically.

The approach trains a parameter-efficient policy network to optimize discrete prompts with RL. After training, the policy network generates the desired prompt; it learns from reward signals rather than from human supervision. This avoids both manual prompt engineering and heuristic enumeration. Because the RL framework needs only reward values and no gradient information from the LM, it also avoids the computational cost of computing gradients through the LM.
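The reward-driven loop described above can be illustrated with a minimal sketch. It is not the paper's implementation: the summary does not say which RL algorithm or network architecture is used, so this example swaps in plain REINFORCE with a batch-mean baseline and a table of per-position logits in place of a policy network. The vocabulary, `black_box_reward`, and all hyperparameters are hypothetical; the reward function merely stands in for a frozen LM that returns scores and never gradients.

```python
import math
import random

# Hypothetical toy vocabulary and prompt length (not from the paper).
VOCAB = ["review", "rate", "absolutely", "movie", "grade", "xyz"]
PROMPT_LEN = 3


def black_box_reward(prompt_tokens):
    """Hypothetical stand-in for a frozen LM: returns a score, never gradients."""
    hidden_best = ["absolutely", "rate", "grade"]
    return float(sum(tok == best for tok, best in zip(prompt_tokens, hidden_best)))


def softmax(logits):
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train(steps=2000, batch=8, lr=0.5, seed=0):
    """REINFORCE over discrete prompts using only scalar rewards."""
    rng = random.Random(seed)
    # Policy parameters: one row of logits per prompt position.
    logits = [[0.0] * len(VOCAB) for _ in range(PROMPT_LEN)]
    for _ in range(steps):
        probs = [softmax(row) for row in logits]
        # Sample a batch of prompts (as token indices) from the current policy.
        samples = [
            [rng.choices(range(len(VOCAB)), weights=p)[0] for p in probs]
            for _ in range(batch)
        ]
        rewards = [black_box_reward([VOCAB[i] for i in s]) for s in samples]
        baseline = sum(rewards) / batch  # batch-mean baseline reduces variance
        for sample, reward in zip(samples, rewards):
            advantage = reward - baseline
            for pos, chosen in enumerate(sample):
                for v in range(len(VOCAB)):
                    # Gradient of log-softmax w.r.t. logit v: one-hot(chosen) - probs.
                    grad_log_prob = (1.0 if v == chosen else 0.0) - probs[pos][v]
                    logits[pos][v] += lr * advantage * grad_log_prob / batch
    return logits


def greedy_prompt(logits):
    """Read out the most likely token at each position."""
    return [VOCAB[max(range(len(VOCAB)), key=row.__getitem__)] for row in logits]
```

Calling `greedy_prompt(train())` returns the prompt the toy policy has settled on. The only information flowing from the "LM" to the policy is the scalar returned by `black_box_reward`, which is why this style of optimization also works against inference-only APIs.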

To address the complexity and stochasticity of reward signals from the large LM environment, the authors introduce reward stabilization, which substantially improves training efficiency. In experiments on few-shot classification and unsupervised text style transfer, RLPrompt outperforms a wide range of fine-tuning and prompting baselines. The optimized prompts are often ungrammatical gibberish rather than readable text, yet they transfer between different LMs while retaining significant performance, which suggests that LM prompting may not follow human language patterns.
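The summary does not spell out which stabilization measures are used. As one plausible illustration, and purely as an assumption, the sketch below applies z-score normalization to rewards grouped by input, so that inputs whose raw rewards sit on very different scales contribute comparable learning signals. The function name `zscore_rewards` and the `eps` parameter are hypothetical.

```python
from statistics import mean, pstdev


def zscore_rewards(rewards_by_input, eps=1e-8):
    """Normalize each input's rewards to zero mean and (roughly) unit variance.

    rewards_by_input maps an input identifier to the raw rewards obtained by
    different sampled prompts on that input. This grouping is an assumption
    made for illustration.
    """
    normalized = {}
    for input_id, rewards in rewards_by_input.items():
        mu = mean(rewards)
        sigma = pstdev(rewards)  # population standard deviation
        # eps guards against division by zero when all rewards are equal.
        normalized[input_id] = [(r - mu) / (sigma + eps) for r in rewards]
    return normalized
```

For example, `zscore_rewards({"a": [1.0, 3.0], "b": [100.0, 300.0]})` maps both inputs to values close to `[-1.0, 1.0]`, removing the hundredfold difference in raw scale.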

The results matter both practically and theoretically, since they offer insight into how learned prompts generalize across models. In particular, the fact that apparently gibberish prompts remain effective invites deeper investigation into the internal mechanisms of LMs and how they respond to prompt input.

Future work could exploit the transferability of prompts further, for example by learning prompts with smaller, computationally cheaper models. This would allow more scalable and adaptable deployment of LMs in real-world applications without massive computational resources.

Overall, the paper lays groundwork for discrete prompt optimization and for future work on efficient, interpretable ways of instructing LMs.
