Query-Dependent Prompt Evaluation and Optimization with Offline Inverse RL

Published 13 Sep 2023 in cs.CL, cs.AI, and cs.LG | (2309.06553v4)

Abstract: In this study, we aim to enhance the arithmetic reasoning ability of LLMs through zero-shot prompt optimization. We identify a previously overlooked objective of query dependency in such optimization and elucidate two ensuing challenges that impede the successful and economical design of prompt optimization techniques. One primary issue is the absence of an effective method to evaluate prompts during inference when the golden answer is unavailable. Concurrently, learning via interactions with the LLMs to navigate the expansive natural language prompting space proves to be resource-intensive. To address this, we introduce Prompt-OIRL, which harnesses offline inverse reinforcement learning to draw insights from offline prompting demonstration data. Such data exists as by-products when diverse prompts are benchmarked on open-accessible datasets. With Prompt-OIRL, the query-dependent prompt optimization objective is achieved by first learning an offline reward model. This model can evaluate any query-prompt pairs without accessing LLMs. Subsequently, a best-of-N strategy is deployed to recommend the optimal prompt. Our experimental evaluations across various LLM scales and arithmetic reasoning datasets underscore both the efficacy and economic viability of the proposed approach.

Abstract PDF Upgrade to Chat

Citations (18)

View on Semantic Scholar

Summary

The paper introduces Prompt-OIRL, a method that applies offline inverse reinforcement learning for query-dependent prompt optimization to enhance LLM arithmetic reasoning.
It leverages offline prompt demonstrations and a learned reward model to select the most effective prompt without expensive online evaluations.
Experiments on GSM8K, MAWPS, and SVAMP demonstrate that Prompt-OIRL significantly outperforms query-agnostic methods by improving efficiency and accuracy.

This paper introduces Prompt-OIRL, a novel method for improving the arithmetic reasoning abilities of LLMs through zero-shot prompt optimization. The core idea is to move beyond finding a single, universally best prompt (query-agnostic) and instead optimize prompts based on the specific query being asked (query-dependent).

The authors identify two main challenges with traditional prompt optimization:

Inference Time Evaluation is Hard: Determining the best prompt for a new query during inference is difficult because the correct answer (ground truth) is usually unavailable. Simply trying multiple prompts and generating answers doesn't reveal which answer is correct without extra effort.
Online Prompt Evaluation and Optimization is Expensive: Searching for optimal prompts by interacting with LLMs (online evaluation) is resource-intensive due to the cost of API calls or computational resources, especially given the vast space of possible natural language prompts.

To address these challenges and achieve query-dependent optimization, Prompt-OIRL utilizes Offline Inverse Reinforcement Learning (Offline IRL). The process involves three main steps:

Leveraging Offline Prompt Demonstrations: The method uses existing datasets that are often generated as by-products when researchers benchmark different prompting strategies (like Zero-shot CoT, APE, etc.) on standard tasks (e.g., arithmetic reasoning datasets). These datasets contain (query, prompt, success_label) tuples, where the success label indicates if the prompt led the LLM to the correct answer for that query. This data captures the "preference" of the LLM for certain prompts on certain queries.
Offline Reward Modeling (Inverse RL): An offline reward model, denoted as $\Upsilon_\theta(x, \pi(x))$ , is trained on this demonstration dataset. This model learns to predict the probability that a given prompt $\pi$ will lead to a correct answer for a specific query $x$ , without needing to interact with the LLM or know the ground-truth answer $y^*$ . It takes embeddings of the query and prompt as input. The training uses supervised learning, like minimizing Cross-Entropy loss against the observed success labels $r^{(i,k)}$ from the dataset:

$\mathcal{L}_{\text{CE}} = -\mathbb{E} [ r^{(i,k)} \log \sigma(\Upsilon^{(i,k)}_{\theta}) + (1-r^{(i,k)}) \log(1-\sigma(\Upsilon^{(i,k)}_{\theta})) ]$

The paper finds that gradient boosting models (like XGBoost) work well for $\Upsilon_\theta$ . This learned reward model effectively solves Challenge 1 by providing an offline evaluation mechanism.
Offline Prompt Optimization: During inference, given a new query $x$ , the learned reward model $\Upsilon_\theta$ is used to evaluate a set of candidate prompts $\pi$ . The prompt predicted to have the highest success probability is selected:

$\pi^* = \arg\max_\pi \Upsilon_\theta(x, \pi(x))$

The paper uses a simple "best-of-N" strategy, where N candidate prompts (including known effective prompts and potentially newly generated ones) are evaluated using $\Upsilon_\theta$ , and the best one is chosen to send to the LLM. This step solves Challenge 2 by optimizing the prompt choice offline, minimizing costly LLM interactions during inference.

Experiments were conducted on arithmetic reasoning datasets (GSM8K, MAWPS, SVAMP) using various LLMs (GPT-3.5-turbo, LLaMA-2-7B-Chat, TigerBot-13B-Chat). The results demonstrate that:

Prompt-OIRL successfully achieves the query-dependent objective, significantly improving performance over using the single best prompt found during training (query-agnostic) or using LLM's self-reported confidence.
The learned reward model accurately predicts prompt success offline, outperforming LLM self-criticism baselines in evaluation accuracy and precision.
The approach is highly cost-effective compared to methods requiring online LLM interactions for evaluating multiple prompt candidates at inference time.

In conclusion, Prompt-OIRL offers a practical and efficient framework for optimizing prompts at a query-dependent level by leveraging offline demonstration data and inverse reinforcement learning to train a reward model for offline evaluation and selection.

Markdown