Guiding Large Language Models via Directional Stimulus Prompting (2302.11520v4)
Abstract: We introduce Directional Stimulus Prompting, a novel framework for guiding black-box LLMs toward specific desired outputs. Instead of directly adjusting LLMs, our method employs a small tunable policy model (e.g., T5) to generate an auxiliary directional stimulus prompt for each input instance. These directional stimulus prompts act as nuanced, instance-specific hints and clues to guide LLMs in generating desired outcomes, such as including specific keywords in the generated summary. Our approach sidesteps the challenges of direct LLM tuning by optimizing the policy model to explore directional stimulus prompts that align LLMs with desired behaviors. The policy model can be optimized through 1) supervised fine-tuning using labeled data and 2) reinforcement learning from offline or online rewards based on the LLM's output. We assess our method across summarization, dialogue response generation, and chain-of-thought reasoning tasks. Our experiments demonstrate that the framework consistently improves the performance of LLMs (e.g., ChatGPT, Codex, InstructGPT) on these supervised tasks using minimal labeled data. Notably, using just 80 dialogues on the MultiWOZ dataset, our approach enhances ChatGPT's performance by an impressive 41.4%, matching or surpassing some fully supervised state-of-the-art models. Additionally, the instance-specific chain-of-thought prompt generated by our approach improves InstructGPT's reasoning accuracy compared to human-crafted or automatically generated prompts. The code and data are publicly available at \url{https://github.com/Leezekun/Directional-Stimulus-Prompting}.
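To make the pipeline concrete, below is a minimal sketch of the inference loop in Python: a small T5 policy model proposes instance-specific keyword hints (the directional stimulus), which are injected into the frozen black-box LLM's prompt. The `extract keywords:` task prefix, the `call_llm` stub, and the word-overlap proxy reward are illustrative assumptions for this sketch, not the authors' released code.

```python
# Minimal sketch of Directional Stimulus Prompting for summarization.
# Assumptions (not from the paper's released code): the "extract keywords:"
# task prefix, the call_llm stub, and the word-overlap proxy reward.
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
policy = T5ForConditionalGeneration.from_pretrained("t5-small")  # small tunable policy

def generate_stimulus(article: str, max_new_tokens: int = 32) -> str:
    """Policy model proposes instance-specific hint keywords (the stimulus)."""
    inputs = tokenizer("extract keywords: " + article,
                       return_tensors="pt", truncation=True, max_length=512)
    ids = policy.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=True)
    return tokenizer.decode(ids[0], skip_special_tokens=True)

def build_prompt(article: str, stimulus: str) -> str:
    """Inject the stimulus as a hint into the frozen black-box LLM's prompt."""
    return (f"Article: {article}\n"
            f"Write a summary that covers these keywords: {stimulus}\n"
            f"Summary:")

def call_llm(prompt: str) -> str:
    """Stub for the black-box LLM call (e.g., ChatGPT via an API client)."""
    raise NotImplementedError

def reward(summary: str, reference: str) -> float:
    """Word-overlap proxy for the paper's metric-based rewards (e.g., ROUGE)."""
    hyp, ref = set(summary.lower().split()), set(reference.lower().split())
    return len(hyp & ref) / max(len(ref), 1)
```

One optimization step, conceptually: sample a stimulus for an input, query the frozen LLM with `build_prompt`, score the output against a reference with `reward`, and use that scalar to update only the policy model; per the abstract, the paper does this via supervised fine-tuning followed by reinforcement learning (e.g., PPO), leaving the LLM itself untouched.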