- The paper demonstrates that alternating between prompt optimization and weight fine-tuning substantially improves NLP pipeline performance, with relative accuracy gains of up to 78% on some task and model combinations.
- It builds on the DSPy framework, which decomposes LM pipelines into modules with learnable prompts and weights, addressing the challenge that labels for intermediate stages are unavailable.
- Experimental evaluations on datasets like HotPotQA and GSM8K confirm that combined strategies outperform both prompt-only and weight-only optimization methods.
Fine-Tuning and Prompt Optimization: Two Great Steps that Work Better Together
This paper addresses the problem of optimizing multi-stage NLP pipelines composed of multiple language models (LMs) and prompting strategies. It proposes alternating between fine-tuning model weights and optimizing prompts to improve pipeline performance, especially in scenarios where gold labels for intermediate steps are unavailable.
Introduction
The paper highlights the growing complexity of NLP systems, driven by the integration of multiple LMs into pipeline architectures for tasks such as retrieval-augmented generation and multi-hop reasoning. These LM programs let researchers modularize a task so that each model focuses on a simpler subtask, improving overall performance. The authors build on the DSPy framework, which defines a program as a function Φ composed of language modules M, where each module pairs a natural-language prompt template π with LM weights θ. The optimization task is to maximize the expected performance of Φ by updating each module's π and θ.
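To make the setup concrete, here is a minimal sketch of a two-module program in DSPy; the signatures and retrieval depth are illustrative, not the paper's exact pipeline:

```python
import dspy

class MultiHopQA(dspy.Module):
    """A two-stage LM program: each sub-module carries its own
    prompt template (pi) and runs on an LM with weights (theta)."""

    def __init__(self):
        super().__init__()
        self.retrieve = dspy.Retrieve(k=3)
        # Illustrative signatures; an optimizer may rewrite each
        # module's prompt, and the underlying LM's weights may be
        # fine-tuned.
        self.generate_query = dspy.ChainOfThought("context, question -> search_query")
        self.generate_answer = dspy.ChainOfThought("context, question -> answer")

    def forward(self, question):
        context = self.retrieve(question).passages
        hop = self.generate_query(context=context, question=question)
        context += self.retrieve(hop.search_query).passages
        return self.generate_answer(context=context, question=question)
```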
Problem Statement
The optimization problem centers on a challenging scenario in which labels for intermediate stages are absent. With the NLP system formalized as modular LM invocations, the task is to find the prompt templates Π and LM weights Θ that maximize performance over a training set X of input–metadata pairs (x, m) under an evaluation metric μ. The objective is expressed as:
$$\operatorname*{argmax}_{\Theta,\,\Pi}\ \frac{1}{|X|}\sum_{(x,\,m)\in X}\mu\!\left(\Phi_{\langle\Theta,\Pi\rangle}(x),\,m\right)$$
This formulation is challenging due to the non-differentiability of Φ and the lack of intermediate labels.
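In code, the empirical objective is just the average metric over the training pairs. A minimal sketch, where `program`, `trainset`, and `mu` are placeholder names:

```python
def average_metric(program, trainset, mu):
    """Empirical objective: mean of mu over (input, metadata) pairs.

    program : an LM pipeline Phi with fixed prompts Pi and weights Theta
    trainset: iterable of (x, m) pairs, where m holds whatever the
              metric needs (e.g., a gold final answer)
    mu      : metric on the final output only -- no labels for
              intermediate stages are required
    """
    scores = [mu(program(x), m) for (x, m) in trainset]
    return sum(scores) / len(scores)
```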
Alternating Prompt and Weight Optimization
The proposed algorithm, dubbed BetterTogether, alternates between optimizing prompts and fine-tuning weights. First, the prompts are optimized using bootstrapped program traces; the weights are then fine-tuned on traces collected under these optimized prompts; finally, a second round of prompt optimization adapts the prompts to the fine-tuned model. This alternating strategy is hypothesized to leverage the complementary strengths of prompt optimization (generating better training data) and weight optimization (enhancing model capability).
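Schematically, the alternation is a three-step recipe. The sketch below uses DSPy-style optimizers (BootstrapFewShotWithRandomSearch for prompts, BootstrapFinetune for weights); treating these as the paper's exact optimizer configuration is an assumption:

```python
from dspy.teleprompt import BootstrapFewShotWithRandomSearch, BootstrapFinetune

def better_together(program, trainset, mu):
    """Prompt -> weight -> prompt alternation (schematic).

    Step 1: optimize prompts by bootstrapping few-shot traces.
    Step 2: fine-tune LM weights on traces collected under the
            optimized prompts.
    Step 3: re-optimize prompts against the fine-tuned LM.
    """
    optimize_prompts = BootstrapFewShotWithRandomSearch(metric=mu)
    finetune_weights = BootstrapFinetune(metric=mu)

    program = optimize_prompts.compile(program, trainset=trainset)
    program = finetune_weights.compile(program, trainset=trainset)
    program = optimize_prompts.compile(program, trainset=trainset)
    return program
```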
Experimental Evaluation
The experimental evaluation tests this hypothesis on HotPotQA (multi-hop question answering), GSM8K (math word problems), and Iris (classification), across a variety of LMs including mistral-7b-instruct-v0.2 and llama variants. The paper compares strategies such as:
- Vanilla zero-shot
- Prompt-only optimization
- Weight-only optimization
- Sequential permutations of prompt and weight optimization
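These recipes can be read as permutations of two primitive steps, prompt optimization ("p") and weight fine-tuning ("w"), applied left to right. A hypothetical encoding of the grid:

```python
# Hypothetical names for the compared optimization recipes.
STRATEGIES = {
    "vanilla":        [],              # zero-shot, no optimization
    "prompt_only":    ["p"],
    "weights_only":   ["w"],
    "prompt_weights": ["p", "w"],
    "weights_prompt": ["w", "p"],
    "p_w_p":          ["p", "w", "p"], # the BetterTogether recipe
}
```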
The experiments consistently show that strategies employing both prompt optimization and weight fine-tuning outperform the single-method baselines, achieving relative accuracy improvements of up to 78% for some task and model combinations.
Results and Discussion
The results strongly suggest that alternating between prompt and weight optimization harnesses the complementary strengths of the two techniques. Notably, although fine-tuning is computationally intensive, combining it with prompt optimization yields substantial gains in LM performance. This underscores the value of combined strategies for multi-module NLP systems, where modular optimization offers clear benefits.
Limitations
The authors acknowledge that while the results are promising, they are based on specific LM architectures and tasks and may not generalize without further verification. Moreover, the paper's weight optimization is restricted to LoRA-based fine-tuning, and other fine-tuning mechanisms might yield different outcomes.
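For context, LoRA fine-tuning trains small low-rank adapter matrices rather than the full model weights. A minimal sketch using Hugging Face peft, with hyperparameters that are illustrative rather than the paper's:

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Base model matching one of the paper's LMs.
model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")

lora_config = LoraConfig(
    r=16,                                 # rank of the low-rank update
    lora_alpha=32,                        # scaling factor for the update
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    lora_dropout=0.05,
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable
```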
Conclusion
The research demonstrates that systematic alternation between prompt optimization and weight fine-tuning, as implemented in the DSPy framework, offers a powerful approach to improving NLP pipeline performance in label-scarce settings. These findings have implications for the broader application of LMs in complex AI systems and encourage continued exploration and deployment of these complementary optimization strategies in NLP.