
Concise Thoughts: Impact of Output Length on LLM Reasoning and Cost

(2407.19825)
Published Jul 29, 2024 in cs.CL and cs.AI

Abstract

Today's LLMs can solve challenging question-answering tasks, and prompt engineering techniques, such as chain-of-thought (CoT), have gained attention for enhancing the explanation and correctness of outputs. Nevertheless, models require significant time to generate answers augmented with lengthy reasoning details. To address this issue, this paper analyzes the impact of output lengths on LLM inference pipelines and proposes novel metrics to evaluate them in terms of correct conciseness. It also examines the impact of controlling output length through a refined prompt engineering strategy, Constrained-CoT (CCoT), which encourages the model to limit output length. Experiments on pre-trained LLMs demonstrated the benefit of the proposed metrics and the effectiveness of CCoT across different models. For instance, constraining the reasoning of LLaMA2-70b to 100 words improves the accuracy from 36.01% (CoT) to 41.07% (CCoT) on the GSM8K dataset, while reducing the average output length by 28 words.

Figure: Response time vs. output length for three LLMs across various datasets and downstream tasks.

Overview

  • Nayab et al. investigate how the length of outputs generated by LLMs influences their inference times and accuracy, introducing the Constrained-Chain-of-Thought (CCoT) technique to promote concise and correct responses.

  • The paper introduces three novel metrics—Hard-$k$ Concise Accuracy (HCA), Soft-$k$ Concise Accuracy (SCA), and Consistent Concise Accuracy (CCA)—to evaluate and improve the conciseness and correctness of LLM outputs.

  • Experimental results on models like Vicuna-13b, Falcon-40b, and Llama2-70b demonstrate that concise prompting can improve both the accuracy and efficiency of LLMs, with the larger models in particular showing significant gains.

Impact of Output Length on LLM Reasoning and Cost

In their paper titled "Concise Thoughts: Impact of Output Length on LLM Reasoning and Cost," Nayab et al. investigate the relationship between the lengths of outputs generated by LLMs and their inference times, emphasizing the need for conciseness to improve both efficiency and accuracy. The authors introduce a refined prompt engineering technique, Constrained-Chain-of-Thought (CCoT), which explicitly encourages models to generate concise answers while preserving the correctness inherent in chain-of-thought (CoT) prompting. This paper provides significant advancements in evaluating and controlling LLM outputs to ensure they are both accurate and efficiently generated.

Key Contributions

Novel Metrics for Conciseness:

The paper introduces three new metrics designed to evaluate the correctness of LLM outputs with an emphasis on their conciseness (a rough computational sketch follows the list):

  1. Hard-$k$ Concise Accuracy (HCA): Measures the fraction of correct outputs that do not exceed a specified length $k$. This metric is useful when strict adherence to length constraints is sought.
  2. Soft-$k$ Concise Accuracy (SCA): Generalizes HCA by introducing a penalty term that decreases exponentially using a decay factor $\alpha$. This metric allows some tolerance for outputs exceeding $k$ slightly.
  3. Consistent Concise Accuracy (CCA): Goes a step further by also considering the variation in the lengths of the outputs. This metric promotes uniformity in the length of the generated responses.
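This summary describes the metrics only informally, so the Python sketch below is one plausible reading rather than the paper's exact formulas: lengths are counted in words, HCA counts correct answers within the budget $k$, SCA downweights correct-but-overlong answers by $e^{-\alpha(\ell - k)}$, and CCA is approximated here by further discounting SCA with a hypothetical spread penalty `beta` on the range of output lengths.

```python
from math import exp

def output_len(text: str) -> int:
    """Output length measured in words, as in the results quoted above."""
    return len(text.split())

def hca(outputs, correct, k):
    """Hard-k Concise Accuracy: fraction of answers that are correct AND
    no longer than k words."""
    hits = sum(1 for o, c in zip(outputs, correct) if c and output_len(o) <= k)
    return hits / len(outputs)

def sca(outputs, correct, k, alpha):
    """Soft-k Concise Accuracy: correct answers longer than k still count,
    but with an exponentially decaying weight exp(-alpha * (length - k))."""
    total = 0.0
    for o, c in zip(outputs, correct):
        if c:
            total += exp(-alpha * max(0, output_len(o) - k))
    return total / len(outputs)

def cca(outputs, correct, k, alpha, beta=0.01):
    """Consistent Concise Accuracy (approximation): SCA further discounted by
    the spread of output lengths, so uniform-length responses score higher.
    The spread penalty and its coefficient beta are assumptions of this sketch."""
    lengths = [output_len(o) for o in outputs]
    spread = max(lengths) - min(lengths)
    return sca(outputs, correct, k, alpha) * exp(-beta * spread)
```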

Constrained-CoT (CCoT) Prompting:

The authors propose a new prompt engineering strategy, CCoT, which modifies the traditional CoT prompt by explicitly requesting the LLM to limit the length of its reasoning. This approach aims to combine the step-by-step correctness afforded by CoT with the efficiency of more concise answers.
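The summary does not reproduce the paper's exact prompt wording, so the snippet below is only an illustration of the idea: a CCoT prompt is a standard zero-shot CoT prompt with an explicit word budget appended (here 100 words, the budget cited for LLaMA2-70b; the question is a made-up placeholder).

```python
# Hypothetical question; any GSM8K-style word problem would do.
question = "A farmer has 12 cows and buys 5 more. How many cows does he have?"

# Standard zero-shot CoT trigger.
cot_prompt = f"Q: {question}\nA: Let's think step by step."

# CCoT: the same prompt plus an explicit word budget for the reasoning.
# The exact phrasing here is illustrative, not the paper's template.
ccot_prompt = (
    f"Q: {question}\n"
    "A: Let's think step by step and limit the answer to 100 words."
)

print(cot_prompt)
print(ccot_prompt)
```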

Experimental Setup and Findings

The authors evaluated several pre-trained LLMs including Vicuna-13b, Falcon-40b, Falcon-7b, Llama2-7b, and Llama2-70b using the GSM8K dataset, which is focused on mathematical problem-solving tasks. Various length constraints were tested to measure the effects on accuracy and inference time.
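A minimal sketch of such a sweep is shown below, assuming a hypothetical `model.generate()` call and a naive answer parser; it is not the authors' harness, but it illustrates how accuracy, output length, and generation time could be measured for each word budget.

```python
import re
import time

WORD_LIMITS = [15, 30, 45, 60, 100]  # illustrative budgets, not the paper's exact grid

def extract_final_answer(text: str) -> str:
    """Naive parser: take the last number in the response as the final answer."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", text)
    return numbers[-1] if numbers else ""

def evaluate(model, dataset, limit=None):
    """Run CoT (limit=None) or CCoT (limit=k words) over (question, gold) pairs,
    returning accuracy, mean output length in words, and mean generation time."""
    hits, lengths, times = 0, [], []
    for question, gold in dataset:
        suffix = "" if limit is None else f" and limit the answer to {limit} words"
        prompt = f"Q: {question}\nA: Let's think step by step{suffix}."
        start = time.perf_counter()
        answer = model.generate(prompt)  # hypothetical generation call
        times.append(time.perf_counter() - start)
        lengths.append(len(answer.split()))
        hits += int(extract_final_answer(answer) == str(gold))
    n = len(dataset)
    return hits / n, sum(lengths) / n, sum(times) / n
```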

Efficiency and Accuracy Gains

The experiments revealed that:

  • LLaMA2-70b and Falcon-40b: Both models showed improvements in accuracy and reduced inference times under CCoT prompting. Specifically, LLaMA2-70b saw an increase in accuracy from 36.01% (CoT) to 41.07% (CCoT-100) while also reducing the average output length.
  • Smaller Models: Models like Falcon-7b and Llama2-7b were less effective in leveraging CCoT and showed mixed results regarding accuracy and inference times.

Output Length Control

Upon reviewing the output length distribution, it was noted that:

  • The LLMs generally produced shorter outputs under CCoT prompting, though not always strictly within the specified limits. This ability to approximate the length constraints while maintaining accuracy shows that the models treat the limit as a soft target rather than a hard cutoff.

Implications and Future Directions

The introduction of metrics such as HCA, SCA, and CCA provides a more nuanced understanding of model performance, extending beyond mere accuracy to include efficiency and consistency in response length. This multifaceted evaluation is pivotal for real-world applications where timely and concise responses are critical.

The findings indicate that restraining output length can be beneficial not just for model efficiency but potentially for accuracy as well. CCoT prompts could feasibly be integrated into fine-tuning processes to make models inherently better at managing their output lengths.

Future research could explore:

  • Integration with Training: Embedding these conciseness metrics into the training regime to cultivate models better adapted to length constraints.
  • Extensions to Other Tasks: Applying CCoT beyond the GSM8K dataset to other traditional NLP tasks, assessing the generalizability of the approach.
  • Mitigating Hallucinations: Analyzing how conciseness impacts the phenomenon of hallucinations, where models generate plausible but incorrect information.

In conclusion, the paper offers substantial advancements in prompt engineering and performance metrics, presenting a balanced approach to improving LLMs' practical applicability in time-sensitive and context-specific deployments.
