Language Models Do Hard Arithmetic Tasks Easily and Hardly Do Easy Arithmetic Tasks (2406.02356v1)

Published 4 Jun 2024 in cs.LG, cs.AI, and cs.CL

Abstract: The ability (and inability) of LLMs to perform arithmetic tasks has been the subject of much theoretical and practical debate. We show that LLMs are frequently able to correctly and confidently predict the first digit of n-digit by m-digit multiplication tasks without using chain of thought reasoning, despite these tasks requiring compounding operations to solve. Simultaneously, LLMs in practice often fail to correctly or confidently predict the last digit of an n-digit by m-digit multiplication, a task equivalent to 1-digit by 1-digit multiplication which can be easily learned or memorized. We show that the latter task can be solved more robustly when the LLM is conditioned on all of the correct higher-order digits, which on average increases the confidence of the correct last digit on 5-digit by 5-digit multiplication tasks using Llama 2-13B by over 230% (0.13 to 0.43) and Mistral-7B by 150% (0.22 to 0.55).

Summary

  • The paper reveals a paradox: LLMs confidently predict the first digit of n-digit by m-digit multiplications without chain-of-thought reasoning, while underperforming on the simpler task of predicting the last digit.
  • The study uses Monte Carlo Dropout to quantify prediction uncertainty; conditioning on the correct higher-order digits raises last-digit confidence by over 230% for Llama 2-13B and 150% for Mistral-7B.
  • The findings suggest that current LLM training may favor computational shortcuts, motivating improvements in model design for reliable multi-step reasoning.

Analysis of LLMs on Arithmetic Tasks

The paper by Gambardella et al. examines the performance of LLMs on arithmetic tasks, presenting findings that challenge conventional expectations of these models' capabilities. While LLMs are known for their broad applicability across diverse language tasks, this work reveals paradoxical behavior on arithmetic tasks such as multiplication.

Key Findings

The research documents a counterintuitive pattern: LLMs struggle to predict the last digit of a multiplication, a task that is theoretically trivial because it reduces to 1-digit by 1-digit multiplication, yet they perform strongly at predicting the first digit of n-digit by m-digit multiplications, a computationally more demanding task, without decomposing the problem into multiple steps. Using Llama 2-13B and Mistral-7B, the paper shows that conditioning on the correct higher-order digits substantially increases prediction confidence for the last digit, by over 230% (0.13 to 0.43) for Llama 2-13B and 150% (0.22 to 0.55) for Mistral-7B on 5-digit by 5-digit tasks.
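
To see why last-digit prediction should be trivial, note that the last digit of a product depends only on the last digits of its operands: (a * b) mod 10 = ((a mod 10) * (b mod 10)) mod 10. A quick self-contained check in Python:

```python
# The last digit of a*b is determined entirely by the last digits of a and b.
import random

for _ in range(1000):
    a = random.randint(10_000, 99_999)  # random 5-digit operand
    b = random.randint(10_000, 99_999)  # random 5-digit operand
    assert (a * b) % 10 == ((a % 10) * (b % 10)) % 10

# Example: 54321 * 12345 ends in (1 * 5) % 10 = 5.
print(54321 * 12345)  # 670592745
```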

Experimental Approach

The authors employ Monte Carlo Dropout (MC Dropout) to quantify the uncertainty in LLM predictions, interpreting dropout-equipped LLMs as approximate Bayesian neural networks to gauge confidence during arithmetic computation. The experimental framework evaluates several models across unconditional and conditional number-generation tasks, with ablations over digit lengths to assess generalization across different scales of arithmetic complexity.
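
A minimal sketch of the MC Dropout confidence estimate, assuming a HuggingFace-style causal LM; the helper name and `n_samples` value are illustrative, not the paper's exact configuration:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "mistralai/Mistral-7B-v0.1"  # one of the models evaluated in the paper
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)
# NOTE: many pretrained configs set dropout rates to 0; MC Dropout requires a
# nonzero dropout rate in the config for the forward passes to be stochastic.

def mc_dropout_confidence(prompt: str, digit: str, n_samples: int = 20) -> float:
    """Mean probability assigned to `digit` as the next token, averaged over
    stochastic forward passes with dropout left active (treating the dropout
    model as an approximate Bayesian neural network)."""
    model.train()  # keep dropout layers active at inference time
    ids = tok(prompt, return_tensors="pt").input_ids
    digit_id = tok(digit, add_special_tokens=False).input_ids[-1]  # simplification
    probs = []
    with torch.no_grad():
        for _ in range(n_samples):
            logits = model(input_ids=ids).logits[0, -1]  # next-token logits
            probs.append(torch.softmax(logits, dim=-1)[digit_id].item())
    model.eval()
    return sum(probs) / len(probs)
```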

Discussion and Implications

The paper provides insight into LLMs' computational shortcuts, which may reflect training dynamics in which gradient descent favors apparent "shortcuts." The divergence between theoretical computational expectations and empirical behavior invites inquiry into LLMs' internal processes and reasoning. The findings suggest that the difficulty with last-digit prediction stems from the nature of autoregressive generation, in which errors made on earlier digits compound through the rest of the string. These implications matter for model reliability in applications requiring higher-order arithmetic reasoning or multi-step logical chaining.
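
To make the conditioning intervention concrete, here is a sketch (the prompt format is illustrative, not the paper's exact template) contrasting unconditional generation, where early-digit errors compound autoregressively, with conditional generation, where the correct higher-order digits are supplied and only the final 1-digit by 1-digit subproblem remains:

```python
a, b = 54321, 12345
product = str(a * b)  # "670592745"

# Unconditional: the model must generate every digit of the product, so an
# error in any early (high-order) digit propagates through the rest of the
# autoregressive generation.
unconditional_prompt = f"{a} * {b} = "

# Conditional: all correct higher-order digits are provided, leaving only the
# last digit -- the 1-digit by 1-digit subproblem -- to be predicted.
conditional_prompt = f"{a} * {b} = {product[:-1]}"
```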

Future Directions

The paper points to a need for broader research into the properties of LLMs that give rise to these computational discrepancies, suggesting architectural designs that better capture elementary computational tasks. It also hints at applications in hallucination detection: leveraging distinctions in internal states could improve prediction accuracy or, at least, flag potential errors.

Overall, the research provides a critical touchstone for evaluating LLM capabilities beyond general language tasks, emphasizing the nuances of their computational strengths and limits. As neural network research progresses, studies like this one will help steer model training paradigms and architecture design, particularly as open-weight models become more available for scientific exploration.
