Positional Description Matters for Transformers Arithmetic

(2311.14737)
Published Nov 22, 2023 in cs.CL, cs.AI, and cs.LG

Abstract

Transformers, central to the successes in modern Natural Language Processing, often falter on arithmetic tasks despite their vast capabilities, which paradoxically include remarkable coding abilities. We observe that a crucial challenge is their naive reliance on positional information to solve arithmetic problems with a small number of digits, leading to poor performance on larger numbers. Herein, we delve deeper into the role of positional encoding and propose several ways to fix the issue, either by modifying the positional encoding directly, or by modifying the representation of the arithmetic task to leverage standard positional encoding differently. We investigate the value of these modifications for three tasks: (i) classical multiplication, (ii) length extrapolation in addition, and (iii) addition in a natural language context. For (i) we train a small model on a small dataset (100M parameters and 300k samples) with remarkable aptitude in (direct, no scratchpad) 15-digit multiplication, essentially perfect up to 12 digits, while usual training in this context would give a model failing at 4-digit multiplication. In the experiments on addition, we use a mere 120k samples to demonstrate: for (ii) extrapolation from training on 10-digit numbers to testing on 12-digit numbers, while usual training would show no extrapolation, and for (iii) almost perfect accuracy up to 5 digits, while usual training would be correct only up to 3 digits (which is essentially memorization with a training set of 120k samples).

Figure: Accuracy of models on repeated data in digit reversion, trained from random initial weights.

Overview

  • Transformers struggle with arithmetic, particularly with tasks such as multiplication and addition, often failing to accurately process large numbers.

  • The study identifies transformers' reliance on absolute positional encoding as a key obstacle and explores modifications in encoding and number representation to improve performance.

  • Experiments with a small model (100M parameters) and modest datasets yielded essentially perfect direct multiplication up to 12 digits, length extrapolation from 10-digit training to 12-digit test addition, and near-perfect accuracy up to 5-digit addition in a natural language context.

  • The researchers introduced new techniques, such as inserting random spaces into the input and a randomized positional embedding, to help transformers generalize arithmetic beyond the lengths seen in training.

  • Integrating arithmetic skills with language tasks in transformers is challenging, but the study found improved integration through methods such as random spacing and alternative positional encodings.

Transformers, like those underlying GPT-4, have transformed natural language processing, yet they stumble when performing arithmetic tasks. This research explores the inherent difficulties transformers face with arithmetic, particularly multiplication and addition, and offers innovative solutions to enhance their computational capabilities.

The core obstacle identified is transformers' reliance on absolute positional encoding, which can impede their ability to accurately process larger numbers beyond the training set's scope. The study suggests modifying the positional encoding or the representation of arithmetic itself to tackle this problem.

Encouragingly, with relatively small models (100M parameters) and modest dataset sizes (300k samples for multiplication, 120k for addition), significant strides were made. By altering the positional encoding with randomized embedding and adapting how numbers are presented in the dataset (e.g., reversing digits or adding zero-padding), the models achieved essentially perfect multiplication up to 12 digits and markedly better addition, both in length extrapolation and in natural language contexts.
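
To make the representation change concrete, here is a minimal Python sketch of this kind of formatting, assuming zero-padding to a fixed width and writing digits least-significant first; the function name, separator tokens, and exact widths are illustrative rather than the paper's verbatim scheme.

```python
# Illustrative sketch (not the paper's exact data format): zero-pad the
# operands and the result to a fixed width and reverse the digit order,
# so each output digit sits at a fixed relative offset from its inputs.
import random

def format_addition_sample(a: int, b: int, width: int = 13) -> str:
    a_rev = str(a).zfill(width)[::-1]      # least-significant digit first
    b_rev = str(b).zfill(width)[::-1]
    sum_rev = str(a + b).zfill(width)[::-1]
    return f"{a_rev}+{b_rev}={sum_rev}"

random.seed(0)
a, b = random.randrange(10**12), random.randrange(10**12)
print(format_addition_sample(a, b))
```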

The researchers also focused on two main aspects to help transformers generalize beyond trained lengths in arithmetic tasks. The first involved diminishing reliance on absolute positions by adding random spaces or repeating information in data sequences; the second probed alternative positional encodings. The outcome was a new positional-encoding method, randomized embedding, which effectively improved length generalization.
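
The random-spaces idea can be sketched as a simple preprocessing step; the insertion probability and the per-character rule below are assumptions made for illustration, and the randomized-embedding method itself is a learned component not reproduced here.

```python
# Sketch of random space insertion: scatter spaces through each training
# sequence so that the absolute index of a digit carries no stable signal.
# The probability p and the insertion rule are illustrative choices.
import random

def insert_random_spaces(text: str, p: float = 0.3) -> str:
    out = []
    for ch in text:
        if random.random() < p:
            out.append(" ")    # shifts the absolute positions of what follows
        out.append(ch)
    return "".join(out)

random.seed(0)
print(insert_random_spaces("3141+2718=5859"))
# Prints the same expression with spaces scattered at random positions.
```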

Lastly, the paper addressed the integration of arithmetic within natural language contexts. Mixing arithmetic data directly with natural language data posed challenges due to differing formats and dependencies. By applying methods such as random spacing and alternative positional encodings, the research demonstrated that transformers can integrate arithmetic skills into language tasks more effectively.
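
As a rough illustration of mixing the two data types, the snippet below wraps an addition example in a hypothetical question-and-answer template and applies the same random spacing to the numeric span; the template wording and helper names are invented for this sketch.

```python
# Hypothetical mixing of arithmetic and natural-language data; only the
# random-spacing trick mirrors the text above, the template is invented.
import random

def spaced(text: str, p: float = 0.3) -> str:
    return "".join((" " if random.random() < p else "") + ch for ch in text)

def nl_addition_sample(a: int, b: int) -> str:
    expr = spaced(f"{a}+{b}={a + b}")
    return f"Question: what is {a} plus {b}? Answer: {expr}"

random.seed(1)
print(nl_addition_sample(12345, 67890))
```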

In all, the study's results are noteworthy: they demonstrate that even modest-sized transformer models can tackle complex arithmetic tasks, hinting at future directions for integrating numeracy into language models more seamlessly. While the research did not achieve perfect accuracy on very large numbers, it lays a foundation for further innovation in this area. Additionally, the proposed techniques have implications for a broader range of tasks beyond arithmetic, suggesting avenues for future exploration into how position information is encoded in neural networks.
