Are NLP Models really able to Solve Simple Math Word Problems? (2103.07191v2)

Published 12 Mar 2021 in cs.CL

Abstract: The problem of designing NLP solvers for math word problems (MWP) has seen sustained research activity and steady gains in the test accuracy. Since existing solvers achieve high performance on the benchmark datasets for elementary level MWPs containing one-unknown arithmetic word problems, such problems are often considered "solved" with the bulk of research attention moving to more complex MWPs. In this paper, we restrict our attention to English MWPs taught in grades four and lower. We provide strong evidence that the existing MWP solvers rely on shallow heuristics to achieve high performance on the benchmark datasets. To this end, we show that MWP solvers that do not have access to the question asked in the MWP can still solve a large fraction of MWPs. Similarly, models that treat MWPs as bag-of-words can also achieve surprisingly high accuracy. Further, we introduce a challenge dataset, SVAMP, created by applying carefully chosen variations over examples sampled from existing datasets. The best accuracy achieved by state-of-the-art models is substantially lower on SVAMP, thus showing that much remains to be done even for the simplest of the MWPs.

Summary

The paper demonstrates that NLP models achieve high accuracy on simple math word problems by exploiting shallow heuristics rather than true mathematical understanding.
It reveals that models can correctly answer questions by leveraging superficial text features even when the actual mathematical query is removed from the input.
On the more robust SVAMP challenge dataset, state-of-the-art models score substantially lower, highlighting the need for models with deeper semantic and mathematical reasoning.

Analysis of NLP Models on Simple Math Word Problems

This paper critically examines how well current NLP models solve elementary-level math word problems (MWPs). The authors restrict their attention to English MWPs taught in grades four and lower and argue that, despite high accuracy on existing benchmarks, solvers often rely on shallow heuristics rather than genuine mathematical reasoning.

Key Findings

The paper provides strong evidence that solvers succeed on simple MWPs by exploiting shallow patterns in the benchmark data. Notably, solvers that never see the question asked in the MWP can still solve a large fraction of problems, which indicates reliance on superficial cues in the problem body. Similarly, models that treat an MWP as a bag of words, and so discard word order, still achieve surprisingly high accuracy.
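
The two input ablations described above can be sketched as follows. This is only an illustration, not the authors' preprocessing: the example problem is invented, the helper names `strip_question` and `bag_of_words` are hypothetical, and the sketch assumes the question is always the final sentence of the problem.

```python
import re
from collections import Counter


def strip_question(problem: str) -> str:
    """Return the problem body without its last sentence.

    Assumption for this sketch: the question is the final sentence.
    """
    sentences = re.split(r"(?<=[.?!])\s+", problem.strip())
    return " ".join(sentences[:-1])


def bag_of_words(problem: str) -> Counter:
    """Return lowercased token counts; word order is discarded."""
    return Counter(re.findall(r"[a-z0-9]+", problem.lower()))


# Invented example problem, not taken from any benchmark.
problem = (
    "Jack had 8 pens. Mary gave him 5 more pens. "
    "How many pens does Jack have now?"
)

body_only = strip_question(problem)  # what a question-removed solver sees
bow = bag_of_words(problem)          # what a bag-of-words solver sees
```

A solver trained on `body_only` inputs cannot know what is being asked, and a solver trained on `bow` cannot tell "Jack gave Mary" from "Mary gave Jack". High accuracy under either ablation therefore suggests that the benchmark can be answered from surface cues.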

Introduction of SVAMP

To address these concerns, the paper introduces a challenge dataset, SVAMP, created by applying carefully chosen variations to examples sampled from existing datasets, so that the shallow patterns solvers exploit become less effective. The best accuracy achieved by state-of-the-art models is substantially lower on SVAMP, which shows that much remains to be done even for the simplest MWPs.
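
To make the idea of a variation concrete, the sketch below pits a deliberately shallow toy solver against an invented problem and an invented variation of it. Neither problem is taken from SVAMP, and `keyword_heuristic` is a hypothetical stand-in for a shallow heuristic, not any solver evaluated in the paper.

```python
import re


def keyword_heuristic(problem: str) -> float:
    """Toy shallow solver (hypothetical): add the first two numbers
    if the word 'more' appears anywhere, otherwise subtract them."""
    a, b = (float(x) for x in re.findall(r"\d+(?:\.\d+)?", problem)[:2])
    return a + b if "more" in problem.lower() else a - b


def accuracy(solver, examples) -> float:
    """Fraction of (problem, answer) pairs the solver answers exactly."""
    return sum(solver(q) == ans for q, ans in examples) / len(examples)


# Invented original problem and its correct answer.
original = [(
    "Jack had 8 pens. Mary gave him 5 more pens. "
    "How many pens does Jack have now?",
    13.0,
)]

# Invented variation: same body, different question, different answer.
variation = [(
    "Jack had 8 pens. Mary gave him 5 more pens. "
    "How many pens did Jack have before Mary gave him more?",
    8.0,
)]

acc_original = accuracy(keyword_heuristic, original)
acc_variation = accuracy(keyword_heuristic, variation)
```

The toy solver answers the original correctly but gives the same answer to the variation, where it is wrong, because it never reads the question. A dataset built from such variations penalizes this kind of shortcut.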

Implications for Research and Practice

Benchmark Limitations: This paper reveals potential deficiencies in current benchmarks, urging the community to question the "solved" status of elementary MWPs and reconsider the efficacy of existing datasets in evaluating true model understanding.
Model Development: The findings motivate the development of models that go beyond pattern recognition and incorporate deeper semantic and mathematical reasoning capabilities.
Dataset Diversity: The creation of SVAMP points to an essential need for dataset diversity that challenges models in nuanced ways, making them less prone to exploiting data artifacts.

Future Directions

The authors suggest that further investigations should address:

Enhancing model architectures to focus on reasoning and critical understanding rather than pattern recognition.
Expanding the scope and complexity of datasets in controlled manners that push models towards genuine comprehension.
Exploring alternative evaluation metrics that can capture a model’s deeper understanding beyond mere accuracy.

In conclusion, this paper points out the critical gap in current models’ ability to robustly solve even simple MWPs, suggesting a reconsideration of both algorithmic approaches and benchmark datasets within the research community. The introduction of SVAMP serves as a valuable tool for future research aiming at more reliable evaluation of NLP models' capabilities in problem-solving contexts.
