
Abstract

LLMs have demonstrated impressive capabilities in mathematical problem solving, particularly in single-turn question-answering formats. However, real-world scenarios often involve mathematical question answering that requires multi-turn or interactive information exchanges, and the performance of LLMs on these tasks is still underexplored. This paper introduces MathChat, a comprehensive benchmark specifically designed to evaluate LLMs across a broader spectrum of mathematical tasks. These tasks are structured to assess the models' abilities in multi-turn interaction and open-ended generation. We evaluate the performance of various SOTA LLMs on the MathChat benchmark, and we observe that while these models excel in single-turn question answering, they significantly underperform in more complex scenarios that require sustained reasoning and dialogue understanding. To address these limitations of existing LLMs on multi-turn and open-ended tasks, we develop MathChatSync, a synthetic, dialogue-based math dataset for LLM fine-tuning, focused on improving models' interaction and instruction-following capabilities in conversations. Experimental results emphasize the need to train LLMs with diverse, conversational instruction-tuning datasets such as MathChatSync. We believe this work outlines one promising direction for improving the multi-turn mathematical reasoning abilities of LLMs, pushing forward the development of LLMs that are more adept at interactive mathematical problem solving and real-world applications.

Figure: Examples of the four tasks in the MathChat benchmark, showcasing user-assistant dialogue and LLM-generated responses.

Overview

  • MathChat is a benchmark designed to evaluate LLMs on multi-turn mathematical reasoning and instruction-following tasks, addressing an under-explored area in current LLM capabilities.

  • MathChat comprises four tasks: Follow-up QA, Error Correction, Error Analysis, and Problem Generation, each assessing different aspects of multi-turn interactions and reasoning.

  • The study finds that while math-specific LLMs excel in single-turn QA, they struggle with multi-turn tasks. Performance improves with supervised fine-tuning and diversified training data, highlighting the need for more robust dialogue datasets.

A Comprehensive Overview of MathChat: Benchmarking LLMs in Multi-Turn Mathematical Reasoning

Introduction

The paper presents MathChat, a pivotal benchmark designed to evaluate LLMs on multi-turn mathematical reasoning and instruction-following tasks. While existing math-specific LLMs excel in single-turn question-answering (QA) formats, their performance on more complex, multi-turn interactions remains under-explored. MathChat aims to bridge this gap by assessing LLMs through a variety of tasks inspired by real-world applications requiring sustained dialogue and nuanced understanding.

MathChat Benchmark

Structure and Tasks

MathChat introduces a suite of four tasks to challenge LLMs in multi-turn interactions (a minimal sketch of how these tasks might be framed as chat prompts follows the list):

  1. Follow-up QA: This task extends single-turn QA into multi-round dialogues. It evaluates an LLM's capacity to engage in deeper reasoning across consecutive questions related to the initial problem.
  2. Error Correction: In this task, the model is presented with an incorrect solution and is required to correct it. This tests an LLM’s ability to understand and rectify errors based on previous responses.
  3. Error Analysis: Distinct from Error Correction, this task requires the model to identify the errors in a provided solution and explain why they occurred, rather than simply produce a corrected answer. This evaluates critical thinking and diagnostic abilities.
  4. Problem Generation: Tasking the LLM to generate new problems and solutions based on given problem-solution pairs, this assesses an LLM's creativity and ability to generate educational content.
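
The benchmark's exact data format is not reproduced here, but the following Python sketch illustrates how the four task types above might be framed as multi-turn chat prompts. The message schema, field names, and the toy arithmetic problem are illustrative assumptions, not MathChat's actual schema.

```python
# Hypothetical sketch: framing the four MathChat task types as multi-turn chat prompts.
# The schema and the example problem are illustrative, not the benchmark's real format.
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class MultiTurnExample:
    task: str                                          # e.g. "follow_up_qa"
    messages: List[Dict[str, str]] = field(default_factory=list)

    def add(self, role: str, content: str) -> "MultiTurnExample":
        self.messages.append({"role": role, "content": content})
        return self


seed_problem = "A shop sells pens at $2 each. How much do 7 pens cost?"
seed_solution = "7 * 2 = 14, so the pens cost $14."
wrong_solution = "7 * 2 = 12, so the pens cost $12."    # deliberately incorrect

follow_up = (
    MultiTurnExample("follow_up_qa")
    .add("user", seed_problem)
    .add("assistant", seed_solution)
    .add("user", "If the shop gives a $3 discount on orders over $10, what is the new total?")
)

error_correction = (
    MultiTurnExample("error_correction")
    .add("user", seed_problem)
    .add("assistant", wrong_solution)
    .add("user", "The solution above is incorrect. Please provide a corrected solution.")
)

error_analysis = (
    MultiTurnExample("error_analysis")
    .add("user", seed_problem)
    .add("assistant", wrong_solution)
    .add("user", "Identify which step is wrong and explain the mistake.")
)

problem_generation = (
    MultiTurnExample("problem_generation")
    .add("user", f"Here is a problem and its solution:\n{seed_problem}\n{seed_solution}\n"
                 "Write a new problem of similar difficulty, with a full solution.")
)

for ex in (follow_up, error_correction, error_analysis, problem_generation):
    print(ex.task, "->", len(ex.messages), "turns")
```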

Evaluation and Performance Metrics

Performance metrics in MathChat vary by task. For the problem-solving tasks (Follow-up QA and Error Correction), performance is measured by the accuracy of the final numerical answer. For the instruction-following tasks (Error Analysis and Problem Generation), responses are scored on criteria such as Instruction Following (IF), Error Diagnosis (ED), Solution Accuracy (SA), and Problem Quality (PQ), on a scale from 1 (lowest) to 5 (highest) reflecting how well the model follows instructions and generates relevant responses.
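
As a concrete illustration of the answer-accuracy side of this evaluation, the sketch below extracts the final number from a model response and compares it with a reference answer. The extraction regex and exact-match tolerance are assumptions for illustration, not the benchmark's published scoring protocol.

```python
# Rough sketch of answer-accuracy scoring for the problem-solving tasks:
# take the last numeric literal in each response and compare it to the reference.
import re
from typing import List, Optional


def extract_final_number(text: str) -> Optional[float]:
    """Return the last numeric literal in the response, if any."""
    matches = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return float(matches[-1]) if matches else None


def answer_accuracy(responses: List[str], references: List[float], tol: float = 1e-6) -> float:
    """Fraction of responses whose final number matches the reference answer."""
    correct = 0
    for resp, ref in zip(responses, references):
        pred = extract_final_number(resp)
        if pred is not None and abs(pred - ref) <= tol:
            correct += 1
    return correct / max(len(references), 1)


print(answer_accuracy(["So the total cost is $11.", "The answer is 42"], [11.0, 41.0]))  # 0.5
```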

Experimental Findings

Baseline Performance

The study evaluates several state-of-the-art LLMs on MathChat, revealing that math-specific LLMs, although highly effective in single-turn QA, underperform on multi-turn tasks. Notably, while these models achieve impressive results on single-round QA datasets such as GSM8K, their accuracy degrades significantly across multi-turn dialogues, reflecting weaker long-context reasoning capabilities.

Supervised Fine-Tuning (SFT) for Enhancement

To enhance LLM performance, the study explores several supervised fine-tuning strategies. Integrating general-purpose instruction-tuning data, such as Alpaca-GPT4 and LIMA, with MathChatSync dialogue data improves model performance on follow-up QA and instruction-following tasks. The resulting models demonstrate a marked improvement in both mathematical accuracy and instruction comprehension.
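
The sketch below illustrates the general data-mixing idea behind such SFT runs: converting single-turn Alpaca-style (instruction, output) pairs into chat format and pooling them with multi-turn dialogue data before fine-tuning. The file names, field names, and chat schema are assumptions for illustration rather than the paper's actual pipeline.

```python
# Illustrative sketch of mixing single-turn instruction data with multi-turn
# dialogue data into one chat-formatted SFT pool. File/field names are hypothetical.
import json
import random
from typing import Dict, List


def alpaca_to_chat(row: Dict) -> List[Dict[str, str]]:
    """Convert a single-turn (instruction, input, output) record into a two-turn chat."""
    prompt = row["instruction"]
    if row.get("input"):
        prompt += "\n" + row["input"]
    return [{"role": "user", "content": prompt},
            {"role": "assistant", "content": row["output"]}]


def load_jsonl(path: str) -> List[Dict]:
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f]


# Hypothetical local copies of the two data sources.
general = [alpaca_to_chat(r) for r in load_jsonl("alpaca_gpt4.jsonl")]
dialogue = [r["messages"] for r in load_jsonl("mathchatsync.jsonl")]  # already multi-turn

mixed = general + dialogue
random.shuffle(mixed)

with open("sft_mix.jsonl", "w", encoding="utf-8") as f:
    for messages in mixed:
        f.write(json.dumps({"messages": messages}) + "\n")

print(f"{len(general)} single-turn + {len(dialogue)} multi-turn = {len(mixed)} examples")
```

The resulting pool can then be fed to any standard chat-style SFT trainer; the key point is the mixture itself, not the particular training framework.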

Implications and Future Directions

Practical Implications

The results of the MathChat benchmark underscore the need for diversified training data to enhance LLM capabilities in real-world applications. For instance, models that excel in multi-turn dialogues can significantly benefit educational tools and interactive mathematical problem-solving assistants, pushing the boundaries of current AI applications.

Theoretical Implications

The findings highlight a crucial open problem: developing LLMs that maintain mathematical problem-solving proficiency while adapting to diverse, multi-turn dialogue requirements. The integration of MathChatSync dialogue data into training regimes offers a promising direction to mitigate the identified limitations, suggesting that enhanced multi-turn mathematical reasoning can be achieved without sacrificing overall problem-solving accuracy.

Future Developments

Scaling up the MathChatSync dialogue dataset, in both quality and volume, emerges as a necessary step for future research. This approach can further refine the ability of LLMs to handle complex mathematical reasoning within conversational contexts, leading to more robust and adaptable AI systems. Additionally, more comprehensive error filters and targeted problem-generation methodologies could refine the dataset's effectiveness, paving the way for future advancements.
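
As one hypothetical example of such an error filter, the sketch below keeps only synthetic dialogues that are structurally well formed: alternating user/assistant turns, no empty messages, and a final assistant turn. These specific checks are assumptions about what a more comprehensive filter might include, not a method described in the paper.

```python
# Hypothetical structural filter for synthetic dialogue data.
from typing import Dict, List


def is_well_formed(messages: List[Dict[str, str]]) -> bool:
    """Keep dialogues that start with a user turn, end with an assistant turn,
    alternate roles strictly, and contain no empty messages."""
    if not messages or messages[0]["role"] != "user" or messages[-1]["role"] != "assistant":
        return False
    for i, msg in enumerate(messages):
        expected = "user" if i % 2 == 0 else "assistant"
        if msg["role"] != expected or not msg["content"].strip():
            return False
    return True


dialogues = [
    [{"role": "user", "content": "2+2?"}, {"role": "assistant", "content": "2 + 2 = 4"}],
    [{"role": "user", "content": "3*5?"}, {"role": "user", "content": "Hello?"}],   # malformed
]
filtered = [d for d in dialogues if is_well_formed(d)]
print(f"kept {len(filtered)} of {len(dialogues)} dialogues")  # kept 1 of 2
```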

Conclusion

The MathChat benchmark addresses a critical gap in LLM evaluation by focusing on multi-turn mathematical reasoning and instruction-following, highlighting the limitations of current math-specialized LLMs. The findings emphasize the potential of integrated dialogue datasets in training more versatile AI systems capable of nuanced understanding and sustained reasoning. This work lays a foundation for future research aimed at developing more generalized mathematical reasoning assistants, contributing to both the theoretical understanding and practical applications of LLMs in real-world contexts.

References

The paper cites a comprehensive list of supporting research, spanning foundational works in mathematical reasoning, advancements in LLM development, and emerging benchmarks for dialogue evaluation. Select references include foundational models like GPT-4, widely recognized datasets like GSM8K, and key studies in error analysis and instruction-following. These references provide a rich context for understanding the contributions and implications of the MathChat benchmark.
