- The paper finds that standard LLMs, including GPT-3.5 and GPT-4, significantly drop in performance on logical reasoning tasks when task structures are altered.
- The paper demonstrates that instruction fine-tuning combined with logic-driven data augmentation improves model generalisation when the training set is large, while the gains do not carry over to smaller training sets.
- The paper reveals that chain-of-thought prompting offers limited benefits and that increased model size does not necessarily correlate with enhanced robustness in reasoning tasks.
Robustness of LLMs in Logical Reasoning
This paper (2310.09430) addresses the robustness of LLMs when performing logical reasoning, an area where their generalisation capabilities are not fully understood. The paper introduces augmented datasets and evaluates the impact of task-structure variations, fine-tuning, and data augmentation techniques on model performance.
Methodology and Data Perturbation
To evaluate the generalisation and robustness of LLMs, the authors created three new datasets, "ReClor-plus", "LogiQA-plus", and "LogiQAv2-plus", by applying data-perturbation procedures to the existing ReClor, LogiQA, and LogiQAv2 datasets. Each augmented dataset contains three subsets, each designed to probe a different aspect of LLM reasoning (a minimal sketch of these perturbations follows the list):
- Shuffle-Order: Shuffles the order of the options to test if the model reasons or memorises the position of the correct answer.
- Replace-Answer: Replaces the correct answer with "none of the other options is correct" to evaluate if the model understands that other options are incorrect.
- Shuffle-RepAns: Combines the variations from Shuffle-Order and Replace-Answer to evaluate more complex reasoning.
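To make the three perturbations concrete, here is a minimal Python sketch, assuming each item is a dict with `context`, `question`, `options`, and a gold `label` index; the field names and the example item are hypothetical, not the datasets' actual schema.

```python
import random

def shuffle_order(example, seed=0):
    """Shuffle-Order: permute the answer options and re-index the gold label."""
    rng = random.Random(seed)
    options = list(example["options"])
    gold_text = options[example["label"]]
    rng.shuffle(options)
    return {**example, "options": options, "label": options.index(gold_text)}

def replace_answer(example, filler="none of the other options is correct"):
    """Replace-Answer: overwrite the correct option with a 'none of the above'-style filler."""
    options = list(example["options"])
    options[example["label"]] = filler
    return {**example, "options": options, "label": example["label"]}

def shuffle_repans(example, seed=0):
    """Shuffle-RepAns: apply Replace-Answer, then Shuffle-Order."""
    return shuffle_order(replace_answer(example), seed=seed)

# Hypothetical item in the assumed format.
example = {
    "context": "All committee members attended the vote. Lee did not attend the vote.",
    "question": "Which statement must be true?",
    "options": ["Lee is a committee member.", "Lee is not a committee member.",
                "The vote was postponed.", "Some members did not attend."],
    "label": 1,
}
print(shuffle_repans(example))
```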
Experimental Findings
The experiments yielded several key findings regarding the robustness of LLMs:
- Existing LLMs like GPT-3.5 and GPT-4 perform well on logical reasoning tasks in the original format, but their performance drops significantly on the new formats, indicating a potential lack of generalised logical reasoning capabilities and possible data leakage.
- Instruction fine-tuning can help LLMs increase their generalisation and robustness on logical reasoning tasks. Logic-driven data augmentation for fine-tuning, combined with prompting, can enhance the generalisation performance of both discriminative and generative LLMs.
- For large training sets, a high ratio of perturbed data can improve generative LLMs' performance on most logical reasoning tasks; this benefit does not carry over to small training sets (see the sketch after this list).
- There is no direct correlation between model size (from LLaMA-7B to LLaMA-65B) and generalisation or robustness on logical reasoning tasks: a larger model does not necessarily generalise better or prove more robust.
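A rough sketch of the perturbation-ratio setup, reusing the perturbation helpers above; whether perturbed examples replace the originals or are added alongside them, and the 0.8 ratio shown, are assumptions rather than the paper's exact recipe.

```python
import random

def build_finetuning_mix(train_set, perturb_fn, perturb_ratio, seed=0):
    """Replace a fraction `perturb_ratio` of the training examples with
    perturbed variants before instruction fine-tuning."""
    rng = random.Random(seed)
    n_perturbed = int(len(train_set) * perturb_ratio)
    chosen = set(rng.sample(range(len(train_set)), n_perturbed))
    return [perturb_fn(ex) if i in chosen else ex for i, ex in enumerate(train_set)]

# Hypothetical usage: a high perturbation ratio on a large training set,
# a low (or zero) ratio when only a small training set is available.
# mixed_train = build_finetuning_mix(reclor_train, shuffle_repans, perturb_ratio=0.8)
```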
Impact of Chain-of-Thought Prompting
The paper also explores whether chain-of-thought (CoT) prompting improves the performance of LLMs. The results indicate that CoT prompting does not lead to significant improvements, except for GPT-4, which shows systematic accuracy gains on the Shuffle-RepAns task. The authors argue that because CoT prompting adds no explicitly useful information to the input, and the models have not been trained to respond correctly to such prompts, it is unsurprising that it offers little help in complex logical reasoning scenarios.
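For reference, a minimal sketch of how a zero-shot CoT prompt could be assembled for a perturbed item (e.g. from Shuffle-RepAns); the template wording is an assumption, not the paper's actual prompt.

```python
def build_cot_prompt(example):
    """Zero-shot chain-of-thought prompt for a multiple-choice item; the wording
    is illustrative, not the exact template used in the paper."""
    letters = "ABCDE"
    options = "\n".join(f"{letters[i]}. {opt}" for i, opt in enumerate(example["options"]))
    return (
        f"Passage: {example['context']}\n"
        f"Question: {example['question']}\n"
        f"Options:\n{options}\n"
        "Let's think step by step, then answer with a single option letter."
    )
```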
Data Augmentation and Transfer Learning
The paper investigates the impact of logic-driven data augmentation on the generalisation and robustness of LLMs. The findings suggest that logic-driven data augmentation is detrimental to the generalisation and robustness of LLMs trained with next-token prediction on logical reasoning tasks. The authors hypothesise that logic-driven data augmentation does not map directly onto the next-token prediction objective, which may disrupt the model's training.
The paper also performs transfer-learning experiments to investigate the extent to which incorporating variations of task structure into the training set helps models improve on logical reasoning tasks. The results indicate that a larger training dataset combined with a higher perturbation ratio is beneficial.
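To quantify what "beneficial" means here, one would compare accuracy on the original and perturbed test formats. The sketch below assumes the helpers defined earlier and a placeholder `predict_fn` that returns an option index; both names are hypothetical.

```python
def accuracy(predict_fn, dataset):
    """Fraction of items where the predicted option index matches the gold label."""
    return sum(predict_fn(ex) == ex["label"] for ex in dataset) / len(dataset)

# Hypothetical robustness report: compare the original test split against its
# perturbed variants (predict_fn and reclor_test are placeholders).
# variants = {
#     "original": reclor_test,
#     "shuffle-order": [shuffle_order(ex) for ex in reclor_test],
#     "replace-answer": [replace_answer(ex) for ex in reclor_test],
#     "shuffle-repans": [shuffle_repans(ex) for ex in reclor_test],
# }
# for name, split in variants.items():
#     print(f"{name}: {accuracy(predict_fn, split):.3f}")
```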
Conclusions
The paper concludes that LLMs exhibit significant limitations in generalisation and robustness when applied to logical reasoning tasks. While instruction fine-tuning can improve performance, CoT prompting alone is insufficient for robust reasoning. The amount of data perturbation and task structure modifications required for improved adaptability depends on the dataset size. Model size, within the same LLaMA base model framework, does not guarantee better generalisation or robustness. Logic-driven data augmentation benefits fine-tuned discriminative models but can be detrimental to generative models.