- The paper finds that standard LLMs, including GPT-3.5 and GPT-4, significantly drop in performance on logical reasoning tasks when task structures are altered.
- The paper demonstrates that instruction fine-tuning combined with logic-driven data augmentation improves model generalisation when the training set is large, while the gains do not carry over to smaller training sets.
- The paper reveals that chain-of-thought prompting offers limited benefits and that increased model size does not necessarily correlate with enhanced robustness in reasoning tasks.
Robustness of LLMs in Logical Reasoning
This paper (2310.09430) addresses the robustness of LLMs when performing logical reasoning, an area where their generalisation capabilities are not fully understood. The paper introduces augmented datasets and evaluates the impact of task-structure variations, fine-tuning, and data augmentation techniques on model performance.
Methodology and Data Perturbation
To evaluate the generalisation and robustness of LLMs, the authors created three new datasets, "ReClor-plus", "LogiQA-plus", and "LogiQAv2-plus", by applying data-perturbation procedures to the existing ReClor, LogiQA, and LogiQAv2 datasets. Each augmented dataset contains three subsets, each designed to probe a different aspect of LLM reasoning (a minimal sketch of these perturbations follows the list):
- Shuffle-Order: Shuffles the order of the options to test if the model reasons or memorises the position of the correct answer.
- Replace-Answer: Replaces the correct answer with "none of the other options is correct" to evaluate if the model understands that other options are incorrect.
- Shuffle-RepAns: Combines the variations from Shuffle-Order and Replace-Answer to evaluate more complex reasoning.
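To make the three perturbations concrete, here is a minimal Python sketch, assuming each item is a dict with `context`, `question`, `options`, and a gold `label` index; the field names and the example item are hypothetical, not the datasets' actual schema.

```python
import random

def shuffle_order(example, seed=0):
    """Shuffle-Order: permute the answer options and re-index the gold label."""
    rng = random.Random(seed)
    options = list(example["options"])
    gold_text = options[example["label"]]
    rng.shuffle(options)
    return {**example, "options": options, "label": options.index(gold_text)}

def replace_answer(example, filler="none of the other options is correct"):
    """Replace-Answer: overwrite the correct option with a 'none of the above'-style filler."""
    options = list(example["options"])
    options[example["label"]] = filler
    return {**example, "options": options, "label": example["label"]}

def shuffle_repans(example, seed=0):
    """Shuffle-RepAns: apply Replace-Answer, then Shuffle-Order."""
    return shuffle_order(replace_answer(example), seed=seed)

# Hypothetical item in the assumed format.
example = {
    "context": "All committee members attended the vote. Lee did not attend the vote.",
    "question": "Which statement must be true?",
    "options": ["Lee is a committee member.", "Lee is not a committee member.",
                "The vote was postponed.", "Some members did not attend."],
    "label": 1,
}
print(shuffle_repans(example))
```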
Experimental Findings
The experiments yielded several key findings regarding the robustness of LLMs:
- Existing LLMs like GPT-3.5 and GPT-4 perform well on logical reasoning tasks in the original format, but their performance drops significantly on the new formats, indicating a potential lack of generalised logical reasoning capabilities and possible data leakage.
- Instruction fine-tuning can help LLMs increase their generalisation and robustness on logical reasoning tasks. Logic-driven data augmentation for fine-tuning, combined with prompting, can enhance the generalisation performance of both discriminative and generative LLMs.
- For large training sets, a high ratio of perturbed data can improve generative LLMs' performance on most logical reasoning tasks; this benefit does not carry over to small training sets (see the sketch after this list).
- There is no direct correlation between model size (from LLaMA-7B to LLaMA-65B) and generalisation or robustness on logical reasoning tasks: a larger model does not necessarily generalise better or prove more robust.
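A rough sketch of the perturbation-ratio setup, reusing the perturbation helpers above; whether perturbed examples replace the originals or are added alongside them, and the 0.8 ratio shown, are assumptions rather than the paper's exact recipe.

```python
import random

def build_finetuning_mix(train_set, perturb_fn, perturb_ratio, seed=0):
    """Replace a fraction `perturb_ratio` of the training examples with
    perturbed variants before instruction fine-tuning."""
    rng = random.Random(seed)
    n_perturbed = int(len(train_set) * perturb_ratio)
    chosen = set(rng.sample(range(len(train_set)), n_perturbed))
    return [perturb_fn(ex) if i in chosen else ex for i, ex in enumerate(train_set)]

# Hypothetical usage: a high perturbation ratio on a large training set,
# a low (or zero) ratio when only a small training set is available.
# mixed_train = build_finetuning_mix(reclor_train, shuffle_repans, perturb_ratio=0.8)
```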
Impact of Chain-of-Thought Prompting
The paper also explores whether chain-of-thought (CoT) prompting improves the performance of LLMs. The results indicate that CoT prompting does not lead to significant improvements, except for GPT-4, which shows systematic accuracy gains on the Shuffle-RepAns task. The authors argue that because CoT prompting adds no explicitly useful information to the input, and the models have not been trained to respond correctly to such prompts, it is unsurprising that it offers little help in complex logical reasoning scenarios.
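For reference, a minimal sketch of how a zero-shot CoT prompt could be assembled for a perturbed item (e.g. from Shuffle-RepAns); the template wording is an assumption, not the paper's actual prompt.

```python
def build_cot_prompt(example):
    """Zero-shot chain-of-thought prompt for a multiple-choice item; the wording
    is illustrative, not the exact template used in the paper."""
    letters = "ABCDE"
    options = "\n".join(f"{letters[i]}. {opt}" for i, opt in enumerate(example["options"]))
    return (
        f"Passage: {example['context']}\n"
        f"Question: {example['question']}\n"
        f"Options:\n{options}\n"
        "Let's think step by step, then answer with a single option letter."
    )
```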
Data Augmentation and Transfer Learning
The paper investigates the impact of logic-driven data augmentation on the generalisation and robustness of LLMs. The findings suggest that logic-driven data augmentation is detrimental to the generalisation and robustness of LLMs trained with next-token prediction on logical reasoning tasks. The authors hypothesise that logic-driven data augmentation does not map directly onto the next-token prediction objective, which may disrupt the model's training.
The paper also performs transfer-learning experiments to investigate the extent to which incorporating variations of task structure into the training set helps models improve on logical reasoning tasks. The results indicate that a larger training dataset combined with a higher perturbation ratio is beneficial.
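To quantify what "beneficial" means here, one would compare accuracy on the original and perturbed test formats. The sketch below assumes the helpers defined earlier and a placeholder `predict_fn` that returns an option index; both names are hypothetical.

```python
def accuracy(predict_fn, dataset):
    """Fraction of items where the predicted option index matches the gold label."""
    return sum(predict_fn(ex) == ex["label"] for ex in dataset) / len(dataset)

# Hypothetical robustness report: compare the original test split against its
# perturbed variants (predict_fn and reclor_test are placeholders).
# variants = {
#     "original": reclor_test,
#     "shuffle-order": [shuffle_order(ex) for ex in reclor_test],
#     "replace-answer": [replace_answer(ex) for ex in reclor_test],
#     "shuffle-repans": [shuffle_repans(ex) for ex in reclor_test],
# }
# for name, split in variants.items():
#     print(f"{name}: {accuracy(predict_fn, split):.3f}")
```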
Conclusions
The paper concludes that LLMs exhibit significant limitations in generalisation and robustness when applied to logical reasoning tasks. While instruction fine-tuning can improve performance, CoT prompting alone is insufficient for robust reasoning. The amount of data perturbation and task structure modifications required for improved adaptability depends on the dataset size. Model size, within the same LLaMA base model framework, does not guarantee better generalisation or robustness. Logic-driven data augmentation benefits fine-tuned discriminative models but can be detrimental to generative models.