T-SciQ: Teaching Multimodal Chain-of-Thought Reasoning via Mixed Large Language Model Signals for Science Question Answering (2305.03453v4)
Abstract: LLMs have recently demonstrated exceptional performance on various NLP tasks. They have also shown the ability to perform chain-of-thought (CoT) reasoning to solve complex problems. Recent studies have explored CoT reasoning in complex multimodal scenarios, such as the science question answering task, by fine-tuning multimodal models with high-quality human-annotated CoT rationales. However, collecting high-quality CoT rationales is usually time-consuming and costly. Moreover, such annotated rationales are often inaccurate because they omit essential external information. To address these issues, we propose a novel method termed T-SciQ that aims at teaching science question answering with LLM signals. The T-SciQ approach generates high-quality CoT rationales as teaching signals and uses them to train much smaller models to perform CoT reasoning over complex modalities. Additionally, we introduce a novel data mixing strategy to produce more effective teaching data samples for both simple and complex science question answering problems. Extensive experimental results show that our T-SciQ method achieves new state-of-the-art performance on the ScienceQA benchmark, with an accuracy of 96.18%. Moreover, our approach outperforms the strongest fine-tuned baseline by 4.5%. The code is publicly available at https://github.com/T-SciQ/T-SciQ.
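To make the abstract's pipeline concrete, here is a minimal, self-contained sketch of the two ideas it names: querying a teacher LLM for CoT rationales on gold-labeled questions, and mixing plain-answer targets with full-rationale targets depending on question difficulty. The `query_llm` stub, the prompt template, and the sentence-count complexity proxy are illustrative assumptions, not the authors' exact implementation.

```python
# A minimal sketch of the teacher-signal pipeline described in the abstract,
# assuming a generic text-only LLM API. `query_llm`, the prompt template, and
# the complexity proxy are illustrative assumptions, not the paper's method.
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Sample:
    question: str
    choices: List[str]
    answer: str           # gold answer label, e.g. "(B) melting"
    rationale: str = ""   # teacher-generated CoT, filled in below


def query_llm(prompt: str) -> str:
    """Hypothetical teacher-LLM call; swap in a real provider's API.
    Returns a canned rationale here so the sketch runs standalone."""
    return ("Melting is the change of state from a solid to a liquid. "
            "The ice cube absorbs heat and becomes liquid water.")


def generate_rationale(s: Sample) -> str:
    # Prompt the teacher LLM to justify the *gold* answer step by step,
    # producing a CoT rationale to serve as a teaching signal.
    prompt = (
        f"Question: {s.question}\n"
        f"Options: {' | '.join(s.choices)}\n"
        f"Answer: {s.answer}\n"
        "Explain step by step why this answer is correct."
    )
    return query_llm(prompt)


def build_teaching_data(samples: List[Sample],
                        max_simple_sents: int = 2) -> List[Dict[str, str]]:
    """Mix two kinds of targets: plain answers for simple questions and
    full CoT rationales for complex ones. Complexity is approximated here
    by the rationale's sentence count -- an illustrative proxy only."""
    data = []
    for s in samples:
        s.rationale = generate_rationale(s)
        n_sents = s.rationale.count(".")
        if n_sents > max_simple_sents:   # complex: teach the full chain
            target = f"{s.rationale} Therefore, the answer is {s.answer}."
        else:                            # simple: teach the answer directly
            target = f"The answer is {s.answer}."
        data.append({"input": f"{s.question} Options: {' | '.join(s.choices)}",
                     "target": target})
    return data


if __name__ == "__main__":
    sample = Sample(
        question="An ice cube is left in the sun. What change of state occurs?",
        choices=["(A) freezing", "(B) melting", "(C) condensation"],
        answer="(B) melting",
    )
    for record in build_teaching_data([sample]):
        print(record["input"], "->", record["target"])
```

The resulting `input`/`target` pairs would then be used to fine-tune a much smaller (multimodal) student model; in practice the teacher prompts and the mixing policy would be tuned per dataset.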