G-LLaVA: Solving Geometric Problem with Multi-Modal Large Language Model (2312.11370v1)
Abstract: LLMs have shown remarkable proficiency in human-level reasoning and generation, which has encouraged extensive research on their application to mathematical problem solving. However, current work has focused largely on text-based mathematical problems, with limited investigation into problems that involve geometric information. To address this gap, we aim to enable LLMs to solve geometric problems by understanding image input. We first analyze the limitations of current Multimodal LLMs (MLLMs) in this area: they struggle to accurately comprehend basic geometric elements and their relationships. To overcome these challenges, we take advantage of the unique characteristics of geometric problems (such as their geometric logical forms and geometric scalability) and the capacity of textual LLMs to build an enriched multimodal geometry dataset from existing data. The augmented dataset, Geo170K, contains more than 170K geometric image-caption and question-answer pairs. Using our constructed Geo170K dataset, we develop G-LLaVA, which demonstrates exceptional performance in solving geometric problems, significantly outperforming GPT-4-V on the MathVista benchmark with only 7B parameters.
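As a rough illustration of the dataset-construction recipe sketched in the abstract (not the authors' released pipeline), the snippet below shows how a text-only LLM could be prompted to expand an annotated geometry problem into an image caption and additional question-answer pairs. The client, model choice, prompt wording, and record fields are assumptions made for illustration only.

```python
# Hypothetical sketch of text-LLM-driven geometry data augmentation.
# The client, model name, prompt, and field names are illustrative assumptions,
# not the authors' released code.
import json
from openai import OpenAI  # any chat-completion client could be substituted

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def expand_geometry_item(item: dict) -> dict:
    """Turn one annotated geometry problem into caption + extra QA data."""
    prompt = (
        "You are given a geometry problem, its answer, and its logical form.\n"
        f"Question: {item['question']}\n"
        f"Answer: {item['answer']}\n"
        f"Logical form: {item['logic_form']}\n\n"
        "1) Write a precise caption describing the geometric elements in the "
        "figure and their relationships.\n"
        "2) Write two rephrased question-answer pairs that keep the same values."
    )
    reply = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    # A real pipeline would parse and validate the reply; kept raw for brevity.
    return {"image": item["image"], "generated": reply.choices[0].message.content}


if __name__ == "__main__":
    seed = {
        "image": "geoqa/0001.png",
        "question": "In triangle ABC, angle A = 50 degrees and angle B = 60 degrees. Find angle C.",
        "answer": "70 degrees",
        "logic_form": "Triangle(A,B,C), Angle(A)=50, Angle(B)=60",
    }
    print(json.dumps(expand_geometry_item(seed), indent=2))
```

Run over a seed corpus of annotated problems, a generator along these lines could plausibly scale to the hundreds of thousands of caption and QA pairs reported for Geo170K, though the exact prompts, filtering, and seed data used by the authors are not specified here.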
- Synthesis of solutions for shaded area geometry problems. In The Thirtieth International FLAIRS Conference.
- Qwen-VL: A versatile vision-language model for understanding, localization, text reading, and beyond.
- Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862.
- Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901.
- Jie Cao and Jing Xiao. 2022. An augmented benchmark dataset for geometric question answering through dual parallel text encoding. In Proceedings of the 29th International Conference on Computational Linguistics, pages 1511–1520.
- UniGeo: Unifying geometry logical reasoning via reformulating mathematical expression. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 3313–3323.
- GeoQA: A geometric question answering benchmark towards multimodal numerical reasoning. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 513–523, Online. Association for Computational Linguistics.
- Shikra: Unleashing multimodal LLM's referential dialogue magic.
- Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality.
- PaLM: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311.
- Training verifiers to solve math word problems.
- InstructBLIP: Towards general-purpose vision-language models with instruction tuning.
- Specializing smaller language models towards multi-step reasoning.
- Self-guided noise-free data generation for efficient zero-shot learning.
- LLaMA-Adapter V2: Parameter-efficient visual instruction model.
- Google. 2023. Gemini: A family of highly capable multimodal models.
- ToRA: A tool-integrated reasoning agent for mathematical problem solving.
- Training compute-optimal large language models. arXiv preprint arXiv:2203.15556.
- Draft, sketch, and prove: Guiding formal theorem provers with informal proofs. In The Eleventh International Conference on Learning Representations.
- LISA: Reasoning segmentation via large language model. arXiv preprint arXiv:2308.00692.
- BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models.
- UniMath: A foundational and multimodal mathematical reasoner. In EMNLP.
- Visual instruction tuning.
- MathVista: Evaluating math reasoning in visual contexts with GPT-4V, Bard, and other large multimodal models.
- Inter-GPS: Interpretable geometry problem solving with formal language and symbolic reasoning. In The 59th Annual Meeting of the Association for Computational Linguistics (ACL).
- WizardMath: Empowering mathematical reasoning for large language models via reinforced Evol-Instruct.
- Generating training data with language models: Towards zero-shot language understanding. arXiv preprint arXiv:2202.04538.
- A symbolic characters aware model for solving geometry problems. In Proceedings of the 31st ACM International Conference on Multimedia, MM ’23, page 7767–7775, New York, NY, USA. Association for Computing Machinery.
- OpenAI. 2023. GPT-4 technical report.
- Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744.
- Instruction tuning with GPT-4.
- DetGPT: Detect what you need via reasoning.
- PerceptionGPT: Effectively fusing visual perception into LLM.
- Learning transferable visual models from natural language supervision.
- From textbooks to knowledge: A case study in harvesting axiomatic knowledge from textbooks to solve geometry problems. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 773–784.
- Mrinmaya Sachan and Eric Xing. 2017. Learning to solve geometry problems from natural language demonstrations in textbooks. In Proceedings of the 6th Joint Conference on Lexical and Computational Semantics (*SEM 2017), pages 251–261.
- BLOOM: A 176B-parameter open-access multilingual language model. arXiv preprint arXiv:2211.05100.
- Solving geometry problems: Combining text and diagram interpretation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1466–1476.
- Using DeepSpeed and Megatron to train Megatron-Turing NLG 530B, a large-scale generative language model. arXiv preprint arXiv:2201.11990.
- PandaGPT: One model to instruction-follow them all.
- Stanford Alpaca: An instruction-following LLaMA model. https://github.com/tatsu-lab/stanford_alpaca.
- LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.
- Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171.
- Self-instruct: Aligning language models with self-generated instructions.
- Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903.
- Large language models are better reasoners with self-verification. CoRR, abs/2212.09561.
- The dawn of LMMs: Preliminary explorations with GPT-4V(ision).
- ZeroGen: Efficient zero-shot learning via dataset generation. In Empirical Methods in Natural Language Processing.
- ProGen: Progressive zero-shot dataset generation via in-context feedback. In Findings of the Association for Computational Linguistics: EMNLP 2022.
- MetaMath: Bootstrap your own mathematical questions for large language models.
- MMMU: A massive multi-discipline multimodal understanding and reasoning benchmark for expert AGI.
- MAmmoTH: Building math generalist models through hybrid instruction tuning.
- GPT4RoI: Instruction tuning large language model on region-of-interest.
- SEGO: Sequential subgoal optimization for mathematical problem-solving. arXiv preprint arXiv:2310.12960.
- Decomposing the enigma: Subgoal-based demonstration learning for formal theorem proving. arXiv preprint arXiv:2305.16366.
- Least-to-most prompting enables complex reasoning in large language models. arXiv preprint arXiv:2205.10625.
- MiniGPT-4: Enhancing vision-language understanding with advanced large language models.