Struc-Bench: Are Large Language Models Really Good at Generating Complex Structured Data? (2309.08963v3)
Abstract: Despite the remarkable capabilities of LLMs such as GPT-4, generating complex, structured tabular data remains challenging. Our study assesses LLMs' proficiency at structuring tables and introduces a novel structure-aware fine-tuning method to improve their performance. We present Struc-Bench, a comprehensive benchmark spanning text-table, HTML, and LaTeX formats, on which we evaluate prominent LLMs (GPT-NeoX-20B, GPT-3.5, GPT-4, and Vicuna). Our proposed FormatCoT derives format-specific instructions from the intended outputs to populate the benchmark. To address the gap in task-centered evaluation, we propose two metrics, P-Score (Prompting Score) and H-Score (Heuristical Score), that more accurately gauge LLM performance. Our experiments show that applying our structure-aware fine-tuning to LLaMA-7B yields substantial performance gains, outperforming its LLM counterparts on most measures. An in-depth error analysis and an ability map spanning six dimensions -- coverage, formatting, reasoning, comprehension, pragmatics, and hallucination -- highlight areas for future improvement and suggest directions for follow-up research. Our code and models can be found at https://github.com/gersteinlab/Struc-Bench.
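The abstract only names the two metrics, so as a concrete illustration, here is a minimal sketch of how a heuristic, H-Score-style table comparison could work: parse the generated and reference text tables into cell grids, then average a structural term (shape agreement) with a content term (fuzzy cell similarity). The parsing rules, the 50/50 weighting, and all function names below are assumptions for illustration, not the paper's implementation.

```python
# A minimal, illustrative sketch of a heuristic table-similarity score in the
# spirit of Struc-Bench's H-Score. NOT the paper's implementation: the cell
# parsing, the equal weighting, and all names here are assumptions.
from difflib import SequenceMatcher

def parse_table(text: str) -> list[list[str]]:
    """Split a '|'-delimited text table into a grid of stripped cells."""
    lines = [ln for ln in text.strip().splitlines() if ln.strip()]
    return [[c.strip() for c in ln.strip().strip("|").split("|")] for ln in lines]

def h_score_sketch(pred: str, ref: str) -> float:
    """Average a structural term (row/column counts) and a content term (cell text)."""
    p, r = parse_table(pred), parse_table(ref)
    if not p or not r:
        return 0.0
    # Structural term: penalize mismatched row and column counts.
    row_sim = min(len(p), len(r)) / max(len(p), len(r))
    col_sim = min(len(p[0]), len(r[0])) / max(len(p[0]), len(r[0]))
    # Content term: mean fuzzy similarity over position-aligned cells.
    sims = [
        SequenceMatcher(None, pc, rc).ratio()
        for pr, rr in zip(p, r)
        for pc, rc in zip(pr, rr)
    ]
    content_sim = sum(sims) / len(sims) if sims else 0.0
    return 0.5 * ((row_sim + col_sim) / 2) + 0.5 * content_sim

# Example: one wrong cell lowers the content term but leaves structure intact.
print(h_score_sketch("| a | b |\n| 1 | 2 |", "| a | b |\n| 1 | 3 |"))  # 0.875
```

Per the abstract, the P-Score instead elicits quality judgments from an LLM via prompting rather than computing a fixed heuristic like the one sketched above.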
- Table-to-text: Describing table region with natural language. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32.
- GPT-NeoX-20B: An open-source autoregressive language model. arXiv preprint arXiv:2204.06745.
- Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901.
- Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality. See https://vicuna.lmsys.org (accessed 14 April 2023).
- Revisiting event argument extraction: Can eae models learn better when being aware of event co-occurrences? arXiv preprint arXiv:2306.00502.
- Table and image generation for investigating knowledge of entities in pre-trained vision and language models. arXiv preprint arXiv:2306.02115.
- Neural text generation from structured data with application to the biography domain. arXiv preprint arXiv:1603.07771.
- BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461.
- A sequence-to-sequence&set model for text-to-table generation. arXiv preprint arXiv:2306.00137.
- Large language model is not a good few-shot information extractor, but a good reranker for hard samples! arXiv preprint arXiv:2303.08559.
- Crosslingual generalization through multitask finetuning. arXiv preprint arXiv:2211.01786.
- The E2E dataset: New challenges for end-to-end generation. arXiv preprint arXiv:1706.09254.
- GPT-4 technical report. arXiv preprint arXiv:2303.08774.
- Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744.
- Gorilla: Large language model connected with massive APIs. arXiv preprint arXiv:2305.15334.
- STable: Table generation framework for encoder-decoder models. arXiv preprint arXiv:2206.04045.
- Is ChatGPT a general-purpose natural language processing task solver? arXiv preprint arXiv:2302.06476.
- Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21(1):5485–5551.
- KnowGL: Knowledge generation and linking from text. arXiv preprint arXiv:2210.13952.
- BLOOM: A 176B-parameter open-access multilingual language model. arXiv preprint arXiv:2211.05100.
- LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.
- Large language models are not fair evaluators. arXiv preprint arXiv:2305.17926.
- Chain-of-thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903.
- WebIE: Faithful and robust information extraction on the web. arXiv preprint arXiv:2305.14293.
- Challenges in data-to-document generation. arXiv preprint arXiv:1707.08052.
- Text-to-table: A new way of information extraction. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2518–2533, Dublin, Ireland. Association for Computational Linguistics.
- A survey of large language models. arXiv preprint arXiv:2303.18223.
- Large language models are effective table-to-text generators, evaluators, and feedback providers. arXiv preprint arXiv:2305.14987.
- RobuT: A systematic study of table QA robustness against human-annotated adversarial perturbations. arXiv preprint arXiv:2306.14321.
- A frustratingly easy approach for entity and relation extraction. arXiv preprint arXiv:2010.12812.