Struc-Bench: Are Large Language Models Really Good at Generating Complex Structured Data? (2309.08963v3)
Abstract: Despite the remarkable capabilities of LLMs such as GPT-4, generating complex, structured tabular data remains challenging. Our study assesses LLMs' proficiency at structuring tables and introduces a novel structure-aware fine-tuning method to improve their performance. We present Struc-Bench, a comprehensive benchmark spanning text-table, HTML, and LaTeX formats, on which we evaluate prominent LLMs (GPT-NeoX-20B, GPT-3.5, GPT-4, and Vicuna). Our proposed FormatCoT derives format-specific instructions from the intended outputs to populate the benchmark. To address the gap in task-centered evaluation, we propose two metrics, P-Score (Prompting Score) and H-Score (Heuristical Score), that more accurately gauge LLM performance. Our experiments show that applying our structure-aware fine-tuning to LLaMA-7B yields substantial performance gains, outperforming its LLM counterparts on most measures. An in-depth error analysis and an ability map spanning six dimensions -- coverage, formatting, reasoning, comprehension, pragmatics, and hallucination -- highlight areas for future improvement and suggest directions for follow-up research. Our code and models can be found at https://github.com/gersteinlab/Struc-Bench.
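The abstract only names the two metrics, so as a concrete illustration, here is a minimal sketch of how a heuristic, H-Score-style table comparison could work: parse the generated and reference text tables into cell grids, then average a structural term (shape agreement) with a content term (fuzzy cell similarity). The parsing rules, the 50/50 weighting, and all function names below are assumptions for illustration, not the paper's implementation.

```python
# A minimal, illustrative sketch of a heuristic table-similarity score in the
# spirit of Struc-Bench's H-Score. NOT the paper's implementation: the cell
# parsing, the equal weighting, and all names here are assumptions.
from difflib import SequenceMatcher

def parse_table(text: str) -> list[list[str]]:
    """Split a '|'-delimited text table into a grid of stripped cells."""
    lines = [ln for ln in text.strip().splitlines() if ln.strip()]
    return [[c.strip() for c in ln.strip().strip("|").split("|")] for ln in lines]

def h_score_sketch(pred: str, ref: str) -> float:
    """Average a structural term (row/column counts) and a content term (cell text)."""
    p, r = parse_table(pred), parse_table(ref)
    if not p or not r:
        return 0.0
    # Structural term: penalize mismatched row and column counts.
    row_sim = min(len(p), len(r)) / max(len(p), len(r))
    col_sim = min(len(p[0]), len(r[0])) / max(len(p[0]), len(r[0]))
    # Content term: mean fuzzy similarity over position-aligned cells.
    sims = [
        SequenceMatcher(None, pc, rc).ratio()
        for pr, rr in zip(p, r)
        for pc, rc in zip(pr, rr)
    ]
    content_sim = sum(sims) / len(sims) if sims else 0.0
    return 0.5 * ((row_sim + col_sim) / 2) + 0.5 * content_sim

# Example: one wrong cell lowers the content term but leaves structure intact.
print(h_score_sketch("| a | b |\n| 1 | 2 |", "| a | b |\n| 1 | 3 |"))  # 0.875
```

Per the abstract, the P-Score instead elicits quality judgments from an LLM via prompting rather than computing a fixed heuristic like the one sketched above.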
- Table-to-text: Describing table region with natural language. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32.
- GPT-NeoX-20B: An open-source autoregressive language model. arXiv preprint arXiv:2204.06745.
- Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901.
- Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality. See https://vicuna.lmsys.org (accessed 14 April 2023).
- Revisiting event argument extraction: Can eae models learn better when being aware of event co-occurrences? arXiv preprint arXiv:2306.00502.
- Table and image generation for investigating knowledge of entities in pre-trained vision and language models. arXiv preprint arXiv:2306.02115.
- Neural text generation from structured data with application to the biography domain. arXiv preprint arXiv:1603.07771.
- BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461.
- A sequence-to-sequence&set model for text-to-table generation. arXiv preprint arXiv:2306.00137.
- Large language model is not a good few-shot information extractor, but a good reranker for hard samples! arXiv preprint arXiv:2303.08559.
- Crosslingual generalization through multitask finetuning. arXiv preprint arXiv:2211.01786.
- The E2E dataset: New challenges for end-to-end generation. arXiv preprint arXiv:1706.09254.
- GPT-4 technical report. arXiv preprint arXiv:2303.08774.
- Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744.
- Gorilla: Large language model connected with massive APIs. arXiv preprint arXiv:2305.15334.
- STable: Table generation framework for encoder-decoder models. arXiv preprint arXiv:2206.04045.
- Is ChatGPT a general-purpose natural language processing task solver? arXiv preprint arXiv:2302.06476.
- Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21(1):5485–5551.
- KnowGL: Knowledge generation and linking from text. arXiv preprint arXiv:2210.13952.
- BLOOM: A 176B-parameter open-access multilingual language model. arXiv preprint arXiv:2211.05100.
- LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.
- Large language models are not fair evaluators. arXiv preprint arXiv:2305.17926.
- Chain-of-thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903.
- WebIE: Faithful and robust information extraction on the web. arXiv preprint arXiv:2305.14293.
- Challenges in data-to-document generation. arXiv preprint arXiv:1707.08052.
- Text-to-table: A new way of information extraction. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2518–2533, Dublin, Ireland. Association for Computational Linguistics.
- A survey of large language models. arXiv preprint arXiv:2303.18223.
- Large language models are effective table-to-text generators, evaluators, and feedback providers. arXiv preprint arXiv:2305.14987.
- RobuT: A systematic study of table QA robustness against human-annotated adversarial perturbations. arXiv preprint arXiv:2306.14321.
- A frustratingly easy approach for entity and relation extraction. arXiv preprint arXiv:2010.12812.