Emergent Mind

Abstract

We analyze the behaviors of open LLMs on the task of data-to-text (D2T) generation, i.e., generating coherent and relevant text from structured data. To avoid the issue of LLM training data contamination with standard benchmarks, we design Quintd - a tool for collecting novel structured data records from public APIs. Using a dataset collected with Quintd and leveraging reference-free evaluation, we analyze model behaviors on five D2T generation tasks. We find that recent open LLMs (Llama2, Mistral, and Zephyr) can generate fluent and coherent text from standard data formats in zero-shot settings. However, we also show that the semantic accuracy of the outputs is a major issue: both according to our GPT-4-based metric and human annotators, more than 80% of the outputs of open LLMs contain a semantic error. We publicly release the code, data, and model outputs.

Overview

  • LLMs are explored for data-to-text (D2T) generation, a task that requires generating text from structured data with fluency and semantic accuracy.

  • The paper introduces Quintd-1, a new benchmark with structured data records in five domains, designed to evaluate LLMs' D2T performance without the confound of training-data contamination.

  • An experimental study with three 7B-parameter open-source LLMs shows that while they can produce fluent text, 80%-91% of their outputs contain semantic errors.

  • The authors suggest a shift in focus towards semantic accuracy, efficiency in processing long data, and the use of reproducible evaluation methods for D2T generation.

  • The research provides insights for creating more accurate LLMs for D2T tasks and considers the challenges of multilinguality and practical applications.

Overview of LLMs and Data-to-Text Generation

LLMs have become widely recognized for their versatile applications in NLP. One intriguing application is data-to-text (D2T) generation, where the challenge lies in creating coherent text from structured data. This requires not only fluent language generation but also semantic accuracy, a notable challenge for LLMs. This blog post discusses an approach to evaluating the performance of LLMs on D2T tasks that sidesteps conventional benchmarks, which may be contaminated by data leaked into the models' training sets.

Quintd-1: A New Benchmark for D2T Evaluation

Researchers have devised Quintd-1, a new benchmark consisting of structured data records across five domains: weather forecasts, product descriptions, sports summaries, health-related time series, and world fact descriptions. Quintd-1 relies on standard data formats such as JSON, CSV, and Markdown, which are well represented in the pretraining corpora of many LLMs. This strategy leverages the in-context learning abilities of these models, allowing evaluation without the need for human-written reference texts.
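The idea of feeding standard-format data directly to a model can be sketched as follows. This is a minimal illustration, not the benchmark's actual prompt or schema: the record fields, the instruction wording, and the `build_prompt` helper are all hypothetical.

```python
import json

# Hypothetical weather record in the spirit of Quintd-1's JSON inputs
# (field names are illustrative, not the benchmark's actual schema).
record = {
    "city": "Brno",
    "date": "2024-01-15",
    "forecast": [
        {"time": "09:00", "temp_c": -3, "conditions": "light snow"},
        {"time": "15:00", "temp_c": 1, "conditions": "overcast"},
    ],
}

def build_prompt(data: dict, task: str = "a weather forecast") -> str:
    """Serialize the record as-is and wrap it in a generic zero-shot instruction."""
    return (
        f"Based on the following data, write {task} in fluent natural language.\n"
        "Mention only facts supported by the data.\n\n"
        f"```json\n{json.dumps(data, indent=2)}\n```\n"
    )

prompt = build_prompt(record)
print(prompt)
```

Because formats like JSON appear frequently in pretraining corpora, the record can be passed verbatim; no task-specific linearization or hand-crafted templates per domain are needed.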

Methodology and Model Behavior

The study explores the capabilities of three open-source 7B-parameter LLMs—Llama-2, Mistral, and Zephyr—on D2T tasks across the five domains. The experimental setup is deliberately simple: a single prompt template is shared across all tasks, testing whether the models can handle unseen data with minimal prompt engineering. The findings show that while the models produce fluent text, roughly 80%–91% of the outputs contain at least one semantic error, highlighting the struggle with semantic accuracy.
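The headline error rate can be understood as a simple aggregation over per-output annotations (from the GPT-4-based metric or human annotators): the share of outputs with at least one semantic error. The annotation format below is illustrative, not the paper's exact schema.

```python
# Toy per-output annotations: each entry lists the semantic errors
# found in one model output (error labels here are made up for illustration).
annotations = [
    {"output_id": 0, "errors": ["incorrect_number"]},
    {"output_id": 1, "errors": []},
    {"output_id": 2, "errors": ["not_checkable", "incorrect_fact"]},
]

def error_rate(anns: list[dict]) -> float:
    """Fraction of outputs containing at least one semantic error."""
    flawed = sum(1 for a in anns if a["errors"])
    return flawed / len(anns)

print(f"{error_rate(annotations):.0%} of outputs contain a semantic error")
```

With this toy data the rate is 2/3; in the paper's experiments the corresponding figure lands between 80% and 91% depending on the model and annotator.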

Moving Forward with D2T Generation

The insights from this work prompt several recommendations. Primarily, the focus should shift from linguistic fluency to semantic accuracy: improving content selection and factual correctness. Efficiency is another consideration, especially when dealing with long data inputs. Finally, the research underscores the importance of reproducible and unbiased evaluation methods, signaling a path forward for future studies using LLMs for D2T generation.

The paper paves the way for better D2T systems by providing detailed observations, data, and insights that can help in creating more reliable and accurate language generation models in the future. It also opens up considerations such as multilinguality and real-world application of D2T systems. Given the complexities and nuances of natural language, the journey of refining LLMs to impeccably perform D2T tasks is ongoing, yet promising.
