
Beyond Traditional Benchmarks: Analyzing Behaviors of Open LLMs on Data-to-Text Generation (2401.10186v3)

Published 18 Jan 2024 in cs.CL

Abstract: We analyze the behaviors of open LLMs on the task of data-to-text (D2T) generation, i.e., generating coherent and relevant text from structured data. To avoid the issue of LLM training data contamination with standard benchmarks, we design Quintd - a tool for collecting novel structured data records from public APIs. We find that open LLMs (Llama 2, Mistral, and Zephyr) can generate fluent and coherent texts in zero-shot settings from data in common formats collected with Quintd. However, we show that the semantic accuracy of the outputs is a major issue: both according to human annotators and our reference-free metric based on GPT-4, more than 80% of the outputs of open LLMs contain at least one semantic error. We publicly release the code, data, and model outputs.


Summary

  • The paper introduces Quintd-1, a new benchmark to evaluate semantic accuracy in data-to-text generation.
  • It compares three 7B-parameter open LLMs using a uniform prompting method across diverse domains.
  • Findings reveal that while models produce fluent language, 80%–91% of outputs contain semantic errors, emphasizing the need for improved evaluation methods.

Overview of LLMs and Data-to-Text Generation

Large language models (LLMs) are now applied across a wide range of NLP tasks. One of them is data-to-text (D2T) generation: producing coherent and relevant text from structured data. The task demands not only fluent output but also semantic accuracy, meaning the text must faithfully reflect the input data, and this remains difficult for LLMs. The paper evaluates open LLMs on D2T without relying on standard benchmarks, which may have leaked into the models' training data and could therefore give misleading results.

Quintd-1: A New Benchmark for D2T Evaluation

The researchers introduce Quintd-1, a benchmark of structured data records from five domains: weather forecasts, product descriptions, sports summaries, health-related time series, and world fact descriptions. The records were collected with Quintd, a tool that gathers novel data from public APIs. Inputs are provided in standard formats such as JSON, CSV, and Markdown, which are well represented in the pretraining corpora of many LLMs. This strategy leverages the in-context learning abilities of these models, so the evaluation does not need human-written reference texts.
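To make the role of the input formats concrete, here is a minimal sketch of one record serialized as JSON, CSV, and a Markdown table. The record, its field names, and the helper functions are hypothetical and invented for illustration; they are not taken from Quintd-1 or from the Quintd code.

```python
import csv
import io
import json

# Hypothetical record, invented for illustration (not taken from Quintd-1).
record = {"city": "Prague", "temp_c": 4.5, "wind_kph": 12}


def to_json(rec: dict) -> str:
    """Serialize the record as indented JSON."""
    return json.dumps(rec, indent=2)


def to_csv(rec: dict) -> str:
    """Serialize the record as a header row plus one data row."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rec), lineterminator="\n")
    writer.writeheader()
    writer.writerow(rec)
    return buf.getvalue()


def to_markdown(rec: dict) -> str:
    """Serialize the record as a one-row Markdown table."""
    header = "| " + " | ".join(rec) + " |"
    divider = "| " + " | ".join("---" for _ in rec) + " |"
    row = "| " + " | ".join(str(v) for v in rec.values()) + " |"
    return "\n".join([header, divider, row])


if __name__ == "__main__":
    for render in (to_json, to_csv, to_markdown):
        print(render(record), end="\n\n")
```

All three strings carry the same information; only the surface format given to the model differs.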

Methodology and Model Behavior

The paper tests three open 7B-parameter LLMs (Llama 2, Mistral, and Zephyr) on D2T tasks across these domains. The setup is deliberately simple: the same prompt template is used for every task, to check whether the models can generate outputs for unseen data with minimal prompt engineering. The models produce fluent text, but roughly 80%–91% of the outputs contain at least one semantic error. Both human annotators and a reference-free metric based on GPT-4 find errors in more than 80% of the outputs.
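The two ideas in this setup, a single prompt template and an output-level error rate, can be sketched as follows. This is an illustration under stated assumptions: the template wording, the function names, and the error counts are hypothetical, not the paper's actual prompt, code, or results.

```python
# Hypothetical template; the paper's actual prompt wording is not shown here.
PROMPT_TEMPLATE = "Data:\n{data}\n\nTask: {instruction}\n"


def build_prompt(data: str, instruction: str) -> str:
    """Fill the same template for every domain; only the arguments change."""
    return PROMPT_TEMPLATE.format(data=data, instruction=instruction)


def share_with_errors(error_counts: list[int]) -> float:
    """Fraction of outputs that have at least one annotated semantic error.

    error_counts holds one entry per generated output: the number of
    errors that an annotator (human or model-based) marked in it.
    """
    if not error_counts:
        raise ValueError("need at least one annotated output")
    flawed = sum(1 for n in error_counts if n > 0)
    return flawed / len(error_counts)


if __name__ == "__main__":
    print(build_prompt('{"city": "Prague", "temp_c": 4.5}',
                       "Write a short weather report."))
    # Made-up counts for five outputs: three contain at least one error.
    print(share_with_errors([0, 2, 1, 0, 3]))
```

The rate counts an output once no matter how many errors it contains, which matches the "at least one semantic error" wording above.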

Moving Forward with D2T Generation

The findings lead to several recommendations. First, the focus should shift from linguistic fluency to semantic accuracy, including better content selection and factual correctness. Second, efficiency deserves attention, especially for long data inputs. Finally, the research underscores the importance of reproducible and unbiased evaluation methods for future studies that use LLMs for D2T generation.

The paper supports work on better D2T systems by providing detailed observations, data, and model outputs that can help in building more reliable and accurate generation models. It also raises further considerations, such as multilinguality and the real-world application of D2T systems. Getting LLMs to perform D2T tasks without semantic errors remains an open problem.
