German Text Simplification: Finetuning Large Language Models with Semi-Synthetic Data (2402.10675v1)

Published 16 Feb 2024 in cs.CL

Abstract: This study pioneers the use of synthetically generated data for training generative models in document-level text simplification of German texts. We demonstrate the effectiveness of our approach with real-world online texts. Addressing the challenge of data scarcity in language simplification, we crawled professionally simplified German texts and synthesized a corpus using GPT-4. We finetune LLMs with up to 13 billion parameters on this data and evaluate their performance. This paper employs various methodologies for evaluation and demonstrates the limitations of currently used rule-based metrics. Both automatic and manual evaluations reveal that our models can significantly simplify real-world online texts, indicating the potential of synthetic data in improving text simplification.

Authors (4)

Lars Klöser
Mika Beele
Jan-Niklas Schagen
Bodo Kraft
