Emergent Mind

The SAMER Arabic Text Simplification Corpus

(2404.18615)
Published Apr 29, 2024 in cs.CL

Abstract

We present the SAMER Corpus, the first manually annotated Arabic parallel corpus for text simplification targeting school-aged learners. Our corpus comprises texts of 159K words selected from 15 publicly available Arabic fiction novels most of which were published between 1865 and 1955. Our corpus includes readability level annotations at both the document and word levels, as well as two simplified parallel versions for each text targeting learners at two different readability levels. We describe the corpus selection process, and outline the guidelines we followed to create the annotations and ensure their quality. Our corpus is publicly available to support and encourage research on Arabic text simplification, Arabic automatic readability assessment, and the development of Arabic pedagogical language technologies.

Overview

  • The SAMER project created a manually annotated Arabic parallel corpus for school-aged learners, drawn from 15 publicly available fiction novels, in which complex words are replaced with simpler ones while the original meaning is preserved.

  • Each text has two simplified versions targeting different readability levels, along with readability annotations at the document and word levels. The corpus is publicly available to support further text simplification research.

  • Building the corpus involved selecting historically significant texts, revising them with a purpose-built annotation tool, and employing skilled native Arabic speakers as annotators.

Understanding Arabic Text Simplification for School-aged Learners: Introducing the SAMER Corpus

The Need for Simplified Text

Text simplification makes content accessible to a diverse audience, including children, language learners, and people with cognitive disabilities. It involves rewriting a text so that the core meaning is preserved while readability improves through lexical and syntactic changes. Much of the research has concentrated on English, leaving a significant gap in resources for languages like Arabic, especially for materials that cater to specific reader needs.

Introducing the SAMER Corpus

The "Simplification of Arabic Masterpieces for Extensive Reading" (SAMER) project aims to address this gap by creating a manually annotated Arabic corpus specifically targeting school-aged learners. The corpus is derived from 15 publicly available Arabic fiction novels, most of which were published between 1865 and 1955. It focuses on lexical simplification: substituting complex words with simpler alternatives while preserving the original meaning.

Key Features of the SAMER Corpus

  • Dual Simplification Levels: Each piece in the SAMER Corpus has been simplified to two distinct readability levels—Level 4 (suitable for grades 6-8) and Level 3 (suitable for grades 4-5).
  • Rich Annotations: Beyond the two simplified versions, the corpus includes readability level annotations at both the word and document levels.
  • Public Availability: In support of further research and application, the corpus is openly accessible for use in tasks like readability assessment and automated text simplification.

Technical Aspects and Annotation Process

  1. Choosing Texts: Texts were chosen based on historical significance, readability level, and public domain status.
  2. Annotation Tools and Guidelines: Annotators worked with an add-on tool that let them view and modify text readability interactively. The tool is integrated with a lexicon that defines five readability levels based on word simplicity.
  3. Rigor in Annotation: The project employed native Arabic speakers skilled in linguistics, ensuring that simplifications were accurately aligned with intended readability improvements.
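
The lexicon-backed workflow described above can be illustrated with a minimal Python sketch. It assumes a lexicon that maps each word to a level from 1 (easiest) to 5 (hardest) and flags the words that sit above a target level, which are the candidates an annotator would consider replacing. The lexicon entries, the English placeholder words, the function names, and the rule that unknown words default to the hardest level are all illustrative assumptions, not details of the SAMER tool or lexicon.

```python
from typing import Dict, List

# Hypothetical toy lexicon: word -> readability level (1 = easiest, 5 = hardest).
# Placeholder entries for illustration only; not taken from the SAMER lexicon.
TOY_LEXICON: Dict[str, int] = {
    "the": 1,
    "boy": 1,
    "walked": 2,
    "toward": 3,
    "castle": 3,
    "citadel": 5,
}

# Assumption: words missing from the lexicon are treated as hardest.
DEFAULT_LEVEL = 5


def word_levels(tokens: List[str], lexicon: Dict[str, int]) -> List[int]:
    """Look up the readability level of each token."""
    return [lexicon.get(tok.lower(), DEFAULT_LEVEL) for tok in tokens]


def words_above_target(
    tokens: List[str], lexicon: Dict[str, int], target_level: int
) -> List[str]:
    """Return the tokens whose level exceeds the target readability level."""
    levels = word_levels(tokens, lexicon)
    return [tok for tok, lvl in zip(tokens, levels) if lvl > target_level]


if __name__ == "__main__":
    sentence = "The boy walked toward the citadel".split()
    print(word_levels(sentence, TOY_LEXICON))
    print(words_above_target(sentence, TOY_LEXICON, target_level=3))
```

In this toy setup, simplifying to Level 3 means every flagged word needs a replacement at Level 3 or below, for example swapping "citadel" for "castle".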

Corpus Statistics and Insights

  • Volume and Composition: The selected source texts total about 159K words, and each text is paired with two simplified versions.
  • Lexical Transformations: Analysis shows predominant use of one-to-one word replacements, highlighting a focused approach to simplification rather than broad textual rewrites.
  • Distribution of Readability Adjustments: Words were often lowered by one or two readability levels, demonstrating a granular approach to simplification.
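
Statistics of this kind could be gathered from a parallel pair roughly as follows. This is a hedged sketch, not the authors' analysis code: it aligns the original and simplified token sequences with Python's `difflib.SequenceMatcher`, counts one-to-one replacements separately from other edits, and tallies how many readability levels each replaced word dropped. The function name, the toy lexicon, and the default level for unknown words are assumptions made for illustration.

```python
from collections import Counter
from difflib import SequenceMatcher
from typing import Dict, List, Tuple


def replacement_stats(
    original: List[str],
    simplified: List[str],
    lexicon: Dict[str, int],
    default_level: int = 5,  # assumption: unknown words count as hardest
) -> Tuple[int, int, Counter]:
    """Count one-to-one replacements, other edits, and level drops."""
    matcher = SequenceMatcher(a=original, b=simplified, autojunk=False)
    one_to_one = 0
    other_edits = 0
    level_drops: Counter = Counter()

    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        if tag == "replace" and (i2 - i1) == (j2 - j1):
            # Equal-length replaced spans are treated as word-for-word swaps.
            for src, tgt in zip(original[i1:i2], simplified[j1:j2]):
                one_to_one += 1
                drop = lexicon.get(src, default_level) - lexicon.get(
                    tgt, default_level
                )
                level_drops[drop] += 1
        else:
            # Insertions, deletions, and unequal-length rewrites.
            other_edits += 1

    return one_to_one, other_edits, level_drops


if __name__ == "__main__":
    # Hypothetical placeholder lexicon and sentences, for illustration only.
    toy_lexicon = {"the": 1, "boy": 1, "walked": 2, "toward": 3,
                   "castle": 3, "citadel": 5}
    source = ["the", "boy", "walked", "toward", "the", "citadel"]
    target = ["the", "boy", "walked", "toward", "the", "castle"]
    print(replacement_stats(source, target, toy_lexicon))
```

Treating equal-length replaced spans as word-for-word swaps is a simplification; a real analysis would likely need a more careful word alignment.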

Practical Implications and Future Directions

The SAMER Corpus opens new avenues for developing automated tools that can adapt Arabic text content to different reader competencies. It serves as a foundational model for non-English text simplification efforts and encourages further research into domain-specific simplification strategies. Future work might explore genre expansion, incorporate syntactic simplification, and develop automated readability and simplification models tailored for Arabic.

By providing a structured approach to Arabic text simplification and a publicly available resource, this work supports advances in educational technology and promotes linguistic inclusion.
