101 Billion Arabic Words Dataset (2405.01590v1)
Abstract: In recent years, LLMs have revolutionized the field of natural language processing, showing impressive progress that remains predominantly English-centric. These advances have set a global benchmark and inspired significant efforts toward developing Arabic LLMs capable of understanding and generating Arabic with remarkable accuracy. Despite this progress, a critical challenge persists: potential bias in Arabic LLMs, primarily attributable to their reliance on datasets of English data translated into Arabic. This reliance not only compromises the authenticity of the generated content but also reflects a broader issue: the scarcity of original, high-quality Arabic linguistic data. This study aims to address the data scarcity in the Arab world and to encourage the development of Arabic LLMs that are true to both the linguistic and cultural nuances of the region. We undertook a large-scale data mining project, extracting a substantial volume of text from Common Crawl WET files and specifically targeting Arabic content. The extracted data underwent a rigorous cleaning and deduplication process, using innovative techniques to ensure the integrity and uniqueness of the dataset. The result is the 101 Billion Arabic Words Dataset, the largest Arabic dataset available to date, which can significantly contribute to the development of authentic Arabic LLMs. This study not only highlights the potential for creating linguistically and culturally accurate Arabic LLMs but also sets a precedent for future research in enhancing the authenticity of Arabic LLMs.
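The pipeline the abstract describes (extract Arabic text from Common Crawl WET records, then clean and deduplicate) can be sketched minimally as follows. This is an illustrative sketch, not the paper's actual implementation: the real filtering criteria, cleaning steps, and deduplication techniques are not specified in the abstract, so the Arabic-character-ratio threshold, the whitespace normalization, and the hash-based exact-duplicate check below are all assumptions.

```python
import hashlib
import re

# Arabic Unicode block (U+0600-U+06FF); a simplification -- the paper's
# actual language-identification method is not specified in the abstract.
ARABIC_CHAR = re.compile(r"[\u0600-\u06FF]")

def arabic_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that fall in the Arabic block."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(1 for c in chars if ARABIC_CHAR.match(c)) / len(chars)

def dedup_key(text: str) -> str:
    """Hash of whitespace-normalized text, used to drop exact duplicates."""
    normalized = " ".join(text.split())
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()

def filter_and_dedup(records, min_ratio=0.5):
    """Yield records that are mostly Arabic and not seen before.

    `records` is an iterable of document strings, e.g. the plain-text
    payloads parsed out of Common Crawl WET files. `min_ratio` is a
    hypothetical threshold, not a value from the paper.
    """
    seen = set()
    for text in records:
        if arabic_ratio(text) < min_ratio:
            continue  # not predominantly Arabic
        key = dedup_key(text)
        if key in seen:
            continue  # exact duplicate after normalization
        seen.add(key)
        yield text
```

At real scale, exact-hash deduplication like this would typically be complemented by near-duplicate detection (e.g. MinHash), but the abstract does not say which techniques the authors used.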