Multi-FAct: Assessing Factuality of Multilingual LLMs using FActScore (2402.18045v3)
Abstract: Evaluating the factuality of long-form LLM-generated text is an important challenge. Recently there has been a surge of interest in factuality evaluation for English, but little is known about the factuality evaluation of multilingual LLMs, especially when it comes to long-form generation. We introduce a simple pipeline for multilingual factuality evaluation by applying FActScore (Min et al., 2023) to diverse languages. In addition to evaluating multilingual factual generation, we evaluate the factual accuracy of long-form text generation on topics that reflect regional diversity. We also examine the feasibility of running the FActScore pipeline using non-English Wikipedia and provide comprehensive guidelines on multilingual factual evaluation for regionally diverse topics.
- mFACE: Multilingual summarization with factual consistency evaluation. arXiv preprint arXiv:2212.10622.
- MEGA: Multilingual evaluation of generative AI. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 4232–4267, Singapore. Association for Computational Linguistics.
- Amos Azaria and Tom Mitchell. 2023. The internal state of an LLM knows when it’s lying.
- A multitask, multilingual, multimodal evaluation of ChatGPT on reasoning, hallucination, and interactivity. In Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pages 675–718, Nusa Dua, Bali. Association for Computational Linguistics.
- Multilingual large language models leak human stereotypes across language boundaries.
- FacTool: Factuality detection in generative AI – a tool augmented framework for multi-task and multi-domain scenarios. arXiv preprint arXiv:2307.13528.
- Chain-of-verification reduces hallucination in large language models.
- OLMo: Accelerating the science of language models.
- A survey on automated fact-checking. Transactions of the Association for Computational Linguistics, 10:178–206.
- Ashim Gupta and Vivek Srikumar. 2021. X-fact: A new benchmark dataset for multilingual fact checking. arXiv preprint arXiv:2106.09248.
- Challenges and strategies in cross-cultural NLP. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6997–7013, Dublin, Ireland. Association for Computational Linguistics.
- Dirk Hovy and Diyi Yang. 2021. The importance of modeling social factors of language: Theory and practice. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 588–602, Online. Association for Computational Linguistics.
- Jing Huang and Diyi Yang. 2023. Culturally aware natural language inference. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 7591–7609, Singapore. Association for Computational Linguistics.
- A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions. arXiv preprint arXiv:2311.05232.
- Mistral 7B. arXiv preprint arXiv:2310.06825.
- Is ChatGPT a good translator? A preliminary study. arXiv preprint arXiv:2301.08745, 1(10).
- Better to ask in english: Cross-lingual evaluation of large language models for healthcare queries.
- Comparing hallucination detection metrics for multilingual generation. arXiv preprint arXiv:2402.10496.
- ProoFVer: Natural logic theorem proving for fact verification. Transactions of the Association for Computational Linguistics, 10:1013–1030.
- SelfCheckGPT: Zero-resource black-box hallucination detection for generative large language models.
- Large language models are geographically biased. arXiv preprint arXiv:2402.02680.
- FActScore: Fine-grained atomic evaluation of factual precision in long form text generation.
- Nonparametric masked language modeling. arXiv preprint arXiv:2212.01349.
- Global-Liar: Factuality of LLMs over time and geographic regions. arXiv preprint arXiv:2401.17839.
- Fine-grained hallucination detection and editing for language models. arXiv preprint arXiv:2401.06855.
- Dolma: an open corpus of three trillion tokens for language model pretraining research.
- FEVER: a large-scale dataset for fact extraction and VERification. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 809–819, New Orleans, Louisiana. Association for Computational Linguistics.
- Wikipedia contributors. 2024. Wikipedia:Multilingual statistics — Wikipedia, the free encyclopedia. [Online; accessed 24-February-2024].
- Siren’s song in the AI ocean: A survey on hallucination in large language models. arXiv preprint arXiv:2309.01219.
- Reasoning over semantic-level graph for fact checking. arXiv preprint arXiv:1909.03745.