MIND Your Language: A Multilingual Dataset for Cross-lingual News Recommendation (2403.17876v1)
Abstract: Digital news platforms use news recommenders as the main instrument to cater to the individual information needs of readers. Despite an increasingly language-diverse online community, in which many Internet users consume news in multiple languages, the majority of news recommendation focuses on major, resource-rich languages, and English in particular. Moreover, nearly all news recommendation efforts assume monolingual news consumption, whereas more and more users tend to consume information in at least two languages. Accordingly, the existing body of work on news recommendation suffers from a lack of publicly available multilingual benchmarks that would catalyze the development of news recommenders effective in multilingual settings and for low-resource languages. Aiming to fill this gap, we introduce xMIND, an open, multilingual news recommendation dataset derived from the English MIND dataset using machine translation, covering a set of 14 linguistically and geographically diverse languages, with digital footprints of varying sizes. Using xMIND, we systematically benchmark several state-of-the-art content-based neural news recommenders (NNRs) in both zero-shot (ZS-XLT) and few-shot (FS-XLT) cross-lingual transfer scenarios, considering both monolingual and bilingual news consumption patterns. Our findings reveal that (i) current NNRs, even when based on a multilingual LLM, suffer from substantial performance losses under ZS-XLT and that (ii) the inclusion of target-language data in FS-XLT training has limited benefits, particularly when combined with bilingual news consumption. Our findings thus warrant a broader research effort in multilingual and cross-lingual news recommendation. The xMIND dataset is available at https://github.com/andreeaiana/xMIND.
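The distinction between monolingual and bilingual news consumption can be made concrete with a small sketch. It assumes that translated news share an identifier with their English source, so a click history can be rendered in either language or in a mix of both. The function `mix_history`, the `tgt_ratio` parameter, and the toy titles (with a `[tgt]` prefix standing in for a real translation) are hypothetical illustrations, not the procedure or data used in the paper.

```python
import random


def mix_history(history, titles_src, titles_tgt, tgt_ratio, seed=0):
    """Render a click history as titles, with a share of them in the target language.

    history    -- list of news IDs clicked by one user (hypothetical format)
    titles_src -- dict mapping news ID to source-language (English) title
    titles_tgt -- dict mapping news ID to target-language title
    tgt_ratio  -- fraction of history items shown in the target language
    """
    rng = random.Random(seed)
    n_tgt = round(len(history) * tgt_ratio)
    # Pick which positions of the history are consumed in the target language.
    tgt_positions = set(rng.sample(range(len(history)), n_tgt))
    return [
        (titles_tgt if i in tgt_positions else titles_src)[news_id]
        for i, news_id in enumerate(history)
    ]


# Toy parallel titles keyed by a shared news ID (made-up examples).
titles_en = {
    "N1": "Storm hits the coast",
    "N2": "Team wins the final",
    "N3": "New phone released",
    "N4": "Markets close higher",
}
# "[tgt]" is a placeholder for a machine-translated title.
titles_tgt = {news_id: "[tgt] " + title for news_id, title in titles_en.items()}

history = ["N1", "N2", "N3", "N4"]

monolingual_src = mix_history(history, titles_en, titles_tgt, tgt_ratio=0.0)
monolingual_tgt = mix_history(history, titles_en, titles_tgt, tgt_ratio=1.0)
bilingual = mix_history(history, titles_en, titles_tgt, tgt_ratio=0.5)

print(bilingual)
```

Under these assumptions, a zero-shot setup would train a recommender only on source-language histories such as `monolingual_src` and evaluate it on `monolingual_tgt` or `bilingual`, while a few-shot setup would also include some target-language data during training.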