SentMix-3L: A Bangla-English-Hindi Code-Mixed Dataset for Sentiment Analysis (2310.18023v2)
Abstract: Code-mixing is a well-studied linguistic phenomenon in which two or more languages are mixed in text or speech. Several datasets have been built with the goal of training computational models for code-mixing. Although it is very common to observe code-mixing with multiple languages, most available datasets contain code-mixed data between only two languages. In this paper, we introduce SentMix-3L, a novel dataset for sentiment analysis containing code-mixed data between three languages: Bangla, English, and Hindi. We carry out a comprehensive evaluation using SentMix-3L. We show that zero-shot prompting with GPT-3.5 outperforms all transformer-based models on SentMix-3L.