Vyākarana: A Colorless Green Benchmark for Syntactic Evaluation in Indic Languages (2103.00854v3)

Published 1 Mar 2021 in cs.CL

Abstract: While there has been significant progress towards developing NLU resources for Indic languages, syntactic evaluation has been relatively less explored. Unlike English, Indic languages have rich morphosyntax, grammatical genders, free linear word-order, and highly inflectional morphology. In this paper, we introduce Vyākarana: a benchmark of Colorless Green sentences in Indic languages for syntactic evaluation of multilingual LLMs. The benchmark comprises four syntax-related tasks: PoS Tagging, Syntax Tree-depth Prediction, Grammatical Case Marking, and Subject-Verb Agreement. We use the datasets from the evaluation tasks to probe five multilingual LLMs of varying architectures for syntax in Indic languages. Due to its prevalence, we also include a code-switching setting in our experiments. Our results show that the token-level and sentence-level representations from the Indic LLMs (IndicBERT and MuRIL) do not capture the syntax in Indic languages as efficiently as the other highly multilingual LLMs. Further, our layer-wise probing experiments reveal that while mBERT, DistilmBERT, and XLM-R localize the syntax in middle layers, the Indic LLMs do not show such syntactic localization.

Summary

  • The paper introduces the Vyākarana benchmark that evaluates Indic language models on tasks such as PoS tagging, syntax tree-depth, grammatical case marking, and subject-verb agreement.
  • It leverages semantically nonsensical but syntactically valid 'Colorless Green' sentences to isolate and test models’ syntactic comprehension.
  • The study finds that, despite training on Indic texts, the Indic-specific models IndicBERT and MuRIL underperform the highly multilingual mBERT, DistilmBERT, and XLM-R, highlighting challenges in code-switched and complex morphosyntactic environments.

Vyākarana: A Benchmark for Syntactic Evaluation in Indic Languages

The paper “Vyākarana: A Colorless Green Benchmark for Syntactic Evaluation in Indic Languages” addresses a gap in natural language understanding (NLU) resources for Indic languages: syntactic evaluation. It builds a syntactic evaluation benchmark tailored to Indic languages from a collection of Colorless Green sentences. Unlike English, Indic languages have rich morphosyntax, grammatical genders, free linear word order, and highly inflectional morphology, so they pose challenges for multilingual LLMs that evaluations built around English do not probe.

Key Contributions

  1. Vyākarana Benchmark: The primary contribution of the paper is the introduction of the Vyākarana benchmark, specifically targeting syntactic evaluation through four linguistically rich, syntax-related tasks:
    • Part of Speech (PoS) Tagging
    • Syntax Tree-Depth Prediction
    • Grammatical Case Marking
    • Subject-Verb Agreement
  2. Colorless Green Sentences: The benchmark leverages “Colorless Green” sentences, which are syntactically valid but semantically nonsensical, ensuring that model evaluations focus purely on syntactic comprehension rather than semantic cues.
  3. Multilingual Context: The evaluation includes both monolingual (Indic languages) and code-switched datasets (Indic languages mixed with English). This recognizes the prevalence of code-switching in South Asian linguistic communities, adding another layer of complexity and realism to the evaluations.
  4. Model Evaluation: Five multilingual LLMs were probed: the Indic-specific IndicBERT and MuRIL, and the highly multilingual mBERT, DistilmBERT, and XLM-R. The probes used token-level and sentence-level representations to test how well each model captures syntactic structure in Indic languages.
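
One common way to build Colorless Green sentences is to replace each content word with a random word that shares its part-of-speech tag and morphological features, so the syntax survives while the meaning does not. The sketch below illustrates that idea only; it is not necessarily the paper's construction procedure. The lexicon, the transliterated Hindi word forms, the feature strings, and the function name `make_colorless_green` are all assumptions made for illustration.

```python
import random

# Hypothetical toy lexicon: (PoS tag, morphological features) -> interchangeable words.
# Transliterated Hindi forms chosen for illustration; not taken from the benchmark.
TOY_LEXICON = {
    ("NOUN", "Masc|Sing"): ["ladka", "ghoda", "kamra"],
    ("ADJ", "Masc|Sing"): ["hara", "bada", "purana"],
    ("VERB", "Masc|Sing"): ["sota", "daudta", "gata"],
}

# Only content words are replaced; function words keep the syntactic frame intact.
CONTENT_TAGS = {"NOUN", "ADJ", "VERB"}


def make_colorless_green(tagged_sentence, lexicon, rng):
    """Swap each content word for a random word with the same tag and features."""
    output = []
    for word, tag, feats in tagged_sentence:
        candidates = lexicon.get((tag, feats), [])
        if tag in CONTENT_TAGS and candidates:
            output.append((rng.choice(candidates), tag, feats))
        else:
            # Function words, and content words without candidates, stay as they are.
            output.append((word, tag, feats))
    return output


# "bada ladka sota hai" (roughly: "the big boy sleeps"), tagged by hand.
sentence = [
    ("bada", "ADJ", "Masc|Sing"),
    ("ladka", "NOUN", "Masc|Sing"),
    ("sota", "VERB", "Masc|Sing"),
    ("hai", "AUX", "Sing"),
]

rng = random.Random(0)
nonsense = make_colorless_green(sentence, TOY_LEXICON, rng)
print(" ".join(word for word, _, _ in nonsense))
```

Because substitutes must match both the tag and the feature string, gender and number agreement between adjective, noun, and verb is preserved, which is what lets a Subject-Verb Agreement or Case Marking label stay valid on the nonsensical sentence.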

Findings

The paper reveals several key insights about the performance of existing models on syntactic tasks in Indic languages:

  • Syntactic Localization Deficiency: Indic-specific models (IndicBERT and MuRIL) exhibit a lack of syntactic information localization compared to mBERT, XLM-R, and DistilmBERT, which localize such information in the middle layers of their architectures.
  • Performance Limitations: Despite being trained on Indic texts, IndicBERT and MuRIL underperformed compared to highly multilingual models, indicating a need for improved architecture or training regimens that better capture Indic syntactic properties.
  • Code-switching Impact: LLMs faced significant challenges in the code-switched setting, which points to a need for further research and for training on code-switched corpora so that models handle this common phenomenon more robustly.
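
The layer-wise finding above rests on a simple recipe: freeze the model, train a small classifier (a probe) on each layer's representations, and compare accuracy across layers. The sketch below shows that recipe on synthetic data only. It uses a nearest-centroid probe in place of whatever classifier the paper uses, and the per-layer signal strengths are made-up assumptions designed so that the middle layer carries the most label information; no real model activations or paper results are involved.

```python
import random


def nearest_centroid_probe(train, test):
    """Fit one centroid per label on `train`; return accuracy on `test`."""
    sums, counts = {}, {}
    for vec, label in train:
        acc = sums.setdefault(label, [0.0] * len(vec))
        for i, value in enumerate(vec):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    centroids = {lab: [s / counts[lab] for s in sums[lab]] for lab in sums}

    def predict(vec):
        # Choose the label whose centroid is closest in squared Euclidean distance.
        return min(
            centroids,
            key=lambda lab: sum((a - b) ** 2 for a, b in zip(vec, centroids[lab])),
        )

    correct = sum(predict(vec) == label for vec, label in test)
    return correct / len(test)


def synthetic_layer(signal, n, dim, rng):
    """Fake 'layer' activations: the label shifts coordinate 0 by +/- signal."""
    data = []
    for i in range(n):
        label = i % 2
        vec = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        vec[0] += signal if label else -signal
        data.append((vec, label))
    return data


rng = random.Random(0)
# Assumption for illustration: label information peaks in the middle layer.
signal_by_layer = [0.0, 0.5, 3.0, 0.5, 0.0]

accuracy_by_layer = []
for signal in signal_by_layer:
    train = synthetic_layer(signal, 200, 8, rng)
    test = synthetic_layer(signal, 200, 8, rng)
    accuracy_by_layer.append(nearest_centroid_probe(train, test))

best_layer = max(range(len(accuracy_by_layer)), key=accuracy_by_layer.__getitem__)
for layer, accuracy in enumerate(accuracy_by_layer):
    print(f"layer {layer}: probe accuracy {accuracy:.2f}")
print("most informative layer:", best_layer)
```

A model that localizes syntax shows a clear accuracy peak at some depth, as the synthetic data does here by construction; a flat accuracy curve across layers corresponds to the absence of localization reported for the Indic models.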

Implications and Future Directions

This research contributes to a greater understanding of how current LLMs perform on syntactically complex Indic languages. It underscores the necessity for more specialized approaches in training models that can handle the unique syntactic and morphosyntactic challenges presented by these languages.

Future work should explore alternative training methodologies or architectural modifications that can enhance the capture of Indic-language-specific syntactic nuances. Furthermore, expanding this benchmark to cover more languages could provide a richer dataset for refining Indic NLP tools.

Overall, this paper lays the groundwork for syntactically focused evaluation and model development in multilingual and code-switched settings, which matters for advancing natural language processing in linguistically diverse regions such as South Asia.
