
Universal Dependencies v2: An Evergrowing Multilingual Treebank Collection (2004.10643v1)

Published 22 Apr 2020 in cs.CL

Abstract: Universal Dependencies is an open community effort to create cross-linguistically consistent treebank annotation for many languages within a dependency-based lexicalist framework. The annotation consists in a linguistically motivated word segmentation; a morphological layer comprising lemmas, universal part-of-speech tags, and standardized morphological features; and a syntactic layer focusing on syntactic relations between predicates, arguments and modifiers. In this paper, we describe version 2 of the guidelines (UD v2), discuss the major changes from UD v1 to UD v2, and give an overview of the currently available treebanks for 90 languages.

Citations (490)

Summary

  • The paper introduces refined annotation guidelines and enhanced dependency representations that unify multilingual syntactic analysis.
  • The paper improves tokenization and morphological feature specification to ensure consistent syntactic annotations across diverse languages.
  • The paper expands the treebank resources and implements enhanced dependencies to boost semantic interpretation and cross-lingual applications.

Universal Dependencies v2: An Overview of Multilingual Treebank Collection Enhancements

The paper "Universal Dependencies v2: An Evergrowing Multilingual Treebank Collection" by Joakim Nivre et al. outlines significant enhancements from Universal Dependencies version 1 to version 2 (UD v2), contributing to the field of multilingual syntactic annotation. The Universal Dependencies (UD) project aims to establish a cross-linguistically consistent treebank annotation schema, facilitating research in parsing and cross-lingual learning across a multitude of languages.

Key Contributions

The core contributions of UD v2 are refined annotation guidelines, the introduction of enhanced dependency representations, and an expanded multilingual treebank collection.

  • Annotation Scheme: UD v2 enforces a syntactic structure rooted primarily in dependency relations between content words. This is achieved through adjustments in tokenization and morphological annotations, ensuring uniformity yet allowing language-specific adaptations when necessary. UD version 2 also introduces enhanced representations which capture implicit syntactic relations, benefiting downstream tasks in natural language understanding.
  • Morphological and Syntactic Annotation: UD v2 largely retains the inventory of universal part-of-speech tags while refining the use of some of them. AUX, for example, now also covers copula verbs and non-verbal TAME (tense, aspect, mood, evidentiality) particles. Syntactic annotation prioritizes predicate-argument structure built on content words rather than function words, which facilitates cross-linguistic consistency in syntactic representation.
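
To make the annotation layers concrete, the following minimal sketch reads a hand-written example sentence in CoNLL-U, the ten-column tab-separated format in which UD treebanks are distributed. The `Token` class, the helper names, and the example sentence are illustrative assumptions and are not taken from the paper; only the column order follows the CoNLL-U convention.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Token:
    """One word line of a CoNLL-U sentence (hypothetical helper class)."""
    id: str
    form: str
    lemma: str
    upos: str
    feats: Dict[str, str]
    head: Optional[int]
    deprel: str


def parse_feats(field: str) -> Dict[str, str]:
    """Turn 'Number=Sing|Person=3' into a dict; '_' means no features."""
    if field == "_":
        return {}
    return dict(pair.split("=", 1) for pair in field.split("|"))


def parse_line(line: str) -> Token:
    """Parse one CoNLL-U word line.

    Columns: ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC.
    """
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 10:
        raise ValueError(f"expected 10 columns, got {len(cols)}")
    # HEAD is '_' on multiword-token range lines, which have no syntactic head.
    head = int(cols[6]) if cols[6] != "_" else None
    return Token(cols[0], cols[1], cols[2], cols[3],
                 parse_feats(cols[5]), head, cols[7])


# Hand-written example, not drawn from any treebank.
SENTENCE = [
    "1\tThe\tthe\tDET\t_\tDefinite=Def|PronType=Art\t2\tdet\t_\t_",
    "2\tdog\tdog\tNOUN\t_\tNumber=Sing\t4\tnsubj\t_\t_",
    "3\tis\tbe\tAUX\t_\tMood=Ind|Tense=Pres\t4\tcop\t_\t_",
    "4\thappy\thappy\tADJ\t_\tDegree=Pos\t0\troot\t_\t_",
]

tokens = [parse_line(line) for line in SENTENCE]
for tok in tokens:
    print(tok.id, tok.form, tok.upos, tok.head, tok.deprel)
```

In this example the content word "happy" is the root, while the copula "is" is tagged AUX and attached to it as a dependent, which mirrors the content-word-first design described above.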

Major Changes in UD v2

Several pivotal changes in UD version 2 compared to its predecessor include:

  1. Tokenization and Word Segmentation: UD v2 relaxes the restriction on word-internal spaces. This accommodates languages such as Vietnamese, whose orthography uses spaces to separate syllables rather than words, so a multi-syllable word can be annotated as a single word instead of being split into syllable-level units that would distort the syntactic representation.
  2. Morphological Features: UD v2 expands and refines its set of universal morphological features to better represent linguistic diversity and aligns more closely with the UniMorph project.
  3. Syntactic Relations: UD v2 introduces the obl relation for oblique nominals that modify predicates at the clause level, separating them from nominal modifiers of nouns (nmod). New relations such as clf for classifiers were added, and the treatment of coordination was revised.
  4. Enhanced Dependencies: UD v2 also outlines optional enhanced dependencies designed to improve the semantic interpretation of syntactic structures. These include null nodes for elided predicates, propagation of conjuncts, explicit representation of control and raising, and enriched case information.
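
As an illustration of one enhancement listed above, propagation of conjuncts, the sketch below adds a subject edge to a coordinated predicate that has no subject of its own. It is a deliberately simplified heuristic under stated assumptions, not the procedure defined in the UD guidelines: the function name, the edge-tuple layout `(dependent, head, relation)`, and the example sentence "She sings and dances" are all hypothetical.

```python
def propagate_subjects(basic_edges):
    """Return the basic edges plus propagated subject edges.

    basic_edges: iterable of (dependent_id, head_id, deprel) tuples.
    Simplification: only 'nsubj' is propagated, and only to conjuncts
    that do not already have a subject of their own.
    """
    basic_edges = list(basic_edges)
    enhanced = set(basic_edges)

    # Map each head to the ids of its subject dependents.
    subjects = {}
    for dep, head, rel in basic_edges:
        if rel == "nsubj":
            subjects.setdefault(head, []).append(dep)

    # For every conjunct without its own subject, share the subject
    # of the first conjunct (its head in the basic tree).
    for dep, head, rel in basic_edges:
        if rel == "conj" and dep not in subjects:
            for subj in subjects.get(head, []):
                enhanced.add((subj, dep, "nsubj"))
    return enhanced


# Basic tree for "She sings and dances" (ids: She=1, sings=2, and=3, dances=4).
BASIC = [
    (1, 2, "nsubj"),  # She <- sings
    (2, 0, "root"),   # sings is the root
    (3, 4, "cc"),     # "and" attaches to the following conjunct
    (4, 2, "conj"),   # dances <- sings
]

enhanced = propagate_subjects(BASIC)
added = sorted(enhanced - set(BASIC))
print(added)  # [(1, 4, 'nsubj')]
```

The result is a graph rather than a tree, since "She" now has two heads. That is the point of the enhanced layer: it records the implicit subject of "dances" that the basic tree leaves unexpressed.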

Implications and Future Directions

The UD v2 release matters for both practical NLP applications and theoretical linguistic research. Its standardized yet adaptable framework supports the development of multilingual parsers and enables typological comparison over consistently annotated data.

Considering future trajectories, ongoing efforts are geared towards broadening the linguistic diversity of the UD treebanks and increasing the volume of data within existing language treebanks. The challenge of maintaining annotation consistency across languages while expanding the dataset remains a significant focus, as the project endeavors to capture the rich typological diversity present in global languages.

In conclusion, UD v2 constitutes a vital step in scaling multilingual NLP, paving the way for continued advancements in cross-linguistic research and applications.
