Patterns of Persistence and Diffusibility across the World's Languages (2401.01698v2)
Abstract: Language similarities can be caused by genetic relatedness, areal contact, universality, or chance. Colexification, a type of similarity in which a single lexical form conveys multiple meanings, is underexplored. In our work, we shed light on the linguistic causes of cross-lingual similarity in colexification and phonology by exploring genealogical stability (persistence) and contact-induced change (diffusibility). We construct large-scale graphs incorporating semantic, genealogical, phonological and geographical data for 1,966 languages. We then show the potential of this resource by investigating several established hypotheses from previous work in linguistics, while proposing new ones. Our results strongly support one previously established hypothesis in the linguistic literature, while offering evidence that contradicts another. Our large-scale resource opens the way for further research across disciplines, e.g. in multilingual NLP and comparative linguistics.
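To make the notion of colexification concrete, the following is a minimal sketch of how a colexification graph could be built from a lexicon and used to compare languages. It is not the authors' pipeline: the toy lexicon, the concept labels, and the function names `colexification_edges` and `jaccard_similarity` are all assumptions made for illustration, and the abstract does not say which similarity measure the paper uses.

```python
from collections import defaultdict
from itertools import combinations

# Toy lexicon (illustrative only, not the paper's data):
# language code -> word form -> set of concepts that the form expresses.
TOY_LEXICON = {
    "spa": {"dedo": {"FINGER", "TOE"}, "mano": {"HAND"}},
    "rus": {"ruka": {"HAND", "ARM"}, "palets": {"FINGER", "TOE"}},
    "eng": {"finger": {"FINGER"}, "toe": {"TOE"}, "hand": {"HAND"}, "arm": {"ARM"}},
}


def colexification_edges(lexicon):
    """Map each colexified concept pair to the set of languages attesting it."""
    edges = defaultdict(set)
    for language, forms in lexicon.items():
        for concepts in forms.values():
            # A form expressing two or more concepts colexifies every pair of them.
            for pair in combinations(sorted(concepts), 2):
                edges[pair].add(language)
    return dict(edges)


def jaccard_similarity(edges, lang_a, lang_b):
    """Jaccard overlap of the colexification patterns of two languages."""
    pairs_a = {pair for pair, langs in edges.items() if lang_a in langs}
    pairs_b = {pair for pair, langs in edges.items() if lang_b in langs}
    union = pairs_a | pairs_b
    return len(pairs_a & pairs_b) / len(union) if union else 0.0


edges = colexification_edges(TOY_LEXICON)
print(edges)
print(jaccard_similarity(edges, "spa", "rus"))
```

In this sketch, concepts are nodes and an edge links two concepts whenever some language expresses both with one form. Pairwise scores of this kind could then be set against genealogical or geographical information to ask whether shared patterns follow descent or contact, which is the kind of question the abstract raises.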
Authors: Yiyi Chen, Johannes Bjerva