Emergent Mind

Are Sounds Sound for Phylogenetic Reconstruction?

(2402.02807)
Published Feb 5, 2024 in cs.CL, cs.SD, and eess.AS

Abstract

In traditional studies on language evolution, scholars often emphasize the importance of sound laws and sound correspondences for phylogenetic inference of language family trees. However, to date, computational approaches have typically not taken this potential into account. Most computational studies still rely on lexical cognates as the major data source for phylogenetic reconstruction in linguistics, although there do exist a few studies in which authors praise the benefits of comparing words at the level of sound sequences. Building on (a) ten diverse datasets from different language families, and (b) state-of-the-art methods for automated cognate and sound correspondence detection, we test, for the first time, the performance of sound-based versus cognate-based approaches to phylogenetic reconstruction. Our results show that phylogenies reconstructed from lexical cognates are topologically closer, by approximately one third with respect to the generalized quartet distance on average, to the gold standard phylogenies than phylogenies reconstructed from sound correspondences.

Figure: Encoding of cognate words and their evolution, modeled via gain-loss processes on a phylogenetic tree.
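
As a rough illustration of what such an encoding can look like, the sketch below turns cognate judgments into binary presence/absence characters, which is the kind of data a gain-loss model operates on. It is a minimal sketch under stated assumptions, not the paper's pipeline: the function name `encode_cognates`, the language labels `L1` to `L4`, the cognate class ids, and the use of `?` for missing concepts are all hypothetical choices made for this example.

```python
def encode_cognates(cognate_sets):
    """Encode cognate judgments as binary presence/absence strings.

    cognate_sets maps language -> {concept: cognate class id}.
    Each (concept, class id) pair becomes one binary character:
    "1" if the language has a word in that class, "0" if it has a
    word for the concept in another class, "?" if the concept is
    missing for that language (an assumption of this sketch).
    """
    languages = sorted(cognate_sets)
    # One character (column) per cognate class observed in the data.
    classes = sorted(
        {(concept, cid)
         for lang in languages
         for concept, cid in cognate_sets[lang].items()}
    )
    matrix = {}
    for lang in languages:
        row = []
        for concept, cid in classes:
            if concept not in cognate_sets[lang]:
                row.append("?")
            elif cognate_sets[lang][concept] == cid:
                row.append("1")
            else:
                row.append("0")
        matrix[lang] = "".join(row)
    return classes, matrix


# Hypothetical toy data: four languages, two concepts.
toy = {
    "L1": {"hand": 1, "foot": 1},
    "L2": {"hand": 1, "foot": 1},
    "L3": {"hand": 2, "foot": 1},
    "L4": {"hand": 2},  # no word recorded for "foot"
}

classes, matrix = encode_cognates(toy)
for lang, row in matrix.items():
    print(lang, row)
```

Each column can then be read as a trait that is gained once a cognate class arises and lost when a language replaces the word.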

Overview

  • The paper investigates the effectiveness of using sound correspondences versus lexical cognates for the reconstruction of language phylogenies through computational methods.

  • State-of-the-art automated cognate detection and novel sound correspondence inference techniques are applied to ten gold standard datasets, and the resulting trees are compared to benchmark trees from anthropological linguistics.

  • Results show that cognate-based phylogenies are on average one third closer to expert-derived trees than sound-based phylogenies when measured by generalized quartet distance.

  • The study concludes cognate-based approaches are more reliable for phylogenetic reconstruction and calls for revised prior settings in Bayesian language data analysis.

Introduction

Computational techniques have considerably advanced the study of language evolution and the reconstruction of phylogenetic relationships among languages. Lexical cognates have been the mainstay of such studies, yet historical linguists have long cited sound correspondences as evidence for subgrouping languages. Which data type, lexical cognates or sound sequences, better supports the accurate reconstruction of language phylogenies remains an open question. The paper addresses it by evaluating sound-based versus cognate-based approaches to phylogenetic reconstruction on a collection of diverse language datasets.

Methodology

The research combines state-of-the-art methods for automated cognate detection with novel techniques for inferring sound correspondence patterns in multilingual datasets. These methods are applied to ten curated gold standard datasets, and the inferred phylogenetic trees are compared to benchmark trees established by the anthropological linguistic community. The paper uses two inference approaches, Bayesian inference and maximum likelihood (ML), and analyzes the cognate data and the sound correspondence data both separately and in combination. The Bayesian analyses pay particular attention to the prior on α, the parameter describing among-site rate heterogeneity. Default priors taken from molecular phylogenetics led to notable discrepancies, which prompts a critical reassessment of these priors for language data.
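
For readers unfamiliar with α, the standard gamma model of among-site rate heterogeneity used in molecular phylogenetics can be written as follows. This is the textbook formulation and is given here as background; that the paper uses exactly this parameterization is an assumption, and the symbol r for the per-character rate is introduced only for this note.

```latex
% Per-character rate r drawn from a gamma distribution with
% shape \alpha and rate \alpha, so that the mean rate is fixed at 1.
r \sim \mathrm{Gamma}(\alpha, \alpha), \qquad
f(r) = \frac{\alpha^{\alpha}}{\Gamma(\alpha)} \, r^{\alpha - 1} e^{-\alpha r},
\qquad
\mathbb{E}[r] = 1, \qquad
\operatorname{Var}[r] = \frac{1}{\alpha}.
```

Because the variance is 1/α, small α means that a few characters change very quickly while most change slowly, and large α means that all characters evolve at nearly the same rate. A prior on α therefore encodes an expectation about how uneven the rates are, which is why a prior tuned for molecular data may be a poor fit for linguistic characters.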

Results

Phylogenies reconstructed from lexical cognates were, on average, approximately one third closer in topology to the expert-derived classification trees than those inferred from sound correspondences, as measured by the generalized quartet distance (GQD). Bayesian inference and ML analyses produced consistent results, both favoring cognate-based over sound-based reconstruction, while the combined dataset did not conclusively outperform the cognate dataset. The study also reports an unusual bimodal distribution of α values (α indicates the degree of among-site rate heterogeneity), which is not typically observed in molecular datasets and suggests challenges specific to linguistic data.
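
To make the GQD more concrete, the sketch below computes it for two small trees. It follows one common formulation, assumed here rather than taken from the paper: among all four-leaf subsets that the gold standard tree resolves, count those that the inferred tree resolves differently, and divide by the number the gold standard resolves. Trees are written as nested tuples of leaf names, the function names are hypothetical, and quartets left unresolved by the inferred tree are not counted as errors in this sketch.

```python
from itertools import combinations


def clades(tree):
    """Return the leaf set under every internal node of a nested-tuple tree."""
    found = []

    def walk(node):
        if isinstance(node, str):
            return frozenset([node])
        leaves = frozenset().union(*(walk(child) for child in node))
        found.append(leaves)
        return leaves

    walk(tree)
    return found


def quartet_state(splits, quartet):
    """Return the 2|2 resolution of a quartet, or None if it is unresolved."""
    q = frozenset(quartet)
    for split in splits:
        inside = q & split
        if len(inside) == 2:
            # The clade separates two quartet members from the other two.
            return frozenset([inside, q - inside])
    return None


def gqd(inferred, gold):
    """Generalized quartet distance of `inferred` relative to `gold`."""
    gold_splits = clades(gold)
    inferred_splits = clades(inferred)
    leaves = sorted(max(gold_splits, key=len))  # the root clade holds all leaves
    resolved = different = 0
    for quartet in combinations(leaves, 4):
        gold_state = quartet_state(gold_splits, quartet)
        if gold_state is None:
            continue  # polytomies in the gold standard are not penalized
        resolved += 1
        inferred_state = quartet_state(inferred_splits, quartet)
        if inferred_state is not None and inferred_state != gold_state:
            different += 1
    return different / resolved if resolved else 0.0


# Hypothetical five-language example.
gold = ((("A", "B"), "C"), ("D", "E"))
inferred = ((("A", "C"), "B"), ("D", "E"))
print(gqd(inferred, gold))
```

The brute-force loop over all quartets grows with the fourth power of the number of languages, which is acceptable for a demonstration but not for large trees. Skipping quartets that the gold standard leaves unresolved is what allows expert classifications with polytomies to serve as references.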

Discussion

The research presents a nuanced view of the utility of computational tools for language phylogeny. It concludes that sound correspondence-based phylogenies cannot be disregarded but appear to be less reliable than cognate-based phylogenies, and it favors lexical cognates as the stronger basis for phylogenetic reconstruction. The paper also calls for further scrutiny of the priors customarily used in Bayesian analyses of language data, a methodological question with implications beyond linguistics for computational and evolutionary biology. Supplementary material accompanies the study to support reproducibility. Overall, the findings inform future work bridging computational and historical linguistics.
