A Comparison of natural (english) and artificial (esperanto) languages. A Multifractal method based analysis (0801.2510v1)

Published 16 Jan 2008 in cs.CL and physics.data-an

Abstract: We present a comparison of two English texts, written by Lewis Carroll, one (Alice in wonderland) and the other (Through a looking glass), the former translated into Esperanto, in order to observe whether natural and artificial languages significantly differ from each other. We construct one dimensional time series like signals using either word lengths or word frequencies. We use the multifractal ideas for sorting out correlations in the writings. In order to check the robustness of the methods we also write the corresponding shuffled texts. We compare characteristic functions and e.g. observe marked differences in the (far from parabolic) f(alpha) curves, differences which we attribute to Tsallis non extensive statistical features in the ''frequency time series'' and ''length time series''. The Esperanto text has more extreme values. A very rough approximation consists in modeling the texts as a random Cantor set, as if resulting from a binomial cascade of long and short words (or words and blanks). This leads to parameters characterizing the text style, and most likely in fine the author writings.

Summary

The paper demonstrates that multifractal analysis can uncover complex structural differences in word frequency and length between English and Esperanto texts.
It reveals that frequency time series expose marked variations in translation effects while length series capture consistent authorial style.
The study proposes that multifractal 'fingerprints' can inform applications in translation evaluation, author identification, and quantitative language classification.

This paper (A Comparison of natural (english) and artificial (esperanto) languages. A Multifractal method based analysis, 2008) explores the structural differences between a natural language (English) and an artificial language (Esperanto) by applying multifractal analysis techniques to written texts. The core idea is to treat a text as a one-dimensional signal and analyze its statistical properties, particularly long-range correlations and fluctuations, through the lens of fractal geometry and statistical physics. The authors use excerpts from Lewis Carroll's "Alice in Wonderland" (AWL) in both English and its Esperanto translation, and "Through a Looking Glass" (TLG) in English, to compare the characteristics of these languages and potentially the author's style.

The practical implementation involves transforming the text into numerical time series in two main ways:

Frequency Time Series (FTS): A sequence where each point represents the frequency of the word appearing at that position in the text. This requires pre-calculating word frequencies across the entire document.
Length Time Series (LTS): A sequence where each point represents the length (number of letters) of the word at that position in the text.
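
The two series described above can be sketched as follows. This is a minimal illustration, not the authors' code: the tokenizer (lowercasing, keeping only runs of letters) and the use of raw occurrence counts as the "frequency" are assumptions, and the function names are hypothetical.

```python
import re
from collections import Counter

def tokenize(text):
    # Assumption: words are runs of letters; case and punctuation are ignored.
    # [^\W\d_] matches any Unicode letter, so accented Esperanto letters are kept.
    return re.findall(r"[^\W\d_]+", text.lower())

def length_series(words):
    # LTS: number of letters of the word at each position.
    return [len(w) for w in words]

def frequency_series(words):
    # FTS: count of the word over the whole text, reported at each position.
    # The counts must be computed over the full document first.
    counts = Counter(words)
    return [counts[w] for w in words]

words = tokenize("The cat saw the other cat.")
print(length_series(words))     # [3, 3, 3, 3, 5, 3]
print(frequency_series(words))  # [2, 2, 1, 2, 1, 2]
```

Both series have one value per word, so they can be passed to the same downstream analysis.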

These time series are then analyzed using multifractal methods. A specific transformation is applied to the raw time series $y_i$ (either word frequency or length). A new series $M_i$ is created based on comparing adjacent values: $M_i = 2$ if $y_i < y_{i+1}$, $M_i = 1$ if $y_i > y_{i+1}$, and $M_i = 0$ if $y_i = y_{i+1}$. The multifractal analysis is performed on this $M_i$ series.
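
The mapping from $y_i$ to $M_i$ is simple enough to state directly in code. The sketch below follows the three cases given above; the function name `updown_series` is hypothetical.

```python
def updown_series(y):
    # Compare each value with its successor; the result has len(y) - 1 entries.
    m = []
    for current, following in zip(y, y[1:]):
        if current < following:
            m.append(2)   # next value is larger
        elif current > following:
            m.append(1)   # next value is smaller
        else:
            m.append(0)   # next value is equal
    return m

print(updown_series([3, 3, 5, 2, 4]))  # [0, 2, 1, 2]
```

Because each entry depends on a pair of neighbors, the output is one element shorter than the input series.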

The standard multifractal analysis procedure is then applied to the $M_i$ series of length $N'$. This involves:

Dividing the series into non-overlapping boxes (subseries) of size $s$.
Calculating a normalized sum of values within each box $\nu$, denoted as $P(s, \nu)$.
Computing the partition function $\chi(s, q) = \sum_{\nu} P(s, \nu)^q$ for various values of $q$.
Determining the scaling exponent $\tau(q)$ from the power-law relationship $\chi(s, q) \sim s^{\tau(q)}$ by plotting $\log(\chi(s,q))$ against $\log(s)$ and estimating the slope for each $q$.
Deriving the generalized fractal dimension $D(q) = \tau(q) / (q-1)$ (for $q \neq 1$).
Calculating the Hölder exponent $\alpha(q) = d\tau(q)/dq$ and the singularity spectrum $f(\alpha) = q\alpha(q) - \tau(q)$.
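
The first steps of this procedure (box sums, partition function, log-log slope, $D(q)$) can be sketched in plain Python. Several details here are assumptions rather than the paper's stated choices: box sums are normalized by the total over all boxes, trailing values that do not fill a box are dropped, empty boxes are skipped so that negative $q$ stays finite, and the slope is an ordinary least-squares fit over all box sizes. All function names are hypothetical.

```python
import math

def box_measure(m, s):
    # P(s, nu): sum of each non-overlapping box of size s, divided by the total.
    n_boxes = len(m) // s
    sums = [sum(m[k * s:(k + 1) * s]) for k in range(n_boxes)]
    total = sum(sums)
    return [x / total for x in sums if x > 0]  # skip empty boxes

def partition_function(m, s, q):
    # chi(s, q) = sum over boxes of P(s, nu)^q
    return sum(p ** q for p in box_measure(m, s))

def slope(xs, ys):
    # Least-squares slope of ys against xs.
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

def tau_of_q(m, q_values, box_sizes):
    # tau(q) is the slope of log chi(s, q) against log s.
    log_s = [math.log(s) for s in box_sizes]
    return [slope(log_s, [math.log(partition_function(m, s, q)) for s in box_sizes])
            for q in q_values]

def generalized_dimension(q_values, taus):
    # D(q) = tau(q) / (q - 1); undefined at q = 1, reported as NaN.
    return [t / (q - 1) if q != 1 else float("nan") for q, t in zip(q_values, taus)]

# Sanity check on a uniform series, for which tau(q) = q - 1 and D(q) = 1.
qs = [-2.0, 0.0, 2.0, 3.0]
taus = tau_of_q([1] * 1200, qs, [2, 3, 4, 5, 6, 8, 10])
print([round(t, 6) for t in taus])
print([round(d, 6) for d in generalized_dimension(qs, taus)])
```

For large $|q|$ the powers `p ** q` can overflow a float; working with logarithms of the box measures is one way around that, at the cost of slightly longer code.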

The process requires iterating through different box sizes $s$ (the paper used $s$ from 2 to 200) and a range of $q$ values (from -25 to 25), performing linear regression in log-log space to find $\tau(q)$, and then numerically differentiating $\tau(q)$ to find $\alpha(q)$ and subsequently $f(\alpha)$.
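
The numerical differentiation step can be sketched as below. Finite differences are an assumption here, since the summary does not say which differentiation scheme the paper used. The check uses a textbook binomial measure with an arbitrary weight of 0.3, for which $\tau(q) = -\log_2(p^q + (1-p)^q)$ under the convention $\chi \sim s^{\tau(q)}$; it is synthetic data, not a result from the paper.

```python
import math

def singularity_spectrum(q_values, taus):
    # alpha(q) = d tau / d q by central differences (one-sided at both ends),
    # then f(alpha) = q * alpha - tau.
    n = len(q_values)
    alpha = []
    for i in range(n):
        lo, hi = max(i - 1, 0), min(i + 1, n - 1)
        alpha.append((taus[hi] - taus[lo]) / (q_values[hi] - q_values[lo]))
    f = [q * a - t for q, a, t in zip(q_values, alpha, taus)]
    return alpha, f

# Synthetic check: binomial measure with weights p and 1 - p (p chosen arbitrarily).
p = 0.3
qs = [k / 10 for k in range(-50, 51)]
taus = [-math.log2(p ** q + (1 - p) ** q) for q in qs]
alpha, f = singularity_spectrum(qs, taus)
print(round(min(alpha), 3), round(max(alpha), 3))  # stays inside (-log2(0.7), -log2(0.3))
print(f[qs.index(0.0)])                            # top of the spectrum, at q = 0
```

The spacing of the $q$ grid controls the accuracy of $\alpha(q)$, so a finer grid is worth trying if the $f(\alpha)$ curve looks jagged.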

To assess the robustness of the method and distinguish structural properties from random chance, the authors also perform the same analysis on shuffled versions of the texts. Shuffling is applied to the word sequence before generating the time series, effectively destroying original word order dependencies.
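
A shuffled control text can be produced with a standard Fisher-Yates shuffle, which is what Python's `random.shuffle` implements. The sketch below shuffles the word list before any series is built, as described above; the fixed seed and the function name are illustrative choices, not taken from the paper.

```python
import random
from collections import Counter

def shuffled_copy(words, seed=0):
    # Return a randomly reordered copy; the original list is left untouched.
    rng = random.Random(seed)   # seeded so the control text is reproducible
    out = list(words)
    rng.shuffle(out)
    return out

words = "the cat saw the other cat".split()
control = shuffled_copy(words)
print(Counter(control) == Counter(words))  # True: same words, only the order changes
```

Since shuffling keeps the multiset of words, the distribution of word lengths and frequencies is unchanged, and any difference between original and shuffled spectra can be attributed to word order.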

Key findings from the multifractal analysis:

Both original and shuffled texts exhibit multifractal behavior, indicated by $D(q)$ and $f(\alpha)$ curves that are not just single points, suggesting non-trivial correlations.
Comparing the $D(q)$ and $f(\alpha)$ spectra reveals differences between the texts and languages.
In FTS analysis, the Esperanto translation ($AWL_{esp}$) shows marked quantitative differences from the English originals ($AWL_{eng}$, $TLG_{eng}$), particularly for negative $q$ values in $D(q)$. This suggests differences in how word frequencies are distributed or correlated over the text sequence between the languages.
In LTS analysis, $AWL_{eng}$ and $AWL_{esp}$ are quantitatively similar in their $D(q)$ and $f(\alpha)$ curves, but both differ significantly from $TLG_{eng}$. This suggests that word length sequences might be less sensitive to translation effects than frequency sequences, and might instead better capture author-specific stylistic patterns across different works.
The $f(\alpha)$ spectra are non-symmetric for all texts, even after shuffling, indicating complex, non-uniform scaling properties. The sharpness of the $f(\alpha)$ curve points to a high lack of uniformity in the distribution of word lengths/frequencies.

The authors propose a simplified physical model where the text structure is approximated by a binomial cascade involving two types of 'words' (e.g., short and long), characterized by contraction ratios ($r_i$) and weights ($w_i$). They suggest that parameters derived from this model (like $w_i$ and $r_i$) and the extremal values of the $f(\alpha)$ spectrum ($\alpha_-, \alpha_+$) can serve as a measure of text style. Furthermore, they relate $\alpha_-$ and $\alpha_+$ to the Tsallis non-extensive statistical parameter $Q$ via the formula $1/(1-Q) = 1/\alpha_+ - 1/\alpha_-$. The calculated $Q$ values vary between 4 and 7, with systematic differences between FTS and LTS, and more extreme values for the Esperanto text, suggesting different levels of complexity or "degrees of freedom" compared to the English texts within this framework.
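
Solving the relation above for $Q$ gives $Q = 1 - 1/(1/\alpha_+ - 1/\alpha_-)$. A small sketch of that rearrangement follows; the function name is hypothetical and the input values are purely illustrative, not values reported in the paper.

```python
def tsallis_q(alpha_minus, alpha_plus):
    # From 1 / (1 - Q) = 1 / alpha_plus - 1 / alpha_minus.
    inv = 1.0 / alpha_plus - 1.0 / alpha_minus
    if inv == 0.0:
        raise ValueError("alpha_minus and alpha_plus must differ")
    return 1.0 - 1.0 / inv

# Illustrative inputs only: alpha_minus = 0.5, alpha_plus = 1.0.
print(tsallis_q(0.5, 1.0))  # 2.0
```

With $\alpha_- < \alpha_+$ the right-hand side of the relation is negative, so $Q > 1$, and a narrower spectrum (extremes closer together) pushes $Q$ higher.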

Practical Implications and Implementation Considerations:

Text Style and Author Identification: The research provides a quantitative method to characterize text style and potentially identify authors based on the multifractal properties of their writings, specifically using LTS analysis which appeared less sensitive to translation. Implementing this would involve building a database of $D(q)$ and $f(\alpha)$ curves (or derived parameters like $Q$ and binomial cascade parameters) for various authors and texts, and then comparing new texts against this database.
Machine Translation Evaluation: The observed differences in FTS between source (English AWL) and translated (Esperanto AWL) texts suggest that multifractal analysis could potentially be used to evaluate the quality of machine translations by comparing the multifractal characteristics of the original and translated texts. A 'perfect' translation might aim to preserve certain multifractal properties, or perhaps a 'good' translation exhibits properties closer to natural texts in the target language. This could inform optimization goals for translation systems.
Language Characterization and Classification: The technique offers a way to quantitatively compare the structural properties of different languages, natural or artificial.
Computational Requirements: Calculating $D(q)$ and $f(\alpha)$ for a text of length $N$ involves sums over boxes of size $s$ and moments $q$. For a typical text with tens of thousands of words, this is computationally feasible but requires careful implementation of loops and calculations for a range of $s$ and $q$. Numerical precision is important, especially when dealing with potential singularities (e.g., $q=1$ for $D(q)$).
Data Preparation: Accurate text cleaning (removing non-textual elements, handling punctuation as per the paper's method) and robust tokenization are necessary first steps.
Shuffling Algorithm: A reliable shuffling method is needed to generate control texts for comparison.
Parameter Extraction: Implementing the extraction of $\alpha_-$ and $\alpha_+$ from the $f(\alpha)$ curve (typically finding the points where $f(\alpha)$ is non-zero or crosses the $\alpha$ axis) and calculating the $Q$ parameter adds complexity. Estimating binomial cascade parameters requires fitting the $f(\alpha)$ curve or solving Eq. (10) or (11) numerically.
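
As one possible reading of the parameter extraction item, the extremes $\alpha_-$ and $\alpha_+$ can be taken as the smallest and largest sampled $\alpha$ at which $f(\alpha)$ is still non-negative. This criterion, the `floor` parameter, and the function name are assumptions for illustration; the paper may use a different rule, such as extrapolating the curve to the axis. The demonstration curve is a synthetic parabola, not data from the paper.

```python
def alpha_extremes(alpha, f, floor=0.0):
    # Keep the sampled alpha values whose f(alpha) is at or above the floor,
    # and report the smallest and largest of them.
    kept = [a for a, value in zip(alpha, f) if value >= floor]
    if not kept:
        raise ValueError("no point of the spectrum lies at or above the floor")
    return min(kept), max(kept)

# Synthetic spectrum: a parabola that reaches zero near alpha = 0.5 and alpha = 1.5.
alpha = [k / 100 for k in range(0, 201)]
f = [1.0 - 4.0 * (a - 1.0) ** 2 for a in alpha]
alpha_minus, alpha_plus = alpha_extremes(alpha, f)
print(round(alpha_minus, 2), round(alpha_plus, 2))
```

The result is only as fine as the sampling of the curve, so interpolating between the last non-negative point and its neighbor would be a natural refinement.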

In summary, the paper demonstrates that multifractal analysis, applied to text transformed into time series based on word lengths or frequencies, can reveal significant structural differences between languages and potentially capture aspects of authorial style. The practical application lies in using these multifractal 'fingerprints' for tasks like author identification, translation quality assessment, and quantitative language comparison. Implementation requires standard signal processing and statistical analysis techniques applied to carefully prepared text data.
