
A Comparison of natural (english) and artificial (esperanto) languages. A Multifractal method based analysis (0801.2510v1)

Published 16 Jan 2008 in cs.CL and physics.data-an

Abstract: We present a comparison of two english texts, written by Lewis Carroll, one (Alice in wonderland) and the other (Through a looking glass), the former translated into esperanto, in order to observe whether natural and artificial languages significantly differ from each other. We construct one dimensional time series like signals using either word lengths or word frequencies. We use the multifractal ideas for sorting out correlations in the writings. In order to check the robustness of the methods we also write the corresponding shuffled texts. We compare characteristic functions and e.g. observe marked differences in the (far from parabolic) f(alpha) curves, differences which we attribute to Tsallis non extensive statistical features in the ''frequency time series'' and ''length time series''. The esperanto text has more extreme values. A very rough approximation consists in modeling the texts as a random Cantor set if resulting from a binomial cascade of long and short words (or words and blanks). This leads to parameters characterizing the text style, and most likely in fine the author writings.

Citations (2)

Summary

  • The paper demonstrates that multifractal analysis can uncover complex structural differences in word frequency and length between English and Esperanto texts.
  • It reveals that frequency time series expose marked variations in translation effects while length series capture consistent authorial style.
  • The study proposes that multifractal 'fingerprints' can inform applications in translation evaluation, author identification, and quantitative language classification.

This paper (A Comparison of natural (english) and artificial (esperanto) languages. A Multifractal method based analysis, 2008) explores the structural differences between a natural language (English) and an artificial language (Esperanto) by applying multifractal analysis techniques to written texts. The core idea is to treat a text as a one-dimensional signal and analyze its statistical properties, particularly long-range correlations and fluctuations, through the lens of fractal geometry and statistical physics. The authors use excerpts from Lewis Carroll's "Alice in Wonderland" (AWL) in both English and its Esperanto translation, and "Through a Looking Glass" (TLG) in English, to compare the characteristics of these languages and potentially the author's style.

The practical implementation involves transforming the text into numerical time series in two main ways:

  1. Frequency Time Series (FTS): A sequence where each point represents the frequency of the word appearing at that position in the text. This requires pre-calculating word frequencies across the entire document.
  2. Length Time Series (LTS): A sequence where each point represents the length (number of letters) of the word at that position in the text.
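
The two constructions above can be sketched in a few lines of Python. This is an illustration only, not the authors' code: the function names are hypothetical, the tokenizer (lowercasing and keeping runs of Unicode letters) is an assumption about text cleaning, and "frequency" is taken here to mean the raw count of a word over the whole text.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and return its words as runs of Unicode letters.

    Assumption: punctuation, digits and apostrophes act as separators.
    """
    return re.findall(r"[^\W\d_]+", text.lower())

def frequency_time_series(words):
    """FTS: at each position, the count of that word over the whole text."""
    counts = Counter(words)  # one pass to pre-compute word counts
    return [counts[w] for w in words]

def length_time_series(words):
    """LTS: at each position, the number of letters in the word."""
    return [len(w) for w in words]

words = tokenize("The cat saw the other cat.")
fts = frequency_time_series(words)
lts = length_time_series(words)
```

Because the letter pattern is Unicode-aware, accented Esperanto letters such as ĉ or ŝ are counted as single letters.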

These time series are then analyzed using multifractal methods. A specific transformation is applied to the raw time series $y_i$ (either word frequency or length). A new series $M_i$ is created by comparing adjacent values: $M_i = 2$ if $y_i < y_{i+1}$, $M_i = 1$ if $y_i > y_{i+1}$, and $M_i = 0$ if $y_i = y_{i+1}$. The multifractal analysis is performed on this $M_i$ series.
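
A minimal sketch of this adjacent-value coding, assuming the input is a plain Python list (the function name `up_down_series` is hypothetical). Each element is compared with its successor, so the output is one element shorter than the input.

```python
def up_down_series(y):
    """Code each adjacent pair: 2 if the series rises, 1 if it falls, 0 if equal."""
    m = []
    for current, following in zip(y, y[1:]):
        if current < following:
            m.append(2)
        elif current > following:
            m.append(1)
        else:
            m.append(0)
    return m

m_series = up_down_series([3, 5, 5, 2])
```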

The standard multifractal analysis procedure is then applied to the $M_i$ series of length $N'$. This involves:

  • Dividing the series into non-overlapping boxes (subseries) of size $s$.
  • Calculating a normalized sum of values within each box $\nu$, denoted as $P(s, \nu)$.
  • Computing the partition function $\chi(s, q) = \sum_{\nu} P(s, \nu)^q$ for various values of $q$.
  • Determining the scaling exponent $\tau(q)$ from the power-law relationship $\chi(s, q) \sim s^{\tau(q)}$ by plotting $\log(\chi(s, q))$ against $\log(s)$ and estimating the slope for each $q$.
  • Deriving the generalized fractal dimension $D(q) = \tau(q) / (q-1)$ (for $q \neq 1$).
  • Calculating the Hölder exponent $\alpha(q) = d\tau(q)/dq$ and the singularity spectrum $f(\alpha) = q\alpha(q) - \tau(q)$.
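
The first four steps of the list above can be sketched as follows. This is a hedged illustration, not the paper's implementation: the function names are hypothetical, the normalization (box sum divided by the sum over all boxes) is one common reading of "normalized sum", and dropping empty boxes and the trailing partial box are assumptions made so that negative $q$ stays finite.

```python
import math

def box_measures(m, s):
    """P(s, nu): box sums over non-overlapping boxes of size s, normalized to sum to 1.

    Assumptions: the trailing partial box is dropped, and boxes whose sum is
    zero are skipped so that P**q is defined for negative q.
    """
    n_boxes = len(m) // s
    sums = [sum(m[k * s:(k + 1) * s]) for k in range(n_boxes)]
    total = sum(sums)
    if total == 0:
        raise ValueError("series has no non-zero values at this box size")
    return [x / total for x in sums if x > 0]

def partition_function(m, s, q):
    """chi(s, q) = sum over boxes of P(s, nu)**q (may overflow for extreme q)."""
    return sum(p ** q for p in box_measures(m, s))

def fit_slope(xs, ys):
    """Ordinary least-squares slope of ys against xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx

def tau(m, q, box_sizes):
    """Scaling exponent: slope of log chi(s, q) against log s."""
    log_s = [math.log(s) for s in box_sizes]
    log_chi = [math.log(partition_function(m, s, q)) for s in box_sizes]
    return fit_slope(log_s, log_chi)

# Sanity check on a uniform series, where tau(q) = q - 1 is expected.
uniform = [1] * 1024
tau_2 = tau(uniform, 2, [2, 4, 8, 16, 32])
```

A uniform series is a convenient test case because every box carries the same measure, so the fitted exponent should reduce to $\tau(q) = q - 1$.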

The process requires iterating through different box sizes $s$ (the paper used $s$ from 2 to 200) and a range of $q$ values (from -25 to 25), performing linear regression in log-log space to find $\tau(q)$, and then numerically differentiating $\tau(q)$ to find $\alpha(q)$ and subsequently $f(\alpha)$.
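
The differentiation step might look like the sketch below, which takes $\tau(q)$ already sampled on an increasing grid of $q$ values. The function name is hypothetical, and the finite-difference scheme (central differences inside the grid, one-sided at the ends) is an assumption, since the summary does not say which scheme the authors used.

```python
def multifractal_spectrum(qs, taus):
    """Return (D, alpha, f) from tau sampled on an increasing grid of q values.

    alpha(q) = d tau / dq by finite differences, f = q * alpha - tau, and
    D(q) = tau / (q - 1), left as None at q = 1 where the formula is undefined.
    """
    n = len(qs)
    alpha = []
    for i in range(n):
        lo = max(i - 1, 0)       # one-sided difference at the left edge
        hi = min(i + 1, n - 1)   # one-sided difference at the right edge
        alpha.append((taus[hi] - taus[lo]) / (qs[hi] - qs[lo]))
    f = [q * a - t for q, a, t in zip(qs, alpha, taus)]
    dims = [t / (q - 1) if q != 1 else None for q, t in zip(qs, taus)]
    return dims, alpha, f

# Monofractal check: tau(q) = q - 1 should give alpha = f = D = 1 everywhere.
qs = [-2, -1, 0, 1, 2]
taus = [q - 1 for q in qs]
dims, alpha, f = multifractal_spectrum(qs, taus)
```

For a monofractal the spectrum collapses to a single point, which is why a spread of $\alpha$ values is read as a sign of multifractality.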

To assess the robustness of the method and distinguish structural properties from random chance, the authors also perform the same analysis on shuffled versions of the texts. Shuffling is applied to the word sequence before generating the time series, effectively destroying original word order dependencies.
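
A shuffled control text can be produced with the standard library, as in this sketch. The function name and the optional `seed` parameter are illustrative choices, not taken from the paper.

```python
import random

def shuffled_words(words, seed=None):
    """Return a randomly reordered copy of the word list (input left untouched)."""
    rng = random.Random(seed)  # private generator, so global state is not disturbed
    shuffled = list(words)
    rng.shuffle(shuffled)
    return shuffled

original = ["alice", "was", "beginning", "to", "get", "very", "tired"]
control = shuffled_words(original, seed=0)
```

Shuffling keeps every word and its count, so the set of frequency and length values is unchanged; only their order in the series differs.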

Key findings from the multifractal analysis:

  • Both original and shuffled texts exhibit multifractal behavior, indicated by $D(q)$ and $f(\alpha)$ curves that are not just single points, suggesting non-trivial correlations.
  • Comparing the $D(q)$ and $f(\alpha)$ spectra reveals differences between the texts and languages.
  • In FTS analysis, the Esperanto translation ($AWL_{esp}$) shows marked quantitative differences from the English originals ($AWL_{eng}$, $TLG_{eng}$), particularly for negative $q$ values in $D(q)$. This suggests differences in how word frequencies are distributed or correlated over the text sequence between the languages.
  • In LTS analysis, $AWL_{eng}$ and $AWL_{esp}$ are quantitatively similar in their $D(q)$ and $f(\alpha)$ curves, but both differ significantly from $TLG_{eng}$. This suggests that word length sequences might be less sensitive to translation effects than frequency sequences, and might instead better capture author-specific stylistic patterns across different works.
  • The $f(\alpha)$ spectra are non-symmetric for all texts, even after shuffling, indicating complex, non-uniform scaling properties. The sharpness of the $f(\alpha)$ curve points to a high lack of uniformity in the distribution of word lengths/frequencies.

The authors propose a simplified physical model where the text structure is approximated by a binomial cascade involving two types of 'words' (e.g., short and long), characterized by contraction ratios ($r_i$) and weights ($w_i$). They suggest that parameters derived from this model (like $w_i$ and $r_i$) and the extremal values of the $f(\alpha)$ spectrum ($\alpha_-$, $\alpha_+$) can serve as a measure of text style. Furthermore, they relate $\alpha_-$ and $\alpha_+$ to the Tsallis non-extensive statistical parameter $Q$ via the formula $1/(1-Q) = 1/\alpha_+ - 1/\alpha_-$. The calculated $Q$ values vary between 4 and 7, with systematic differences between FTS and LTS, and more extreme values for the Esperanto text, suggesting different levels of complexity or "degrees of freedom" compared to the English texts within this framework.
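
Solving the relation above for $Q$ gives $Q = 1 - 1/(1/\alpha_+ - 1/\alpha_-)$, which is straightforward to evaluate. In the sketch below the function name is hypothetical and the input values are made up for illustration; they are not results from the paper.

```python
def tsallis_q(alpha_minus, alpha_plus):
    """Solve 1/(1 - Q) = 1/alpha_plus - 1/alpha_minus for Q."""
    inverse_gap = 1.0 / alpha_plus - 1.0 / alpha_minus
    if inverse_gap == 0:
        raise ValueError("alpha_minus equals alpha_plus, so Q is undefined")
    return 1.0 - 1.0 / inverse_gap

# Illustrative inputs only: alpha_- = 0.5 and alpha_+ = 1.0.
q_example = tsallis_q(0.5, 1.0)
```

With these inputs the right-hand side equals $1 - 2 = -1$, so $1 - Q = -1$ and $Q = 2$.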

Practical Implications and Implementation Considerations:

  • Text Style and Author Identification: The research provides a quantitative method to characterize text style and potentially identify authors based on the multifractal properties of their writings, specifically using LTS analysis which appeared less sensitive to translation. Implementing this would involve building a database of $D(q)$ and $f(\alpha)$ curves (or derived parameters like $Q$ and binomial cascade parameters) for various authors and texts, and then comparing new texts against this database.
  • Machine Translation Evaluation: The observed differences in FTS between source (English AWL) and translated (Esperanto AWL) texts suggest that multifractal analysis could potentially be used to evaluate the quality of machine translations by comparing the multifractal characteristics of the original and translated texts. A 'perfect' translation might aim to preserve certain multifractal properties, or perhaps a 'good' translation exhibits properties closer to natural texts in the target language. This could inform optimization goals for translation systems.
  • Language Characterization and Classification: The technique offers a way to quantitatively compare the structural properties of different languages, natural or artificial.
  • Computational Requirements: Calculating $D(q)$ and $f(\alpha)$ for a text of length $N$ involves sums over boxes of size $s$ and moments $q$. For a typical text with tens of thousands of words, this is computationally feasible but requires careful implementation of loops and calculations for a range of $s$ and $q$. Numerical precision is important, especially when dealing with potential singularities (e.g., $q = 1$ for $D(q)$).
  • Data Preparation: Accurate text cleaning (removing non-textual elements, handling punctuation as per the paper's method) and robust tokenization are necessary first steps.
  • Shuffling Algorithm: A reliable shuffling method is needed to generate control texts for comparison.
  • Parameter Extraction: Implementing the extraction of $\alpha_-$ and $\alpha_+$ from the $f(\alpha)$ curve (typically finding the points where $f(\alpha)$ is non-zero or crosses the $\alpha$ axis) and calculating the $Q$ parameter adds complexity. Estimating binomial cascade parameters requires fitting the $f(\alpha)$ curve or solving Eq. (10) or (11) numerically.

In summary, the paper demonstrates that multifractal analysis, applied to text transformed into time series based on word lengths or frequencies, can reveal significant structural differences between languages and potentially capture aspects of authorial style. The practical application lies in using these multifractal 'fingerprints' for tasks like author identification, translation quality assessment, and quantitative language comparison. Implementation requires standard signal processing and statistical analysis techniques applied to carefully prepared text data.