
Infini-gram: Scaling Unbounded n-gram Language Models to a Trillion Tokens (2401.17377v4)

Published 30 Jan 2024 in cs.CL, cs.AI, and cs.IR

Abstract: Are $n$-gram language models (LMs) still relevant in this era of neural LLMs? Our answer is yes, and we showcase their values in both text analysis and improving neural LLMs. This was done by modernizing $n$-gram LMs in two aspects. First, we train them at the same data scale as neural LLMs -- 5 trillion tokens. This is the largest $n$-gram LM ever built. Second, existing $n$-gram LMs use small $n$ which hinders their performance; we instead allow $n$ to be arbitrarily large, by introducing a new $\infty$-gram LM with backoff. Instead of pre-computing $n$-gram count tables (which would be very expensive), we develop an engine named infini-gram -- powered by suffix arrays -- that can compute $\infty$-gram (as well as $n$-gram with arbitrary $n$) probabilities with millisecond-level latency. The $\infty$-gram framework and infini-gram engine enable us to conduct many novel and interesting analyses of human-written and machine-generated text: we find that the $\infty$-gram LM has fairly high accuracy for next-token prediction (47%), and can complement neural LLMs to greatly reduce their perplexity. When analyzing machine-generated text, we also observe irregularities in the machine--$\infty$-gram agreement level with respect to the suffix length, which indicates deficiencies in neural LLM pretraining and the positional embeddings of Transformers.


Summary

  • The paper introduces an $\infty$-gram LM that begins backoff from an infinitely large $n$, enabling unbounded context and improved next-token prediction accuracy.
  • It employs a suffix array to dynamically compute probabilities, achieving millisecond-latency processing across trillions of tokens.
  • When integrated with neural models, the approach reduces perplexity by up to 73%, highlighting its potential to enhance hybrid language modeling.

Infini-gram: Scaling Unbounded n-gram Language Models to a Trillion Tokens

Introduction

The paper "Infini-gram: Scaling Unbounded n-gram LLMs to a Trillion Tokens" (2401.17377) addresses the query of whether classical n-gram LMs retain relevance amidst the prevalence of neural LLMs. The authors argue affirmatively, proposing that n-gram LMs can enhance text analysis and bolster neural LLMs by modernizing them to accommodate larger data scales—specifically 5 trillion tokens—and extending n to arbitrary sizes beyond typical limits. The infini-gram engine developed for this purpose leverages suffix arrays, facilitating millisecond-latency probability computations across vast data sets.

Methodology

Modernizing n-gram LMs

The paper introduces the $\infty$-gram LM, which starts backoff from an infinitely large $n$, allowing the context length to be unbounded. Unlike traditional $n$-gram count tables, whose pre-computation becomes infeasible as $n$ increases, the $\infty$-gram LM computes probabilities dynamically, backed by a suffix array. This approach removes the constraints posed by small $n$, which limit context awareness and prediction accuracy (Figure 1).
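To make the backoff scheme concrete, here is a minimal, illustrative sketch (not the authors' implementation; the helper names and toy suffix-array construction are assumptions) of how a suffix array over a tokenized corpus supports on-the-fly $n$-gram counting and an $\infty$-gram estimate that backs off to the longest prompt suffix with a non-zero corpus count:

```python
# Illustrative sketch only (assumed helper names, not the paper's code).
# Requires Python 3.10+ for the `key` argument to the bisect functions.
from bisect import bisect_left, bisect_right

def build_suffix_array(tokens):
    """Suffix start positions sorted lexicographically.

    O(N^2 log N) for clarity; the real engine uses scalable construction.
    """
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])

def count(tokens, sa, query):
    """Count occurrences of `query` (a tuple of tokens) by binary search:
    all suffixes starting with `query` form one contiguous range in `sa`."""
    q = list(query)
    key = lambda i: tokens[i:i + len(q)]
    return bisect_right(sa, q, key=key) - bisect_left(sa, q, key=key)

def infgram_next_token_distribution(tokens, sa, prompt, vocab):
    """Back off to the longest suffix of `prompt` with a non-zero count,
    then return relative frequencies of the tokens that follow it."""
    for start in range(len(prompt) + 1):
        suffix = tuple(prompt[start:])
        denom = count(tokens, sa, suffix)  # empty suffix -> whole corpus
        if denom > 0:
            return {w: count(tokens, sa, suffix + (w,)) / denom for w in vocab}
    return {}

# Toy usage: the longest matching suffix "sat on the" is followed once by
# "mat" and once by "hat", so each gets probability 0.5.
corpus = "the cat sat on the mat the cat sat on the hat".split()
sa = build_suffix_array(corpus)
print(infgram_next_token_distribution(corpus, sa, ("sat", "on", "the"), set(corpus)))
```

The actual engine replaces this quadratic toy construction with suffix arrays built over the full token array, which is what makes trillion-token corpora and millisecond-level queries feasible.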

Figure 1: An example where a 5-gram LM gives an incorrect prediction but the $\infty$-gram LM gives the correct prediction by using the suffix of the prompt with a non-zero corpus count.

Infini-gram Engine

The infini-gram engine is designed to cope efficiently with the computational burdens of large-scale n-gram models. By employing a suffix array (the lexicographical ordering of all suffixes of a token array), infini-gram processes queries with low latency, handling trillions of tokens effectively (Figure 2).

Figure 2: Left: Suffix array for a toy string; Right: The suffix array in the infini-gram index.

Analysis and Results

Human-written Text

The paper reports high token-wise prediction accuracy with $\infty$-gram LMs when analyzing human-written text, achieving next-token prediction accuracy of 47%. This accuracy improves with longer effective $n$ values, demonstrating that context-rich predictions correlate strongly with human text production patterns (Figure 3).

Figure 3: Token-wise agreement between human-written text and n-gram/$\infty$-gram LMs.

Machine-generated Text

For machine-generated text, the $\infty$-gram LM reveals patterned variations in agreement depending on the decoding method used by the neural LM, such as greedy decoding, temperature sampling, and nucleus sampling (Figure 4). Text generated with nucleus sampling mirrors human-written text most closely, while greedy decoding shows fluctuations in agreement with respect to suffix length, hinting at deficiencies in neural LM pretraining and positional embeddings.
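As a rough, hedged illustration of this agreement analysis (reusing the toy `count` and `infgram_next_token_distribution` helpers sketched above; the bucketing by effective $n$ is an assumption about the metric, not the paper's exact code):

```python
# Sketch: token-wise agreement between a text and the toy infini-gram,
# bucketed by the effective n (longest matched suffix length + 1).
from collections import defaultdict

def longest_suffix_len(tokens, sa, prompt):
    """Length of the longest suffix of `prompt` with a non-zero corpus count."""
    for start in range(len(prompt) + 1):
        if count(tokens, sa, tuple(prompt[start:])) > 0:
            return len(prompt) - start
    return 0

def agreement_by_effective_n(tokens, sa, document, vocab):
    """Fraction of positions where the infini-gram argmax prediction matches
    the actual next token, grouped by effective n."""
    hits, totals = defaultdict(int), defaultdict(int)
    for t in range(1, len(document)):
        prompt = tuple(document[:t])
        n_eff = longest_suffix_len(tokens, sa, prompt) + 1
        dist = infgram_next_token_distribution(tokens, sa, prompt, vocab)
        predicted = max(dist, key=dist.get) if dist else None
        totals[n_eff] += 1
        hits[n_eff] += int(predicted == document[t])
    return {n: hits[n] / totals[n] for n in sorted(totals)}
```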

Figure 4: Token-wise agreement between machine-generated text and the $\infty$-gram LM.

Complementing Neural LMs

The paper demonstrates that $\infty$-gram estimates can significantly reduce neural LMs' perplexity, noting up to a 73% perplexity reduction when $\infty$-gram probabilities are interpolated with neural estimates. This showcases the potential of hybrid models for predictive applications over large-scale corpora (Figure 5).

Figure 5: n-gram/$\infty$-gram queries on the training data, supported by a suffix array.
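Returning to the interpolation with neural estimates described above, here is a minimal sketch, assuming both models expose next-token distributions as dictionaries and that the mixing weight is a tunable hyperparameter (the paper's exact combination scheme may differ):

```python
# Hedged sketch: linearly interpolate a neural LM's next-token distribution
# with the infini-gram estimate. `lam` is an assumed tunable mixing weight.
def interpolate(p_neural, p_infgram, lam=0.5):
    """Return the mixture lam * p_infgram + (1 - lam) * p_neural."""
    vocab = set(p_neural) | set(p_infgram)
    return {w: lam * p_infgram.get(w, 0.0) + (1.0 - lam) * p_neural.get(w, 0.0)
            for w in vocab}
```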

Implications and Future Work

Practical Implications

The findings suggest that modernized n-gram frameworks, like the infini-gram engine, can be pivotal for processing voluminous data in contexts where next-token prediction matters, while keeping storage and computational costs manageable. They also point to potential enhancements in neural model training by assimilating dense $\infty$-gram computations into hybrid LMs.

Theoretical Implications

Theoretically, the introduction of unbounded n-gram methodologies reinvigorates classical statistical models, presenting opportunities for deeper analyses in language modeling, in combinatory contexts for language generation, and in understanding linguistic patterns across human- and machine-generated texts.

Future Developments

The trajectory for future research includes refining $\infty$-gram LMs for general text generation tasks, further integrating them with neural networks for real-time processing efficiencies, and potentially extending the system to broader applications in NLP and AI, including robust document retrieval and improved memory-augmented frameworks.

Conclusion

In conclusion, the "Infini-gram: Scaling Unbounded n-gram Language Models to a Trillion Tokens" paper validates the ongoing relevance of n-gram models within AI when they are enriched by contemporary methodologies. It serves as a cornerstone for future explorations of scalable, efficient statistical modeling over vast corpora.
