
Infini-gram: Scaling Unbounded n-gram Language Models to a Trillion Tokens (2401.17377v4)

Published 30 Jan 2024 in cs.CL, cs.AI, and cs.IR

Abstract: Are $n$-gram language models (LMs) still relevant in this era of neural LLMs? Our answer is yes, and we showcase their values in both text analysis and improving neural LLMs. This was done by modernizing $n$-gram LMs in two aspects. First, we train them at the same data scale as neural LLMs -- 5 trillion tokens. This is the largest $n$-gram LM ever built. Second, existing $n$-gram LMs use small $n$ which hinders their performance; we instead allow $n$ to be arbitrarily large, by introducing a new $\infty$-gram LM with backoff. Instead of pre-computing $n$-gram count tables (which would be very expensive), we develop an engine named infini-gram -- powered by suffix arrays -- that can compute $\infty$-gram (as well as $n$-gram with arbitrary $n$) probabilities with millisecond-level latency. The $\infty$-gram framework and infini-gram engine enable us to conduct many novel and interesting analyses of human-written and machine-generated text: we find that the $\infty$-gram LM has fairly high accuracy for next-token prediction (47%), and can complement neural LLMs to greatly reduce their perplexity. When analyzing machine-generated text, we also observe irregularities in the machine--$\infty$-gram agreement level with respect to the suffix length, which indicates deficiencies in neural LLM pretraining and the positional embeddings of Transformers.


Summary

  • The paper introduces an $\infty$-gram LM that begins backoff from an infinitely large $n$, enabling unbounded context and improved next-token prediction accuracy.
  • It employs a suffix array to dynamically compute probabilities, achieving millisecond-latency processing across trillions of tokens.
  • When integrated with neural models, the approach reduces perplexity by up to 73%, highlighting its potential to enhance hybrid language modeling.

Infini-gram: Scaling Unbounded n-gram Language Models to a Trillion Tokens

Introduction

The paper "Infini-gram: Scaling Unbounded n-gram LLMs to a Trillion Tokens" (2401.17377) addresses the query of whether classical n-gram LMs retain relevance amidst the prevalence of neural LLMs. The authors argue affirmatively, proposing that n-gram LMs can enhance text analysis and bolster neural LLMs by modernizing them to accommodate larger data scales—specifically 5 trillion tokens—and extending n to arbitrary sizes beyond typical limits. The infini-gram engine developed for this purpose leverages suffix arrays, facilitating millisecond-latency probability computations across vast data sets.

Methodology

Modernizing n-gram LMs

The paper introduces the $\infty$-gram LM, which starts backoff from an infinitely large $n$, allowing the context length to be unbounded. Unlike traditional $n$-gram count tables, whose pre-computation becomes infeasible as $n$ increases, the $\infty$-gram LM computes probabilities dynamically, backed by a suffix array. This approach removes the constraints posed by small $n$, which limit context awareness and prediction accuracy (Figure 1).
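To make the backoff scheme concrete, here is a minimal, illustrative sketch (not the authors' implementation; the helper names and toy suffix-array construction are assumptions) of how a suffix array over a tokenized corpus supports on-the-fly $n$-gram counting and an $\infty$-gram estimate that backs off to the longest prompt suffix with a non-zero corpus count:

```python
# Illustrative sketch only (assumed helper names, not the paper's code).
# Requires Python 3.10+ for the `key` argument to the bisect functions.
from bisect import bisect_left, bisect_right

def build_suffix_array(tokens):
    """Suffix start positions sorted lexicographically.

    O(N^2 log N) for clarity; the real engine uses scalable construction.
    """
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])

def count(tokens, sa, query):
    """Count occurrences of `query` (a tuple of tokens) by binary search:
    all suffixes starting with `query` form one contiguous range in `sa`."""
    q = list(query)
    key = lambda i: tokens[i:i + len(q)]
    return bisect_right(sa, q, key=key) - bisect_left(sa, q, key=key)

def infgram_next_token_distribution(tokens, sa, prompt, vocab):
    """Back off to the longest suffix of `prompt` with a non-zero count,
    then return relative frequencies of the tokens that follow it."""
    for start in range(len(prompt) + 1):
        suffix = tuple(prompt[start:])
        denom = count(tokens, sa, suffix)  # empty suffix -> whole corpus
        if denom > 0:
            return {w: count(tokens, sa, suffix + (w,)) / denom for w in vocab}
    return {}

# Toy usage: the longest matching suffix "sat on the" is followed once by
# "mat" and once by "hat", so each gets probability 0.5.
corpus = "the cat sat on the mat the cat sat on the hat".split()
sa = build_suffix_array(corpus)
print(infgram_next_token_distribution(corpus, sa, ("sat", "on", "the"), set(corpus)))
```

The actual engine replaces this quadratic toy construction with suffix arrays built over the full token array, which is what makes trillion-token corpora and millisecond-level queries feasible.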

Figure 1: An example where a 5-gram LM gives an incorrect prediction but the $\infty$-gram LM gives the correct prediction by using the suffix of the prompt with a non-zero corpus count.

Infini-gram Engine

The infini-gram engine is designed to cope efficiently with the computational burdens of large-scale n-gram models. By employing a suffix array (the lexicographical ordering of all suffixes of a token array), infini-gram processes queries with low latency, handling trillions of tokens effectively (Figure 2).

Figure 2: Left: Suffix array for a toy string; Right: The suffix array in the infini-gram index.

Analysis and Results

Human-written Text

The paper reports high token-wise prediction accuracy with $\infty$-gram LMs when analyzing human-written text, achieving next-token prediction accuracy of 47%. This accuracy improves with longer effective $n$ values, demonstrating that context-rich predictions correlate strongly with human text production patterns (Figure 3).

Figure 3: Token-wise agreement between human-written text and n-gram/$\infty$-gram LMs.

Machine-generated Text

For machine-generated text, the $\infty$-gram LM reveals patterned variations in agreement depending on the decoding method used by the neural LM, such as greedy decoding, temperature sampling, and nucleus sampling (Figure 4). Text generated with nucleus sampling mirrors human-written text most closely, while greedy decoding shows fluctuations in agreement with respect to suffix length, hinting at deficiencies in neural LM pretraining and positional embeddings.
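As a rough, hedged illustration of this agreement analysis (reusing the toy `count` and `infgram_next_token_distribution` helpers sketched above; the bucketing by effective $n$ is an assumption about the metric, not the paper's exact code):

```python
# Sketch: token-wise agreement between a text and the toy infini-gram,
# bucketed by the effective n (longest matched suffix length + 1).
from collections import defaultdict

def longest_suffix_len(tokens, sa, prompt):
    """Length of the longest suffix of `prompt` with a non-zero corpus count."""
    for start in range(len(prompt) + 1):
        if count(tokens, sa, tuple(prompt[start:])) > 0:
            return len(prompt) - start
    return 0

def agreement_by_effective_n(tokens, sa, document, vocab):
    """Fraction of positions where the infini-gram argmax prediction matches
    the actual next token, grouped by effective n."""
    hits, totals = defaultdict(int), defaultdict(int)
    for t in range(1, len(document)):
        prompt = tuple(document[:t])
        n_eff = longest_suffix_len(tokens, sa, prompt) + 1
        dist = infgram_next_token_distribution(tokens, sa, prompt, vocab)
        predicted = max(dist, key=dist.get) if dist else None
        totals[n_eff] += 1
        hits[n_eff] += int(predicted == document[t])
    return {n: hits[n] / totals[n] for n in sorted(totals)}
```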

Figure 4: Token-wise agreement between machine-generated text and the $\infty$-gram LM.

Complementing Neural LMs

The paper demonstrates that $\infty$-gram estimates can significantly reduce neural LMs' perplexity, noting up to a 73% perplexity reduction when $\infty$-gram probabilities are interpolated with neural estimates. This showcases the potential of hybrid models for predictive applications over large-scale corpora (Figure 5).

Figure 5: n-gram/$\infty$-gram queries on the training data, supported by a suffix array.
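Returning to the interpolation with neural estimates described above, here is a minimal sketch, assuming both models expose next-token distributions as dictionaries and that the mixing weight is a tunable hyperparameter (the paper's exact combination scheme may differ):

```python
# Hedged sketch: linearly interpolate a neural LM's next-token distribution
# with the infini-gram estimate. `lam` is an assumed tunable mixing weight.
def interpolate(p_neural, p_infgram, lam=0.5):
    """Return the mixture lam * p_infgram + (1 - lam) * p_neural."""
    vocab = set(p_neural) | set(p_infgram)
    return {w: lam * p_infgram.get(w, 0.0) + (1.0 - lam) * p_neural.get(w, 0.0)
            for w in vocab}
```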

Implications and Future Work

Practical Implications

The findings suggest that modernized n-gram frameworks, like the infini-gram engine, can be pivotal for processing voluminous data in contexts where next-token prediction matters, while keeping storage and computational costs manageable. They also point to potential enhancements in neural model training by assimilating dense $\infty$-gram computations into hybrid LMs.

Theoretical Implications

Theoretically, the introduction of unbounded n-gram methodologies reinvigorates classical statistical models, presenting opportunities for deeper analyses in language modeling, in combinatory contexts for language generation, and in understanding linguistic patterns across human- and machine-generated texts.

Future Developments

The trajectory for future research includes refining $\infty$-gram LMs for general text generation tasks, further integrating them with neural networks for real-time processing efficiencies, and potentially extending the system to broader applications in NLP and AI, including robust document retrieval and improved memory-augmented frameworks.

Conclusion

In conclusion, the "Infini-gram: Scaling Unbounded n-gram Language Models to a Trillion Tokens" paper validates the ongoing relevance of n-gram models within AI when they are enriched by contemporary methodologies. It serves as a cornerstone for future explorations of scalable, efficient statistical modeling over vast corpora.
