Temporal Scaling Law for Large Language Models

Published 27 Apr 2024 in cs.CL | (2404.17785v3)

Abstract: Recently, LLMs have been widely adopted in a wide range of tasks, leading to increasing attention towards the research on how scaling LLMs affects their performance. Existing works, termed Scaling Laws, have discovered that the final test loss of LLMs scales as power-laws with model size, computational budget, and dataset size. However, the temporal change of the test loss of an LLM throughout its pre-training process remains unexplored, though it is valuable in many aspects, such as selecting better hyperparameters directly on the target LLM. In this paper, we propose the novel concept of Temporal Scaling Law, studying how the test loss of an LLM evolves as the training steps scale up. In contrast to modeling the test loss as a whole in a coarse-grained manner, we break it down and dive into the fine-grained test loss of each token position, and further develop a dynamic hyperbolic-law. Afterwards, we derive the much more precise temporal scaling law by studying the temporal patterns of the parameters in the dynamic hyperbolic-law. Results on both in-distribution (ID) and out-of-distribution (OOD) validation datasets demonstrate that our temporal scaling law accurately predicts the test loss of LLMs across training steps. Our temporal scaling law has broad practical applications. First, it enables direct and efficient hyperparameter selection on the target LLM, such as data mixture proportions. Secondly, viewing the LLM pre-training dynamics from the token position granularity provides some insights to enhance the understanding of LLM pre-training.


Authors (11)

Citations (5)


Summary

The paper introduces the Temporal Scaling Law, a novel framework describing how loss at different token positions evolves following a reciprocal-law throughout Large Language Model training.
The study tracks how the reciprocal-law parameters change over time, revealing distinct patterns related to trained tokens and the learning rate decay schedule.
This law allows for accurate prediction of future test loss and indicates that models learn uniformly across token positions after initial training phases, validated by experiments on various datasets.

The paper "Temporal Scaling Law for Large Language Models" studies how the test loss of an LLM behaves throughout the training process, introducing the concept of the Temporal Scaling Law. Unlike traditional scaling laws, which relate the final loss to static quantities such as model size and dataset size, this work investigates the temporal dimension of training, specifically how loss evolves over time across different token positions in sequences.

Key contributions and findings of the paper include:

Temporal Scaling Law: The researchers propose that, during the training of decoder-based generative LLMs, the loss at different token positions follows a reciprocal-law, which is consistent across various scales and training stages. This reciprocal-law can be mathematically captured by:

$\mathcal{L}_i = \frac{a_0}{1 + a_1(i-1)} + a_2$

where:
- $\mathcal{L}_i$: loss at token position $i$.
- $a_0$, $a_1$, $a_2$: empirically derived parameters that depend on model scale and training time.
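To make the formula concrete, here is a minimal Python sketch that evaluates the reciprocal-law and recovers its three parameters from per-position losses. The fitting procedure is an assumption for illustration, not the authors' method: it grid-searches $a_1$ and, for each candidate, solves for $a_0$ and $a_2$ by linear least squares (the model is linear in those two once $a_1$ is fixed). The function names, the grid, and the synthetic data are all hypothetical.

```python
import numpy as np

def reciprocal_law(i, a0, a1, a2):
    """Loss at 1-indexed token position i: a0 / (1 + a1 * (i - 1)) + a2."""
    i = np.asarray(i, dtype=float)
    return a0 / (1.0 + a1 * (i - 1.0)) + a2

def fit_reciprocal_law(positions, losses, a1_grid):
    """Illustrative fit (an assumption, not the paper's procedure).

    For each candidate a1, the model is linear in (a0, a2), so those two
    are solved by least squares; the a1 with the lowest squared error wins.
    """
    positions = np.asarray(positions, dtype=float)
    losses = np.asarray(losses, dtype=float)
    best = None
    for a1 in a1_grid:
        basis = 1.0 / (1.0 + a1 * (positions - 1.0))
        design = np.column_stack([basis, np.ones_like(basis)])
        coef, *_ = np.linalg.lstsq(design, losses, rcond=None)
        sse = float(np.sum((design @ coef - losses) ** 2))
        if best is None or sse < best[0]:
            best = (sse, float(coef[0]), float(a1), float(coef[1]))
    _, a0, a1, a2 = best
    return a0, a1, a2

# Synthetic per-position losses generated from made-up parameters.
positions = np.arange(1, 1025)
true_params = (4.0, 0.05, 2.5)
losses = reciprocal_law(positions, *true_params)

a1_grid = np.linspace(0.01, 0.2, 20)  # hypothetical search grid for a1
fitted = fit_reciprocal_law(positions, losses, a1_grid)
```

Under this form, $a_0 + a_2$ is the loss at the first token position and $a_2$ is the value the loss approaches at long context, which is one way to read the roles of the parameters.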

Temporal Patterns:

The evolution of the parameters $a_0$, $a_1$, and $a_2$ is tracked over the course of training. The study finds that:
- Prior to a specific threshold in training, $a_0$ follows a logarithmic relationship, and $a_1$ a reciprocal relationship, with the number of trained tokens.
- Parameter $a_2$ closely follows the learning rate decay pattern, suggesting a strong correlation with the learning rate schedule.
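The summary does not give the exact functional forms, so the following Python sketch assumes the simplest ones consistent with the description: $a_0(N) = \alpha \log N + \beta$ and $a_1(N) = \gamma / N + \delta$, where $N$ is the number of trained tokens. The coefficient names, the helper functions, and the synthetic checkpoints are hypothetical; the dependence of $a_2$ on the learning rate schedule is not modeled here.

```python
import numpy as np

def fit_log_trend(tokens, values):
    """Fit values ~ slope * log(tokens) + intercept (assumed form for a0)."""
    slope, intercept = np.polyfit(np.log(tokens), values, 1)
    return float(slope), float(intercept)

def fit_reciprocal_trend(tokens, values):
    """Fit values ~ slope / tokens + intercept (assumed form for a1)."""
    slope, intercept = np.polyfit(1.0 / tokens, values, 1)
    return float(slope), float(intercept)

# Synthetic checkpoints: trained-token counts and made-up parameter values.
tokens = np.array([1e9, 2e9, 4e9, 8e9, 16e9])
a0_obs = 0.3 * np.log(tokens) + 1.0
a1_obs = 2e7 / tokens + 0.02

alpha, beta = fit_log_trend(tokens, a0_obs)
gamma, delta = fit_reciprocal_trend(tokens, a1_obs)

# Extrapolate both parameters to a later point in training.
future_tokens = 32e9
a0_future = alpha * np.log(future_tokens) + beta
a1_future = gamma / future_tokens + delta
```

Both fits reduce to ordinary linear regression after transforming the token count, which is why `np.polyfit` with degree 1 suffices in this sketch.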

Loss Prediction: The established temporal scaling patterns allow for the prediction of future test loss, using measurements from the early part of training to forecast the rest of the training trajectory. Empirical tests indicate a marked improvement in predictive accuracy over baseline approaches such as those using exponential functions or simple reciprocal laws.
Uniform Learning Across Tokens: Despite initial disparities in loss across token positions, the study observes that LLMs tend to learn uniformly across all token positions after an initial phase of training. This suggests that the existing paradigm of averaging losses across tokens without position-based weighting is a robust approach to LLM training, as re-weighting strategies do not confer performance benefits.
Verification through Experiments: The research uses two primary datasets: an in-distribution (ID) dataset from the Pile and an out-of-distribution (OOD) dataset from PG-19. Models at various scales are trained to validate the reciprocal-law and to predict future test loss using the derived temporal scaling law.
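As a rough illustration of how per-position parameters translate into a single test-loss number, the Python sketch below averages the reciprocal-law over all token positions in a sequence. It assumes the overall test loss is the unweighted mean of the per-position losses, in line with the position-uniform averaging discussed above; the function name and the example values are hypothetical.

```python
import numpy as np

def mean_loss_from_law(a0, a1, a2, seq_len):
    """Unweighted mean of a0 / (1 + a1 * (i - 1)) + a2 over i = 1..seq_len."""
    i = np.arange(1, seq_len + 1, dtype=float)
    per_position = a0 / (1.0 + a1 * (i - 1.0)) + a2
    return float(per_position.mean())

# Made-up parameter values for a single training step.
predicted_loss = mean_loss_from_law(a0=4.0, a1=0.05, a2=2.5, seq_len=1024)
```

Feeding extrapolated values of $a_0$, $a_1$, and $a_2$ for a later training step into such a function would give a predicted test loss for that step, under the stated averaging assumption.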

The study extends the conventional understanding of scaling laws by introducing a temporal component, enabling a finer-grained analysis of LLM training dynamics that could inform more efficient resource utilization and training strategy development in large-scale model pre-training.
