Are Language Models Actually Useful for Time Series Forecasting?

Published Jun 22, 2024 in cs.LG and cs.AI


Large language models (LLMs) are being applied to time series tasks, particularly time series forecasting. However, are language models actually useful for time series? After a series of ablation studies on three recent and popular LLM-based time series forecasting methods, we find that removing the LLM component or replacing it with a basic attention layer does not degrade the forecasting results; in most cases the results even improved. We also find that despite their significant computational cost, pretrained LLMs do no better than models trained from scratch, do not represent the sequential dependencies in time series, and do not assist in few-shot settings. Additionally, we explore time series encoders and reveal that patching and attention structures perform similarly to state-of-the-art LLM-based forecasters.

Ablation studies show that time series forecasting performance improves or remains stable without the LLM.


  • The paper critically examines the effectiveness of LLMs for time series forecasting, showing that simpler models often perform better or comparably with significantly lower computational costs.

  • Through detailed ablation studies and performance evaluations, the paper demonstrates that LLM-based methods for time series forecasting do not provide substantial benefits over simpler models like multi-head attention layers and transformer blocks.

  • The research highlights that pretraining LLMs on textual data does not offer significant advantages for time series tasks, and suggests a reevaluation of their application in this context, emphasizing efficient alternatives.

Evaluating the Utility of LLMs in Time Series Forecasting Tasks

The paper "Are Language Models Actually Useful for Time Series Forecasting?" investigates whether leveraging LLMs for time series forecasting is worthwhile. Despite the growing trend of applying LLMs to time series tasks, the study presents a series of ablation and comparative analyses suggesting that the added complexity of such models may not yield commensurate improvements in performance and may be inefficient in terms of computational cost.

Key Findings

Performance of LLM-based Methods vs. Ablated Versions

The study evaluates three recent state-of-the-art LLM-based methods for time series forecasting: OneFitsAll, Time-LLM, and LLaTA. Each method is subjected to three ablation scenarios: removing the LLM component entirely, replacing the LLM with a multi-head attention layer, and replacing the LLM with a simple transformer block. The results consistently show that these ablated models perform comparably to or better than their LLM-based counterparts.
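The three ablation scenarios can be pictured as swapping the backbone that sits between the input encoder and the forecasting head. The following is a minimal NumPy sketch under stated assumptions, not the authors' implementation: the names `no_llm`, `attention`, and `transformer` are hypothetical labels, the weights are random and untrained, the encoded tokens are random stand-ins, and single-head attention is used in place of multi-head attention for brevity.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    # Normalize each token vector to zero mean and unit variance.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def self_attention(x, w_q, w_k, w_v):
    # Single-head scaled dot-product self-attention over (n_tokens, d_model).
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    return softmax(scores) @ v

def transformer_block(x, w_q, w_k, w_v, w_ff1, w_ff2):
    # Attention and a ReLU feed-forward layer, each with residual + layer norm.
    x = layer_norm(x + self_attention(x, w_q, w_k, w_v))
    return layer_norm(x + np.maximum(x @ w_ff1, 0.0) @ w_ff2)

def forecast(tokens, backbone, w_head):
    # Shared pipeline: encoded tokens -> backbone -> flatten -> linear head.
    hidden = backbone(tokens)
    return hidden.reshape(-1) @ w_head

rng = np.random.default_rng(0)
n_tokens, d_model, horizon = 8, 16, 4            # illustrative sizes (assumed)
tokens = rng.normal(size=(n_tokens, d_model))    # stand-in for encoded inputs
w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) / np.sqrt(d_model)
                 for _ in range(3))
w_ff1 = rng.normal(size=(d_model, 4 * d_model)) / np.sqrt(d_model)
w_ff2 = rng.normal(size=(4 * d_model, d_model)) / np.sqrt(4 * d_model)
w_head = rng.normal(size=(n_tokens * d_model, horizon))

# Each ablation replaces the slot where the LLM would normally sit.
backbones = {
    "no_llm": lambda x: x,                                    # LLM removed
    "attention": lambda x: self_attention(x, w_q, w_k, w_v),  # attention layer
    "transformer": lambda x: transformer_block(
        x, w_q, w_k, w_v, w_ff1, w_ff2),                      # transformer block
}
forecasts = {name: forecast(tokens, fn, w_head) for name, fn in backbones.items()}
for name, y in forecasts.items():
    print(name, y.shape)
```

Because only the backbone changes, any difference in forecasting error between the variants can be attributed to that component rather than to the encoder or the head.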

For instance, ablations outperformed Time-LLM, LLaTA, and OneFitsAll in 26/26, 22/26, and 19/26 cases, respectively, across various performance metrics and datasets. Notably, the 95% confidence intervals of the simplified and LLM-based models overlap, so the results give no statistical evidence that LLMs provide substantial benefits for these tasks.
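One way to check whether two models' errors are distinguishable is to compute a 95% confidence interval for each and see whether the intervals overlap. The sketch below uses a percentile bootstrap over per-run errors; this is an assumed procedure chosen for illustration, not necessarily the one used in the paper, and the error values are made-up placeholders rather than reported results.

```python
import random
import statistics

def bootstrap_ci(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)
    means = sorted(
        statistics.fmean(rng.choices(values, k=len(values)))
        for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper

def intervals_overlap(a, b):
    """True if closed intervals a = (lo, hi) and b = (lo, hi) intersect."""
    return a[0] <= b[1] and b[0] <= a[1]

# Placeholder per-seed MAE values (illustrative only, not from the paper).
llm_mae = [0.40, 0.42, 0.41, 0.43, 0.39]
ablation_mae = [0.40, 0.41, 0.42, 0.40, 0.41]

llm_ci = bootstrap_ci(llm_mae)
ablation_ci = bootstrap_ci(ablation_mae)
print("LLM CI:", llm_ci)
print("Ablation CI:", ablation_ci)
print("Overlap:", intervals_overlap(llm_ci, ablation_ci))
```

Overlapping intervals are a conservative signal that the two models cannot be separated on the available runs; they do not by themselves prove the models are equivalent.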

Computational Cost

The computational overhead of LLMs is substantial. Time-LLM, with 6642 million parameters, significantly increases both training and inference times. The evaluation indicates that simpler models can reduce training time by up to three orders of magnitude while maintaining or improving forecasting performance. The ablated models are typically faster and more efficient than their LLM-based versions.

Contributions of Pretraining and Sequential Dependencies

A significant thrust of the analysis involves understanding whether pretraining LLMs on textual data can benefit time series forecasting. Results reveal that randomly initialized LLMs perform on par with pretrained ones, suggesting that pretraining on textual corpora does not confer a distinct advantage for time series tasks. Furthermore, evaluations involving shuffled and masked input sequences show that LLM-based models do not effectively capture sequential dependencies beyond what non-LLM models achieve.
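The shuffling and masking experiments can be sketched as simple input perturbations: if a model truly relies on temporal order, its error should rise sharply when the order is destroyed. The helpers below are hypothetical illustrations (the function names, the 40% mask ratio, and the zero fill value are assumptions, not details taken from the paper).

```python
import random

def shuffle_all(series, seed=0):
    """Return a copy of the series with all time steps randomly permuted."""
    rng = random.Random(seed)
    out = list(series)
    rng.shuffle(out)
    return out

def mask_random(series, ratio=0.4, fill=0.0, seed=0):
    """Replace a random fraction `ratio` of time steps with `fill`."""
    rng = random.Random(seed)
    n_mask = int(len(series) * ratio)
    masked_idx = set(rng.sample(range(len(series)), n_mask))
    return [fill if i in masked_idx else x for i, x in enumerate(series)]

def relative_degradation(err_clean, err_perturbed):
    """Fractional increase in error caused by a perturbation."""
    return (err_perturbed - err_clean) / err_clean

series = [float(i) for i in range(1, 11)]
print(shuffle_all(series))
print(mask_random(series))
```

Comparing `relative_degradation` for an LLM-based model and its ablation under the same perturbation indicates whether the LLM is any more sensitive to temporal order than the simpler model.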

Few-shot Learning and Encoding Approaches

Despite the known success of LLMs in few-shot and transfer learning, the paper demonstrates that ablated models match or exceed the performance of LLM-based methods even when trained on just 10% of the training data. This finding holds significant implications for scenarios with limited data availability.
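A few-shot setup of this kind amounts to training on a small slice of the training split while leaving validation and test data untouched. The sketch below keeps the earliest 10% of training windows; taking the earliest portion (rather than a random sample) is an assumption made here to preserve chronological order, and `few_shot_subset` is a hypothetical helper.

```python
def few_shot_subset(train_windows, fraction=0.1):
    """Keep the first `fraction` of the training windows, in time order."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    n_keep = max(1, int(len(train_windows) * fraction))
    return train_windows[:n_keep]

train_windows = list(range(100))   # stand-in for 100 training windows
subset = few_shot_subset(train_windows)
print(len(subset))                 # 10
```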

The study also explores various encoding strategies to understand the sources of performance in LLM-based models. It concludes that encoding techniques like patching combined with multi-head attention or simple transformers can yield effective representations, obviating the need for the full complexity of LLMs.
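Patching, in this sense, splits a series into short, possibly overlapping windows that are then treated as tokens by an attention layer or transformer. A minimal sketch, with `patch_len` and `stride` as assumed parameter names and values chosen only for illustration:

```python
def make_patches(series, patch_len, stride):
    """Split a 1-D series into windows of length `patch_len`, stepping by `stride`."""
    if patch_len <= 0 or stride <= 0:
        raise ValueError("patch_len and stride must be positive")
    last_start = len(series) - patch_len
    return [series[i:i + patch_len] for i in range(0, last_start + 1, stride)]

patches = make_patches(list(range(10)), patch_len=4, stride=2)
print(patches)
# [[0, 1, 2, 3], [2, 3, 4, 5], [4, 5, 6, 7], [6, 7, 8, 9]]
```

Each patch would then be projected to an embedding before being passed to the attention or transformer layers; a series shorter than `patch_len` yields no patches in this sketch.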

Implications and Future Directions

The findings indicate that LLMs may not justify their computational costs for traditional time series forecasting tasks. This gap between anticipated and actual utility invites researchers to re-evaluate the contexts in which LLMs are genuinely advantageous. Future work may focus on hybrid or multimodal applications where the natural language capabilities of LLMs can complement time series data, as suggested by emerging applications in social understanding and more general time series reasoning tasks.


By systematically dismantling popular LLM-based time series forecasting models, this paper critically reassesses the role of LLMs in such contexts, highlighting simpler yet equally robust alternatives. These insights serve to guide researchers in developing more efficient and effective time series models, encouraging a balanced approach between leveraging advanced language models and ensuring computational feasibility.

