Scaling-laws for Large Time-series Models (2405.13867v2)
Published 22 May 2024 in cs.LG and cs.AI
Abstract: Scaling laws for LLMs have provided useful guidance in training ever larger models for predictable performance gains. Time series forecasting shares a sequential structure similar to that of language, and is amenable to large-scale transformer architectures. Here we show that foundational decoder-only time series transformer models exhibit scaling behavior analogous to LLMs, with architectural details (aspect ratio and number of heads) having a minimal effect over broad ranges. We assemble a large corpus of heterogeneous time series data on which to train, and establish for the first time power-law scaling with parameter count, dataset size, and training compute, spanning five orders of magnitude.
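The abstract's central claim is power-law scaling of loss with parameter count, dataset size, and compute. As a rough illustration only (not the paper's code, data, or fitted constants), the sketch below fits the canonical form L(x) = (x_c / x)^α to hypothetical (model size, loss) pairs by linear regression in log-log space, then extrapolates to a larger model, which is how such laws are typically used to guide training runs.

```python
# Minimal sketch (not from the paper): fitting the canonical power-law form
# L(x) = (x_c / x) ** alpha reported by scaling-law studies for loss as a
# function of parameter count, dataset size, or training compute.
# The data points and fitted constants are illustrative placeholders.
import numpy as np

# Hypothetical (model size, validation loss) pairs spanning several orders
# of magnitude, standing in for a scaling sweep.
sizes = np.array([1e5, 1e6, 1e7, 1e8, 1e9])
losses = np.array([2.10, 1.55, 1.18, 0.92, 0.73])

# In log-log space the power law becomes linear:
#   log10(L) = -alpha * log10(x) + alpha * log10(x_c)
slope, intercept = np.polyfit(np.log10(sizes), np.log10(losses), deg=1)
alpha = -slope
x_c = 10 ** (intercept / alpha)
print(f"fitted alpha ~ {alpha:.3f}, x_c ~ {x_c:.3g}")

# Extrapolate the fitted law to a larger (hypothetical) model size.
predicted_loss = (x_c / 1e10) ** alpha
print(f"predicted loss at 1e10 parameters ~ {predicted_loss:.3f}")
```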