Forecasting with Deep Learning: Beyond Average of Average of Average Performance

(2406.16590)
Published Jun 24, 2024 in stat.ML and cs.LG

Abstract

Accurate evaluation of forecasting models is essential for ensuring reliable predictions. Current practices for evaluating and comparing forecasting models focus on summarising performance into a single score, using metrics such as SMAPE. We hypothesize that averaging performance over all samples dilutes relevant information about the relative performance of models, particularly under conditions in which this relative performance differs from the overall accuracy. We address this limitation by proposing a novel framework for evaluating univariate time series forecasting models from multiple perspectives, such as one-step ahead forecasting versus multi-step ahead forecasting. We show the advantages of this framework by comparing a state-of-the-art deep learning approach with classical forecasting techniques. While classical methods (e.g. ARIMA) are long-standing approaches to forecasting, deep neural networks (e.g. NHITS) have recently shown state-of-the-art forecasting performance on benchmark datasets. We conducted extensive experiments that show NHITS generally performs best, but its superiority varies with forecasting conditions. For instance, concerning the forecasting horizon, NHITS only outperforms classical approaches for multi-step ahead forecasting. Another relevant insight is that, when dealing with anomalies, NHITS is outperformed by methods such as Theta. These findings highlight the importance of aspect-based model evaluation.

Figure: Probability of NHITS outperforming other approaches across various time series.

Overview

  • The paper introduces a new framework for evaluating univariate time series forecasting models, highlighting deficiencies in conventional methods that rely on single average scores.

  • This framework was applied to compare a state-of-the-art deep learning model, NHITS, with traditional methods like ARIMA and ETS, revealing that NHITS generally outperforms these methods, particularly in multi-step ahead forecasting.

  • It was found that while NHITS excels overall, its performance varies under different conditions, such as the presence of anomalies and the sampling frequency of the data, suggesting a need for more nuanced approaches to building and evaluating forecasting models.

The paper "Forecasting with Deep Learning: Beyond Average of Average of Average Performance" by Vitor Cerqueira, Luis Roque, and Carlos Soares presents an innovative framework for evaluating univariate time series forecasting models. This new framework addresses deficiencies in conventional evaluation methods and provides a more nuanced understanding of model performance under various conditions.

Core Hypothesis and Methodology

The traditional evaluation of forecasting models often relies on summarizing performance metrics such as SMAPE into a single average score across all samples. The authors hypothesize that this approach dilutes meaningful information about models' relative performance, especially in scenarios where overall accuracy can be misleading. To address this, they propose evaluating forecasting models from multiple perspectives, including one-step ahead versus multi-step ahead forecasting.
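
The contrast between a single pooled score and a per-perspective breakdown can be sketched as follows. This is a minimal illustration with synthetic data, not the paper's code; the error structure (noise growing with the forecasting step) is an assumption chosen to make the point visible:

```python
import numpy as np

def smape(y_true, y_pred):
    """Symmetric mean absolute percentage error, in percent."""
    denom = (np.abs(y_true) + np.abs(y_pred)) / 2
    return 100 * np.mean(np.abs(y_true - y_pred) / denom)

# Hypothetical forecasts for one series: shape (n_windows, horizon)
rng = np.random.default_rng(0)
actual = rng.uniform(50, 150, size=(20, 12))
# Assumed error model: noise std grows with the forecasting step
forecast = actual + rng.normal(0, np.arange(1, 13), size=(20, 12))

# Conventional evaluation: one score averaged over everything
overall = smape(actual, forecast)

# Multi-perspective evaluation: one score per forecasting step
per_step = [smape(actual[:, h], forecast[:, h]) for h in range(12)]
print(f"overall SMAPE: {overall:.2f}")
print(f"one-step ahead: {per_step[0]:.2f}, 12-step ahead: {per_step[-1]:.2f}")
```

The single `overall` score hides that one-step-ahead errors are far smaller than 12-step-ahead errors; the per-step view recovers exactly the kind of information the authors argue gets diluted.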

They demonstrate the advantages of their framework by comparing a state-of-the-art deep learning model, NHITS, with classical methods like ARIMA, ETS, SNaive, RWD, SES, and Theta. Their extensive experiments reveal that while NHITS generally performs best, its advantages vary depending on specific forecasting conditions. For instance, although NHITS outperforms traditional methods for multi-step ahead forecasting, it is outperformed by the Theta method when dealing with anomalies.

Evaluation Metrics and Results

The research includes an extensive empirical analysis covering benchmark datasets such as M3, Tourism, and M4. These datasets encompass a diversity of time series with different sampling frequencies, providing a robust basis for evaluation.

Key evaluation metrics employed include:

  • Overall SMAPE: NHITS exhibits superior performance across all time series, achieving better SMAPE scores than classical methods.
  • SMAPE Expected Shortfall: the average SMAPE over each model's worst-performing cases; NHITS holds a competitive edge in these worst-case scenarios, suggesting a robust performance profile.
  • Win/Loss Ratios: NHITS maintains favorable win rates even with a 5% region of practical equivalence (ROPE), within which score differences count as ties, indicating reasonable reliability across diverse conditions.
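
The latter two metrics can be sketched as below. The function names, the 95% cutoff, and the treatment of the ROPE as a symmetric band in percentage points are illustrative assumptions; the paper's exact definitions may differ:

```python
import numpy as np

def smape_expected_shortfall(scores, alpha=0.95):
    """Average SMAPE over the worst (1 - alpha) share of series."""
    scores = np.asarray(scores)
    cutoff = np.quantile(scores, alpha)
    return scores[scores >= cutoff].mean()

def win_draw_loss(scores_a, scores_b, rope=5.0):
    """Per-series comparison of model A vs. model B (lower SMAPE wins).

    Differences smaller than `rope` percentage points fall inside the
    region of practical equivalence and count as draws.
    """
    diff = np.asarray(scores_a) - np.asarray(scores_b)
    wins = float(np.mean(diff < -rope))
    losses = float(np.mean(diff > rope))
    return wins, 1.0 - wins - losses, losses

# Synthetic per-series SMAPE scores for two models (illustrative only)
rng = np.random.default_rng(1)
nhits = rng.gamma(shape=2.0, scale=5.0, size=1000)
arima = nhits + rng.normal(1.0, 4.0, size=1000)

print(smape_expected_shortfall(nhits))        # worst-case profile
print(win_draw_loss(nhits, arima, rope=5.0))  # (wins, draws, losses)
```

Expected shortfall summarizes the tail of the error distribution rather than its mean, which is why it surfaces robustness differences that an overall average hides.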

Despite its strong overall performance, NHITS is notably less competitive when forecasting anomalous observations, where it trails behind methods like SES and Theta.

Detailed Analysis

The paper reveals several critical insights through detailed analyses:

  1. Sampling Frequency: NHITS is most effective for higher-frequency data (e.g., monthly), but its advantage diminishes for lower-frequency data (e.g., yearly).
  2. Forecasting Horizon: The model excels at multi-step ahead forecasting, aligning with its design optimizations for long-horizon predictions.
  3. Relative Performance Variability: While NHITS performs best overall, it is outperformed in a significant percentage of time series, illustrating that no single model uniformly dominates.
  4. Problem Difficulty: NHITS's performance advantage shrinks on harder problems, where difficulty is estimated from the performance of the SNaive baseline.
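
The aspect-based breakdowns above amount to grouping per-series scores by a condition before averaging. A minimal sketch with pandas, using invented scores (the column names and values are illustrative, not the paper's results):

```python
import pandas as pd

# Hypothetical per-series SMAPE scores; values are illustrative only
results = pd.DataFrame({
    "frequency": ["monthly", "monthly", "yearly", "yearly"] * 2,
    "model":     ["NHITS", "ARIMA"] * 4,
    "smape":     [8.1, 9.5, 12.3, 11.8, 7.9, 9.9, 12.7, 11.5],
})

# Aspect-based view: mean SMAPE per model within each sampling frequency
by_freq = results.groupby(["frequency", "model"])["smape"].mean().unstack()
print(by_freq)
```

A pooled average over all series would pick a single winner; the grouped table instead exposes condition-dependent rankings, such as a deep model leading on monthly data while a classical method leads on yearly data.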

Implications for Future Research

The findings from this paper suggest several avenues for future research in AI and time series forecasting:

  • Model Robustness: Developing deep learning models better equipped to handle anomalous data can enhance their applicability in real-world scenarios where outliers are common.
  • Granular Evaluation Frameworks: Broader adoption of multi-perspective evaluation frameworks can drive the creation of more versatile and reliable forecasting models.
  • Adaptive Techniques: Exploring adaptive techniques that leverage the strengths of classical methods in specific scenarios (e.g., anomalies) could create hybrid models with superior performance.

Conclusion

In summary, the paper presents a comprehensive and insightful analysis of time series forecasting models. By moving beyond average performance metrics, this research highlights the nuanced strengths and weaknesses of state-of-the-art deep learning methods relative to classical approaches. The novel evaluation framework proposed serves as a valuable tool for advancing both theoretical understanding and practical application in the field of time series forecasting.
