When ML Forecasts Underperform Statistics: The Uncomfortable Truth

Discover why simple statistical models often beat complex machine learning in forecasting competitions and real-world scenarios. Learn when to use each approach.

DemandPlan Team · December 19, 2025 · 10 min read

Tags: forecasting accuracy, machine learning, statistical modeling, demand planning

It's the pitch every demand planner has heard in the last two years: "Switch to AI-powered forecasting and watch your error rates drop by 50% overnight."

The promise is seductive. We want to believe that if we just feed enough data into a complex neural network, it will magically predict the unpredictable. But if you've actually tried to implement a pure Machine Learning (ML) solution for a standard SKU portfolio, you might have discovered an uncomfortable truth: sometimes, the expensive "black box" loses to a simple moving average.

You aren't doing it wrong. In fact, you're experiencing a phenomenon well-documented in data science but rarely discussed in sales decks. Complexity does not equal accuracy.

In this article, we'll break down why simpler statistical models often outperform complex ML algorithms, what the latest forecasting competitions tell us, and how to build a pragmatic forecasting strategy that uses the right tool for the job.

What the Competitions Actually Showed (M4 & M5)

The "Olympics" of forecasting are the M-Competitions (Makridakis Competitions), where research teams from around the world compete to predict thousands of time series. The results of the most recent competitions provide a reality check for the "AI is always better" crowd.

The M4 Competition: Stats and Hybrids Win

In the M4 competition (2018), pure Machine Learning methods performed surprisingly poorly. In fact, 12 of the top 17 performing methods were combinations of statistical approaches. The winner was a hybrid model (ES-RNN) that combined Exponential Smoothing (stats) with Neural Networks (ML).

Crucially, many pure Deep Learning submissions failed to beat a simple statistical baseline.

The M5 Competition: The Tides Turn (With a Catch)

By the M5 competition (2020), Machine Learning models (specifically gradient boosting trees like LightGBM) took the top spots.

However, there is a massive caveat. The M5 dataset was hierarchical retail data (Walmart sales) rich with external variables like prices, promotions, and events. ML won because it could ingest these external signals. For the vast majority of companies forecasting purely on sales history without clean external drivers, the M4 results—where stats held their ground—are often more relevant.

Four Scenarios Where Simple Stats Beat ML

If your boss is asking why you haven't "modernized" to a Deep Learning model yet, here are four technical reasons why a statistical approach might actually be the superior engineering choice.

1. Limited Historical Data (The "Cold Start" Problem)

Machine Learning models are data-hungry. They need massive datasets to learn patterns without memorizing noise. If you are forecasting a new product with only 6 months of history, a neural network will almost certainly overfit, finding patterns that aren't there.

A simple exponential smoothing model, however, can generate a reasonable baseline with just a few data points.
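As a minimal sketch (with made-up sales figures and an illustrative alpha), simple exponential smoothing needs only one parameter, so six observations are enough to produce a sane baseline:

```python
# Minimal sketch: simple exponential smoothing on a short history.
# One parameter (alpha) vs. the thousands of weights a neural network
# would need to estimate from the same six points. Data is illustrative.

def exponential_smoothing(history, alpha=0.3):
    """Return the one-step-ahead forecast after smoothing `history`."""
    level = history[0]
    for y in history[1:]:
        level = alpha * y + (1 - alpha) * level
    return level

# Six months of sales for a new product
sales = [120, 135, 128, 140, 150, 145]
print(round(exponential_smoothing(sales, alpha=0.3), 1))  # → 138.6
```

With so little flexibility, the model cannot memorize noise even if it wanted to; that constraint is exactly what protects it in the cold-start regime.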

2. High Noise-to-Signal Ratio

In demand planning, "noise" is the random fluctuation in sales that cannot be predicted. "Signal" is the underlying trend or seasonality.

When you feed a noisy dataset to a complex ML model, the model often tries to "learn the noise," mistaking random spikes for repeatable patterns. Statistical models are designed specifically to smooth out this noise and isolate the trend.
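A toy, fully deterministic illustration (all numbers made up): the raw series jumps ±15 around a steady trend, and even a 3-period moving average cancels most of that noise:

```python
# Noise vs. signal: the raw observations sit +/-15 off a steady trend;
# a 3-period moving average tracks the trend three times more closely.
# All numbers are synthetic and chosen for clean arithmetic.

trend = [100 + 2 * t for t in range(12)]                # the signal
noise = [15 if t % 2 == 0 else -15 for t in range(12)]  # pure noise
noisy = [s + n for s, n in zip(trend, noise)]

def moving_average(series, window=3):
    return [sum(series[i:i + window]) / window
            for i in range(len(series) - window + 1)]

smoothed = moving_average(noisy)

# Mean absolute deviation from the true trend (centered alignment)
raw_err = sum(abs(y - s) for y, s in zip(noisy, trend)) / len(trend)
smooth_err = sum(abs(m - s) for m, s in zip(smoothed, trend[1:-1])) / len(smoothed)
print(raw_err, smooth_err)  # 15.0 5.0
```

A flexible ML model fit to `noisy` would happily chase each ±15 spike; the averaging operator is structurally incapable of doing so, which is the point.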

3. Univariate Data (No External Drivers)

If your dataset is just "Date" and "Sales Quantity," ML has very little "feature space" to work with. The power of ML comes from finding correlations between sales and other factors (weather, price, competitors). If you lack that external data, a statistical ARIMA or Holt-Winters model is often mathematically optimal.
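To make the univariate case concrete, here is a from-scratch sketch of Holt's linear trend method (a building block of Holt-Winters) on illustrative data; the smoothing parameters are arbitrary:

```python
# Sketch of Holt's linear method: level + trend, nothing else needed.
# With only "Date" and "Sales Quantity", this is the whole feature space.
# Data and parameters are illustrative.

def holt_forecast(history, alpha=0.5, beta=0.3, horizon=1):
    """Holt's linear trend: forecast `horizon` steps past the history."""
    level = history[0]
    trend = history[1] - history[0]
    for y in history[1:]:
        prev_level = level
        level = alpha * y + (1 - alpha) * (level + trend)
        trend = beta * (level - prev_level) + (1 - beta) * trend
    return level + horizon * trend

sales = [100, 104, 108, 112, 116, 120]  # steady +4/month growth
print(round(holt_forecast(sales), 1))   # → 124.0
```

On a clean trend like this, there is simply no residual structure left for a heavier model to exploit; extra capacity buys nothing.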

4. Interpretability Requirements

Try explaining to a VP of Sales why next month's forecast just dropped 20%.

  • Statistical Model: "The trend is down, and we are entering a historically low seasonal period." (Accepted).
  • Deep Learning Model: "Neuron 42 in layer 3 fired negatively." (Rejected).

If your organization requires "white box" transparency to trust the numbers, stats win every time.

Why ML Models "Fail" (Technically Speaking)

It's not just about data quantity; it's about the nature of the algorithms.

Overfitting and Feature Engineering

In traditional statistics, you might estimate 2 or 3 parameters (Level, Trend, Seasonality). In a Deep Learning model, you might be estimating millions of weights. With that much flexibility, the model can perfectly memorize your past sales—including the outliers you want to ignore—resulting in terrible predictions for the future.

Furthermore, ML isn't "plug and play." It requires aggressive feature engineering. You have to manually create features like "lagged sales," "rolling averages," and "holiday flags." If you do this poorly, you get "garbage in, garbage out."
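A sketch of what that manual feature engineering looks like in practice, using pandas; the column names, sales figures, and the "holiday" date are all hypothetical:

```python
# Typical hand-built features for an ML forecaster: lagged sales,
# a rolling average, and a calendar flag. If these are built poorly,
# the model downstream learns garbage. Data is illustrative.
import pandas as pd

df = pd.DataFrame({
    "date": pd.date_range("2025-01-01", periods=10, freq="D"),
    "sales": [50, 52, 48, 60, 55, 90, 53, 51, 49, 58],
})

# Lag features: what the model is allowed to "see" from the past
df["sales_lag_1"] = df["sales"].shift(1)
df["sales_lag_7"] = df["sales"].shift(7)

# Rolling average: a smoothed view of recent demand
df["sales_roll_3"] = df["sales"].rolling(window=3).mean()

# Hypothetical promo/holiday flag explaining the spike on Jan 6
df["is_holiday"] = df["date"].isin([pd.Timestamp("2025-01-06")]).astype(int)

print(df.tail(3))
```

Note that the first rows of every lag and rolling column are NaN; deciding how to handle those gaps is itself an engineering choice the "plug and play" pitch glosses over.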

The "Transformer" Degeneration

Recent research from 2024 and 2025, including papers like "Why Do Transformers Fail to Forecast Time Series In-Context?", highlights that while Large Language Models (LLMs) are great for text, they often struggle with time series. They can fail to capture the scale of the data or "degenerate" into simple linear predictions that a basic regression could have handled at 1% of the compute cost.

The "Good Enough" Baseline

Before you spend $200k on a data science project, established best practice involves a "tournament" approach. You don't start with the complex model; you start with the baseline.

  1. Naive Forecast: Predict that next month will be the same as this month.
  2. Seasonal Naive: Predict that next month will be the same as the same month last year.
  3. Exponential Smoothing (ETS): A standard statistical model.

Rule of Thumb: If your fancy ML model cannot beat the Seasonal Naive forecast by a significant margin (e.g., >10% reduction in error), it is not worth the deployment cost.
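The tournament logic above can be sketched in a few lines; the sales history and the "ML" candidate here are stand-ins with made-up numbers, used only to show the comparison and the >10% gate:

```python
# Baseline tournament sketch: naive vs. seasonal naive vs. a candidate
# model, scored on a 6-month holdout. History is synthetic monthly data
# with period-12 seasonality; the "candidate" is a stand-in for ML.

def mae(actuals, forecasts):
    return sum(abs(a - f) for a, f in zip(actuals, forecasts)) / len(actuals)

history = [100, 90, 120, 110, 130, 150, 105, 95, 125, 115, 135, 155,
           110, 100, 130, 120, 140, 160, 115, 105, 135, 125, 145, 165]
train, test = history[:18], history[18:]          # hold out last 6 months

naive = [train[-1]] * len(test)                   # next month = this month
seasonal_naive = history[6:12]                    # same months, one year earlier
candidate = [s + 8 for s in seasonal_naive]       # hypothetical ML forecast

baseline = mae(test, seasonal_naive)
improvement = (baseline - mae(test, candidate)) / baseline
print(mae(test, naive), baseline, mae(test, candidate))  # 30.0 10.0 2.0
print(improvement > 0.10)  # passes the deployment gate
```

Here the seasonal naive already crushes the plain naive (MAE 10 vs. 30), which is exactly why it, not the naive, should be the bar the ML model has to clear.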

Signs Machine Learning Might Actually Help

We aren't anti-ML. At DemandPlan, we use it extensively. But we use it where it wins. You should consider moving from Stats to ML if:

  • You have external drivers: You have clean, historical data on price changes, promotions, competitor actions, or weather, and you believe these drive demand.
  • Complex interactions: Sales of Product A cannibalize Product B, but only when Product C is on sale. ML is great at these non-linear relationships.
  • Cross-learning: You have thousands of related SKUs (e.g., different sizes of the same shirt). A global ML model can learn seasonality from the group and apply it to a specific SKU with short history.

The DemandPlan Approach: Pragmatic Intelligence

We believe in Adaptive Hierarchy. We don't force a Deep Learning model on a slow-moving spare part, and we don't use a simple moving average for a high-volume, promotion-driven beverage.

Our engine runs a tournament for every time series. We test statistical baselines against machine learning candidates.

  • If the SKU is stable and history is short, we stick to Stats.
  • If the SKU is volatile and we have promotion data, we promote it to ML.
  • For most items, we use a Hybrid approach (like the M4 winner), using stats to de-seasonalize the data and ML to predict the residuals.

This isn't "AI for AI's sake." It's engineering.
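The hybrid pattern can be sketched end to end in miniature. Everything here is illustrative: the data is synthetic, and a trivial linear fit stands in for the ML stage that would normally predict the de-seasonalized series:

```python
# Hybrid sketch: statistics estimate and remove seasonality, then a
# second model (a least-squares line standing in for ML) predicts the
# de-seasonalized level, and the forecast is re-seasonalized.

def seasonal_indices(series, period):
    """Average multiplicative index per position in the seasonal cycle."""
    overall = sum(series) / len(series)
    cycles = len(series) // period
    return [sum(series[i] for i in range(pos, len(series), period))
            / cycles / overall
            for pos in range(period)]

def linear_fit(series):
    """Least-squares slope/intercept -- the stand-in 'ML' stage."""
    n = len(series)
    x_mean, y_mean = (n - 1) / 2, sum(series) / n
    slope = (sum((x - x_mean) * (y - y_mean) for x, y in enumerate(series))
             / sum((x - x_mean) ** 2 for x in range(n)))
    return slope, y_mean - slope * x_mean

period = 4
sales = [80, 120, 90, 110, 88, 132, 99, 121]  # seasonal pattern + growth

idx = seasonal_indices(sales, period)
deseasonalized = [y / idx[t % period] for t, y in enumerate(sales)]
slope, intercept = linear_fit(deseasonalized)

# Forecast the next period: predict the level, then re-seasonalize
t = len(sales)
forecast = (intercept + slope * t) * idx[t % period]
print(round(forecast, 1))
```

In production the linear fit would be replaced by something like a gradient-boosted model over engineered features, but the division of labor is the same: stats own the seasonality, ML owns what is left.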

Conclusion

Forecasting is about reducing uncertainty, not increasing complexity. While Machine Learning is a powerful tool in the modern demand planner's arsenal, it is not a silver bullet. For many SKUs—perhaps even the majority of your tail spend—simple statistical models are more accurate, more robust, and easier to explain.

Don't let the hype cycle dictate your engineering strategy. Start simple, measure relentlessly, and add complexity only when it pays rent.


Ready to stop fighting with your models? See how DemandPlan's Adaptive Hierarchy engine automatically selects the best model for every SKU, or schedule a demo to talk to a planner who gets it.
