Machine Learning Trading Pitfalls and Concrete Failures

Most ML trading models that look good in research fail in production for predictable reasons. The failures are not exotic; they are the same five patterns repeated. Attach numeric thresholds to each one and you catch the problem before it costs you capital.

Pitfall 1: Label leakage

The most common killer. Examples: using a future close in a feature computed at t (for example close.shift(-1) in pandas, which is the next bar's value), normalizing with the full-series mean, or including a feature derived from the target. Symptom: backtested out-of-sample Sharpe is 3+ but live Sharpe is near zero. Test: train on 2018–2021, test on 2022. If test Sharpe exceeds train Sharpe by more than 30%, suspect leakage, not skill.
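A minimal sketch of the full-series normalization leak and the 30% check, using only the standard library. The function names, the toy price lists, and the min_periods default are illustrative assumptions, not part of any particular pipeline. The idea is that a point-in-time feature at bar t must not change when only the bars after t change.

```python
from statistics import mean, pstdev

def zscore_full_series(prices):
    """Leaky: scales every bar with the mean/std of the WHOLE series,
    so the value at bar t depends on prices that arrive after t."""
    mu, sigma = mean(prices), pstdev(prices)
    return [(p - mu) / sigma for p in prices]

def zscore_expanding(prices, min_periods=3):
    """Point-in-time: the value at bar t uses only prices[0..t].
    Returns None until min_periods bars exist (assumed warm-up rule)."""
    out = []
    for t in range(len(prices)):
        window = prices[: t + 1]
        sigma = pstdev(window) if len(window) >= min_periods else 0.0
        out.append((prices[t] - mean(window)) / sigma if sigma > 0 else None)
    return out

def suspicious_test_sharpe(train_sharpe, test_sharpe, tolerance=0.30):
    """True when test Sharpe beats train Sharpe by more than `tolerance`."""
    return test_sharpe > train_sharpe * (1 + tolerance)

# Two toy series that share the first four bars but have different futures.
prices = [100.0, 101.0, 99.0, 102.0, 104.0, 103.0]
later = prices[:4] + [150.0, 160.0]

# A clean feature at bar 3 is identical in both series; a leaky one is not.
leaky_changed = zscore_full_series(prices)[3] != zscore_full_series(later)[3]
safe_changed = zscore_expanding(prices)[3] != zscore_expanding(later)[3]
print("full-series z-score changed with the future:", leaky_changed)  # True
print("expanding z-score changed with the future:", safe_changed)     # False
print(suspicious_test_sharpe(train_sharpe=1.0, test_sharpe=1.4))      # True
```

The same "perturb the future, compare the past" check can be applied to any feature function, which makes it a cheap regression test for a feature pipeline.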

Pitfall 2: Random cross-validation

Random k-fold splits mix future data into the training set. A model that has "seen" 2023 while predicting 2022 is worthless. Fix: use TimeSeriesSplit with a purged gap of at least the holding period plus 5 bars. On daily bars, a 5-day holding strategy needs a gap of at least 10 bars.
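The splitting rule can be sketched in plain Python as below. This is a hand-rolled illustration, not scikit-learn itself: purged_walk_forward is a hypothetical helper, and the sample count and number of splits are made up. scikit-learn's TimeSeriesSplit accepts a gap argument that plays the same role as the gap here.

```python
def purged_walk_forward(n_samples, n_splits, gap):
    """Yield (train_idx, test_idx) pairs with an expanding training window.
    The `gap` bars immediately before each test block are dropped from
    training so labels that overlap the test period cannot leak in."""
    test_size = n_samples // (n_splits + 1)
    for k in range(n_splits):
        test_start = n_samples - (n_splits - k) * test_size
        train_end = test_start - gap  # exclusive end of the training window
        if train_end <= 0:
            raise ValueError("gap is too large for this many samples")
        train_idx = list(range(train_end))
        test_idx = list(range(test_start, test_start + test_size))
        yield train_idx, test_idx

holding_period = 5          # bars
gap = holding_period + 5    # the "holding period plus 5 bars" rule -> 10

splits = list(purged_walk_forward(n_samples=120, n_splits=3, gap=gap))
for train_idx, test_idx in splits:
    print(f"train 0..{train_idx[-1]}  |  test {test_idx[0]}..{test_idx[-1]}")
```

Every training index is strictly earlier than every test index, and the last training bar sits at least gap bars before the first test bar, which is the property random k-fold cannot give you.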

Pitfall 3: Hyperparameter tuning on the test set

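You try 200 configurations, pick the best on the test set, and the test set is now training data. The Deflated Sharpe Ratio corrects for this: if you ran N trials, the best Sharpe you should expect from luck alone grows roughly as σ·√(2·ln(N)), where σ is the standard deviation of the Sharpe estimates across trials. For N = 200, √(2·ln(200)) ≈ 3.26, so the bar for significance moves well above the single-trial level: think Sharpe > 2.5, not 1.0, with the exact figure depending on σ.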
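To make the scaling concrete, here is a small sketch of the √(2·ln(N)) term. It is only the leading-order approximation for the expected maximum of N independent standard normal draws, not the full Deflated Sharpe Ratio, which also accounts for sample length, skewness, and kurtosis. The sharpe_std_across_trials input is an assumption the reader has to estimate from their own trial results.

```python
import math

def expected_max_z(n_trials):
    """Leading-order approximation, sqrt(2 ln N), for the expected maximum
    of n_trials independent standard normal draws."""
    return math.sqrt(2.0 * math.log(n_trials))

def sharpe_hurdle(n_trials, sharpe_std_across_trials):
    """Rough Sharpe level the best of n_trials skill-less strategies reaches
    by luck, given the spread of Sharpe estimates across trials."""
    return sharpe_std_across_trials * expected_max_z(n_trials)

for n in (10, 50, 200, 1000):
    print(f"N={n:5d}  sqrt(2 ln N) = {expected_max_z(n):.2f}")
# For N=200 the multiplier is about 3.26.
```

The multiplier grows slowly with N, so the damage is front-loaded: the first few dozen trials raise the hurdle far more than the next few hundred.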

Pitfall 4: Regime decay

A gradient boosting model trained on 2010–2019 low-volatility equity markets breaks in 2020. Monitor rolling 60-day Sharpe in production; retire the model when it drops below 0.3 for two consecutive months. Retraining on rolling windows helps only if the regime persists — retraining into a transition gives you a model calibrated to the wrong environment.
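The monitoring rule might look like the sketch below. Several choices are assumptions rather than anything stated above: Sharpe is annualized with 252 periods and a zero risk-free rate, and "two consecutive months" is read as two month-end readings in a row. Both function names are hypothetical.

```python
import math
from statistics import mean, stdev

def rolling_sharpe(daily_returns, window=60, periods_per_year=252):
    """Annualized Sharpe over the trailing `window` returns at each bar.
    None until a full window exists; risk-free rate assumed to be zero."""
    out = []
    for t in range(len(daily_returns)):
        if t + 1 < window:
            out.append(None)
            continue
        w = daily_returns[t + 1 - window : t + 1]
        sd = stdev(w)
        out.append(mean(w) / sd * math.sqrt(periods_per_year) if sd > 0 else None)
    return out

def should_retire(month_end_sharpes, floor=0.3, consecutive=2):
    """True once `consecutive` month-end readings in a row fall below `floor`."""
    streak = 0
    for s in month_end_sharpes:
        streak = streak + 1 if (s is not None and s < floor) else 0
        if streak >= consecutive:
            return True
    return False

# Toy month-end readings of the rolling 60-day Sharpe (made-up numbers).
print(should_retire([1.2, 0.25, 0.6, 0.1]))        # False: no two in a row
print(should_retire([1.2, 0.25, 0.6, 0.1, 0.2]))   # True: 0.1 then 0.2
```

Requiring consecutive breaches, rather than a single one, is what keeps one bad month from retiring a model that is merely noisy.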

Pitfall 5: Feature instability

A feature with high importance in training but high variance across walk-forward folds is not robust. Compute feature importance per fold; drop any feature whose importance rank swings by more than 5 positions across folds. Stable models need stable features.
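One way to run this check, sketched with the standard library only. It assumes you already have one feature-to-importance mapping per walk-forward fold; the feature names and importance values are invented for illustration, and "swing" is taken to mean worst rank minus best rank across folds.

```python
def importance_ranks(importances):
    """Map feature name -> rank within one fold (1 = most important)."""
    ordered = sorted(importances, key=importances.get, reverse=True)
    return {name: i + 1 for i, name in enumerate(ordered)}

def unstable_features(importances_per_fold, max_swing=5):
    """Return {feature: swing} for features whose rank range across folds
    (worst rank minus best rank) exceeds max_swing."""
    ranks = [importance_ranks(fold) for fold in importances_per_fold]
    flagged = {}
    for name in ranks[0]:
        per_fold = [fold_ranks[name] for fold_ranks in ranks]
        swing = max(per_fold) - min(per_fold)
        if swing > max_swing:
            flagged[name] = swing
    return flagged

# Hypothetical features; importance falls from 8 down to 1 in fold_a.
names = ["momentum_20d", "vol_10d", "rsi_14", "spread",
         "volume_z", "gap_open", "carry", "skew_30d"]
fold_a = {n: float(8 - i) for i, n in enumerate(names)}
fold_b = dict(fold_a, momentum_20d=0.5)   # top feature collapses to last place
fold_c = dict(fold_a)

print(unstable_features([fold_a, fold_b, fold_c]))  # {'momentum_20d': 7}
```

Ranks are compared instead of raw importance values because raw importances are not on a comparable scale from one fold's model to the next.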

Concrete diagnostic numbers

Train and test Sharpe differ by more than a factor of 1.5 → overfitting or leakage
Standard deviation of Sharpe across walk-forward folds > 0.8 → unstable edge
Live Sharpe below 50% of backtest Sharpe after 100 trades → model is broken, not unlucky
Absolute correlation between a feature and the target > 0.9 → almost certainly leakage
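The four thresholds above can be bundled into a single check, sketched below. The function and argument names are hypothetical, the inputs are assumed to be computed elsewhere, and the first rule is read as "larger Sharpe divided by smaller Sharpe", which is one interpretation of the threshold.

```python
from statistics import stdev

def sharpe_gap_ratio(a, b):
    """Larger Sharpe divided by the smaller; infinite if either is <= 0."""
    lo, hi = min(a, b), max(a, b)
    return hi / lo if lo > 0 else float("inf")

def run_diagnostics(train_sharpe, test_sharpe, fold_sharpes, live_sharpe,
                    backtest_sharpe, n_live_trades, max_abs_feature_corr):
    """Return a list of warnings, one per threshold that is breached."""
    flags = []
    if sharpe_gap_ratio(train_sharpe, test_sharpe) > 1.5:
        flags.append("train/test Sharpe gap > 1.5x: overfitting or leakage")
    if stdev(fold_sharpes) > 0.8:
        flags.append("fold Sharpe std > 0.8: unstable edge")
    if n_live_trades >= 100 and live_sharpe < 0.5 * backtest_sharpe:
        flags.append("live Sharpe < 50% of backtest: model is broken")
    if max_abs_feature_corr > 0.9:
        flags.append("feature/target correlation > 0.9: likely leakage")
    return flags

# Made-up numbers for a model that trips three of the four checks.
report = run_diagnostics(
    train_sharpe=2.4, test_sharpe=1.2,
    fold_sharpes=[1.5, 0.2, 2.3, -0.4],
    live_sharpe=0.4, backtest_sharpe=1.2, n_live_trades=150,
    max_abs_feature_corr=0.35,
)
for line in report:
    print(line)
```

The live-versus-backtest rule is only evaluated once 100 trades exist, since below that sample size a shortfall is hard to distinguish from bad luck.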

When ML is the wrong tool

If you have under 500 labeled examples, under 3 years of data, or your edge is a known economic relationship, skip ML. A trend filter plus a risk rule will beat a deep model on small data, and you will understand why it works — which means you will know when it stops.

The discipline is not adding ML; it is refusing ML until the data, the validation, and the monitoring can support it.