Machine Learning Trading Pitfalls and Concrete Failures
Concrete machine learning trading pitfalls — data leakage, regime decay, overfitting — with specific failure patterns and numeric thresholds to detect them.
Most ML trading models that look good in research fail in production for predictable reasons. The failures are not exotic — they are the same five patterns repeated. Recognize them with numeric thresholds and you catch the problem before capital does.
Pitfall 1: Label leakage
The most common killer. Examples: using close[-1], a value not yet known at time t, in a feature computed at t; normalizing with the full-series mean; or including a feature derived from the target. Symptom: out-of-sample Sharpe is 3+ but live Sharpe is near zero. Test: train on 2018–2021, test on 2022. If test Sharpe exceeds train Sharpe by more than 30%, suspect leakage, not skill.
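A minimal sketch of the full-series normalization case, using only the standard library. The function names and the sample prices are illustrative assumptions. The leaky version lets later bars change the feature value at an earlier bar; the expanding version does not.

```python
import statistics

def zscore_full_series(values):
    """Leaky: the mean and stdev use every observation, including future ones."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    return [(v - mu) / sigma for v in values]

def zscore_expanding(values, min_periods=2):
    """Point-in-time: the statistics at index t use only values[0..t]."""
    out = []
    for t in range(len(values)):
        history = values[: t + 1]
        if len(history) < min_periods:
            out.append(None)  # not enough history yet
            continue
        mu = statistics.fmean(history)
        sigma = statistics.pstdev(history)
        out.append((values[t] - mu) / sigma if sigma > 0 else 0.0)
    return out

# Hypothetical closing prices; the last two bars are a sharp rally.
prices = [100.0, 101.0, 99.0, 102.0, 110.0, 125.0]
leaky = zscore_full_series(prices)
safe = zscore_expanding(prices)

# Recompute the feature at index 3 as if the last two bars had not happened yet.
leaky_at_3_without_future = zscore_full_series(prices[:4])[3]
safe_at_3_without_future = zscore_expanding(prices[:4])[3]
```

A quick leakage check along these lines: truncate the series at t, recompute the feature, and compare it with the value the full pipeline produced at t. Any difference means the feature depends on data after t.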
Pitfall 2: Random cross-validation
Random k-fold splits mix future observations into the training set. A model that has "seen" 2023 while predicting 2022 is worthless. Fix: use TimeSeriesSplit with a purged gap of at least the holding period plus 5 bars. A 5-day holding strategy on daily bars needs a gap of at least 10 bars.
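The split logic can be sketched in plain Python as below. This is an illustrative expanding-window generator, not the library implementation; the function name and sample sizes are assumptions. scikit-learn's `TimeSeriesSplit(n_splits=..., gap=...)` provides the same kind of gap between train and test indices.

```python
def purged_time_series_splits(n_samples, n_splits, gap):
    """Yield (train_indices, test_indices) with `gap` bars dropped between them.

    Expanding-window scheme: every test block lies strictly after its
    training block in time, separated by `gap` unused bars.
    """
    test_size = n_samples // (n_splits + 1)
    if test_size == 0:
        raise ValueError("not enough samples for the requested number of splits")
    for k in range(n_splits):
        test_start = n_samples - (n_splits - k) * test_size
        train_end = test_start - gap  # exclusive end of the training block
        if train_end <= 0:
            raise ValueError("gap leaves no training data for the first fold")
        train = list(range(train_end))
        test = list(range(test_start, test_start + test_size))
        yield train, test

holding_period = 5          # bars, as in the 5-day example above
gap = holding_period + 5    # holding period plus 5 bars
splits = list(purged_time_series_splits(n_samples=500, n_splits=4, gap=gap))
```

Each fold trains only on bars that end at least `gap` bars before the first test bar, so labels that span the holding period cannot overlap the test window.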
Pitfall 3: Hyperparameter tuning on the test set
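You try 200 configurations, pick the best on the test set, and the test set is now training data. The Deflated Sharpe Ratio corrects for this: if you ran N trials, the Sharpe required to claim significance rises roughly with √(2·ln(N)), the approximate expected maximum of N independent noise draws, measured in standard deviations of the Sharpe estimate across trials. For N = 200 that multiplier is about 3.26. As a rule of thumb, after 200 trials you need Sharpe > 2.5, not 1.0.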
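The √(2·ln(N)) scaling can be computed directly. The sketch below is only that approximation, not the full Deflated Sharpe Ratio, and it assumes the trials are independent with zero true Sharpe. The function names and the cross-trial standard deviation of 0.5 are hypothetical.

```python
import math

def expected_max_z(n_trials):
    """Approximate expected maximum of n_trials independent standard normals."""
    if n_trials < 2:
        return 0.0
    return math.sqrt(2.0 * math.log(n_trials))

def sharpe_hurdle(n_trials, sharpe_std_across_trials):
    """Sharpe the best trial should exceed before it looks better than noise.

    Assumes every trial has zero true Sharpe and that the Sharpe estimates
    vary across trials with the given standard deviation.
    """
    return expected_max_z(n_trials) * sharpe_std_across_trials

multiplier_200 = expected_max_z(200)
# Hypothetical dispersion of Sharpe estimates across the 200 configurations.
hurdle_200 = sharpe_hurdle(200, sharpe_std_across_trials=0.5)
```

The hurdle in Sharpe units depends on how widely the Sharpe estimates are dispersed across trials, so record every configuration you try, not only the winner.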
Pitfall 4: Regime decay
A gradient boosting model trained on 2010–2019 low-volatility equity markets breaks in 2020. Monitor rolling 60-day Sharpe in production; retire the model when it drops below 0.3 for two consecutive months. Retraining on rolling windows helps only if the regime persists — retraining into a transition gives you a model calibrated to the wrong environment.
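A minimal sketch of the monitoring rule, under stated assumptions: daily returns, a zero risk-free rate, annualization with 252 periods per year, and "two consecutive months" read as 42 consecutive daily readings. The function names and the synthetic return series are illustrative.

```python
import math
import statistics

def rolling_sharpe(daily_returns, window=60, periods_per_year=252):
    """Annualized Sharpe over each trailing window (risk-free rate taken as 0)."""
    out = []
    for end in range(window, len(daily_returns) + 1):
        chunk = daily_returns[end - window:end]
        mean = statistics.fmean(chunk)
        sigma = statistics.stdev(chunk)
        out.append(0.0 if sigma == 0 else mean / sigma * math.sqrt(periods_per_year))
    return out

def should_retire(sharpes, threshold=0.3, consecutive_bars=42):
    """True when the last `consecutive_bars` readings are all below threshold."""
    if len(sharpes) < consecutive_bars:
        return False
    return all(s < threshold for s in sharpes[-consecutive_bars:])

# Synthetic returns: 120 days with a positive drift, then 120 days with a negative one.
healthy = [0.01, -0.005] * 60
decayed = [0.005, -0.006] * 60

retire_healthy = should_retire(rolling_sharpe(healthy))
retire_after_decay = should_retire(rolling_sharpe(healthy + decayed))
```

Requiring the reading to stay below the threshold for the whole period, rather than touching it once, keeps a single bad week from retiring a model that is still working.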
Pitfall 5: Feature instability
A feature with high importance in training but high variance across walk-forward folds is not robust. Compute feature importance per fold; drop any feature whose importance rank swings by more than 5 positions across folds. Stable models need stable features.
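The rank-swing rule can be sketched as follows. The feature names and importance values are made up for illustration, and the helper names are hypothetical; in practice each dictionary would come from the model fitted on one walk-forward fold.

```python
def rank_features(importances):
    """Map feature name -> rank (1 = most important) for one fold."""
    ordered = sorted(importances, key=importances.get, reverse=True)
    return {name: i + 1 for i, name in enumerate(ordered)}

def unstable_features(importances_per_fold, max_rank_swing=5):
    """Return features whose rank range across folds exceeds max_rank_swing."""
    ranks = [rank_features(fold) for fold in importances_per_fold]
    flagged = []
    for name in ranks[0]:
        fold_ranks = [r[name] for r in ranks]
        if max(fold_ranks) - min(fold_ranks) > max_rank_swing:
            flagged.append(name)
    return flagged

# Hypothetical per-fold importances for eight features.
fold_a = {"mom_20d": 0.30, "vol_10d": 0.20, "rsi_14": 0.15, "spread": 0.12,
          "volume_z": 0.10, "gap": 0.06, "dow": 0.04, "skew_60d": 0.03}
fold_b = {"mom_20d": 0.02, "vol_10d": 0.25, "rsi_14": 0.18, "spread": 0.15,
          "volume_z": 0.14, "gap": 0.12, "dow": 0.08, "skew_60d": 0.06}

to_drop = unstable_features([fold_a, fold_b])
```

Here `mom_20d` ranks first in one fold and last in the other, a swing of 7 positions, so it is the only feature flagged.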
Concrete diagnostic numbers
- Train and test Sharpe differ by more than a factor of 1.5 → overfitting or leakage
- Standard deviation of Sharpe across walk-forward folds > 0.8 → unstable edge
- Live Sharpe below 50% of backtest Sharpe after 100 trades → model is broken, not unlucky
- Absolute correlation between a feature and the target > 0.9 → almost certainly leakage
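The four thresholds above can be bundled into one check, sketched below. The function names, argument names, and flag labels are hypothetical. Two assumptions go beyond the list: the train/test comparison is applied in both directions, and a non-positive Sharpe on either side is treated as a failure.

```python
import statistics

def sharpe_gap_exceeds(train_sharpe, test_sharpe, factor=1.5):
    """True when train and test Sharpe differ by more than `factor`."""
    if train_sharpe <= 0 or test_sharpe <= 0:
        return True  # assumption: a non-positive Sharpe counts as a failure
    ratio = max(train_sharpe, test_sharpe) / min(train_sharpe, test_sharpe)
    return ratio > factor

def diagnose(train_sharpe, test_sharpe, fold_sharpes,
             live_sharpe, backtest_sharpe, live_trades, max_abs_feature_corr):
    """Return the list of diagnostic flags that fire for one model."""
    flags = []
    if sharpe_gap_exceeds(train_sharpe, test_sharpe):
        flags.append("overfitting_or_leakage")
    if len(fold_sharpes) > 1 and statistics.stdev(fold_sharpes) > 0.8:
        flags.append("unstable_edge")
    if live_trades >= 100 and live_sharpe < 0.5 * backtest_sharpe:
        flags.append("model_broken")
    if max_abs_feature_corr > 0.9:
        flags.append("feature_target_leakage")
    return flags

# Hypothetical numbers for a model whose backtest looked better than its live record.
suspect = diagnose(train_sharpe=2.4, test_sharpe=1.2,
                   fold_sharpes=[1.1, 0.9, 1.3, 1.0],
                   live_sharpe=0.4, backtest_sharpe=1.2,
                   live_trades=150, max_abs_feature_corr=0.35)

clean = diagnose(train_sharpe=1.3, test_sharpe=1.1,
                 fold_sharpes=[1.1, 0.9, 1.3, 1.0],
                 live_sharpe=0.9, backtest_sharpe=1.2,
                 live_trades=150, max_abs_feature_corr=0.35)
```

The live-versus-backtest check is skipped until 100 trades have accumulated, since a short live record cannot separate a broken model from an unlucky one.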
When ML is the wrong tool
If you have under 500 labeled examples, under 3 years of data, or your edge is a known economic relationship, skip ML. A trend filter plus a risk rule will beat a deep model on small data, and you will understand why it works — which means you will know when it stops.
The discipline is not adding ML; it is refusing ML until the data, the validation, and the monitoring can support it.