Trading Data: Sources and Cleaning
A strategy is only as good as the data behind it. Learn where to get trading data, what makes it dirty, and how to clean it before a single backtest.
Garbage in, garbage out. Every backtest is only as trustworthy as the data underneath it.
Before any signal, any model, any strategy — there is data. And market data, by default, is dirty: gaps, splits, bad ticks, timezone errors, and survivorship bias all conspire to make your backtest lie. Cleaning the data is half the work.
Sources of trading data
| Source | Best for | Cost |
|---|---|---|
| Yahoo Finance / Stooq | Daily US equities, prototyping | Free |
| Polygon.io, Alpha Vantage | Intraday equities, fundamentals | Freemium |
| Binance / Coinbase / Kraken APIs | Crypto OHLCV and trades | Free |
| Interactive Brokers API | Multi-asset live + historical | Broker fee |
| Tiingo, EOD Historical | End-of-day global | Subscription |
| Databento, Algoseek | Tick-level, futures, options | Paid |
Match granularity to your strategy: tick data for microstructure, 1-minute for intraday, daily for swing. Don't pay for ticks you'll never use.
The dirty data checklist
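- Missing bars: exchange holidays and feed outages leave gaps. Forward-fill only short gaps, and avoid interpolation in backtests, because an interpolated value depends on the next observed price
- Bad ticks: spikes to zero or absurd values. Filter with a rolling z-score threshold (e.g., reject |z| > 10), computing the mean and standard deviation from bars before the one being tested so the spike cannot mask itself
- Stock splits: a 2-for-1 split shows up in raw prices as a fake 50% drop. Compute returns from split-adjusted prices
- Dividends: total return series differ from price series. Decide which you need
- Timezones: mix UTC and local time and your joins silently break. Standardize on UTC
- Look-ahead bias: a bar's close is only known once the bar has ended, so a bar stamped with its start time can look available earlier than it really was. Confirm whether your vendor stamps each bar at the start or the end of its interval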
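To make the split item concrete, here is a minimal sketch of back-adjusting raw closes. The prices, the split date, and the helper name `back_adjust` are all made up for illustration; real vendors usually supply an adjusted series or an adjustment factor, and their conventions may differ.

```python
import pandas as pd

# Hypothetical raw closes around a 2-for-1 split effective 2024-01-04.
raw_close = pd.Series(
    [100.0, 102.0, 51.5, 52.0],
    index=pd.to_datetime(["2024-01-02", "2024-01-03", "2024-01-04", "2024-01-05"]),
)
# Assumed input: split ratio (new shares per old share) keyed by effective date.
splits = pd.Series([2.0], index=pd.to_datetime(["2024-01-04"]))


def back_adjust(close: pd.Series, splits: pd.Series) -> pd.Series:
    """Divide every price before each split date by that split's ratio."""
    factor = pd.Series(1.0, index=close.index)
    for date, ratio in splits.items():
        # Bars strictly before the effective date are still in "old share" units.
        factor.loc[close.index < date] *= ratio
    return close / factor


adj_close = back_adjust(raw_close, splits)
raw_ret = raw_close.pct_change()   # shows a drop of roughly 50% on the split date
adj_ret = adj_close.pct_change()   # shows the real move, about +1%
```

Comparing `raw_ret` and `adj_ret` on the split date is a quick way to confirm whether a feed is already adjusted.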
Survivorship bias
If your historical universe is "today's S&P 500 constituents," your backtest will exclude every company that went bankrupt or was delisted. Returns look artificially high because the losers vanished.
Fix: use point-in-time constituent lists when available, or add a delisted return proxy. Free data almost never includes this — a hidden cost of free data.
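A point-in-time universe can be sketched as a membership table with start and end dates. The symbols, dates, and the `universe_on` helper below are hypothetical; a real constituent history would come from a data vendor.

```python
import pandas as pd

# Hypothetical membership table: one row per symbol and stint in the index.
# An open-ended stint has end = NaT.
membership = pd.DataFrame({
    "symbol": ["AAA", "BBB", "CCC"],
    "start": pd.to_datetime(["2015-01-01", "2015-01-01", "2019-06-01"]),
    "end": pd.to_datetime(["2018-03-15", None, None]),
})


def universe_on(date, membership: pd.DataFrame) -> list:
    """Return the symbols that were members on the given date."""
    date = pd.Timestamp(date)
    started = membership["start"] <= date
    not_ended = membership["end"].isna() | (date < membership["end"])
    return sorted(membership.loc[started & not_ended, "symbol"])


past = universe_on("2017-01-01", membership)   # includes AAA, which later left
today = universe_on("2020-01-01", membership)  # AAA is gone, CCC has joined
```

Backtesting 2017 with `today` instead of `past` would silently drop AAA, which is exactly the bias described above.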
Look-ahead and synchronization
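When joining multiple symbols, alignment matters. A common bug: using the daily close of symbol A to enter at the same day's close of symbol B. A's close is not known until the session has ended, and by then B's close has already printed, so the fill is impossible in real trading. The problem gets worse when the two markets close at different times.
Rule: a decision can only use information available when the bar closes, and it can only be filled afterwards. Apply .shift(1) to any signal derived from the close so that the position held during a bar depends only on data through the previous bar.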
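The effect of the shift can be seen on a toy series. This is a sketch with made-up prices and a deliberately naive signal, not a strategy.

```python
import pandas as pd

close = pd.Series(
    [100.0, 101.0, 103.0, 102.0, 105.0],
    index=pd.date_range("2024-01-01", periods=5, freq="D"),
)
ret = close.pct_change()

# Signal that is only known once the bar has closed: 1 if the bar closed up.
signal = (ret > 0).astype(float)

# Leaky: earns bar t's return using a signal computed from bar t's own close.
leaky_pnl = (signal * ret).sum()

# Tradable: the position held during bar t uses the signal from bar t-1.
position = signal.shift(1).fillna(0.0)
honest_pnl = (position * ret).sum()
```

The leaky version captures every up day by construction, so `leaky_pnl` is larger than `honest_pnl`. A backtest that looks this clean is usually a sign of a missing shift.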
Cleaning pipeline
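import numpy as np  # df: bars with a 'close' column and a naive DatetimeIndex recorded in UTC
df = df.tz_localize('UTC')  # if the index is already tz-aware, use df.tz_convert('UTC') instead
df = df[~df.index.duplicated(keep='first')].sort_index()
# Score each bar against the 100 bars before it; a window that includes the bar itself caps |z| at about 9.9
prior = df['close'].shift(1).rolling(100)
z = (df['close'] - prior.mean()) / prior.std()
df = df[~(z.abs() > 10)]  # drop bad ticks, keep warm-up rows where z is NaN
df = df.resample('1min').last().ffill()  # fill gaps only after bad ticks are removed
df['ret'] = np.log(df['close']).diff()  # log returns on the cleaned series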
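One detail of the z-score filter is worth checking on synthetic data. If the rolling window includes the bar being tested, the spike inflates its own mean and standard deviation, and with pandas' default sample standard deviation the score cannot exceed (n - 1) / sqrt(n), which is 9.9 for a 100-bar window. The sketch below uses a made-up random walk with one injected bad tick and compares both variants.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
close = pd.Series(100 + rng.normal(0, 0.1, 300).cumsum())
close.iloc[200] = 0.0  # inject a bad tick: price collapses to zero

window = 100

# In-window score: the spike is part of the statistics it is measured against.
z_in = (close - close.rolling(window).mean()) / close.rolling(window).std()

# Out-of-window score: statistics come from the 100 bars before the tested bar.
prior = close.shift(1).rolling(window)
z_out = (close - prior.mean()) / prior.std()

bound = (window - 1) / np.sqrt(window)  # 9.9 for window = 100
flagged_in = int((z_in.abs() > 10).sum())    # the in-window filter flags nothing
flagged_out = int((z_out.abs() > 10).sum())  # the shifted filter catches the spike
```

With a threshold of 10 and a 100-bar window, only the shifted variant can ever reject a bar, so either exclude the tested bar from the window or lower the threshold.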
Sanity checks before backtest
- Plot the cleaned series — does it look like the chart you know?
- Check annualized return of buy-and-hold — does it match published figures?
- Verify no future-looking features slipped in (run a "should be zero" lag check)
- Confirm universe membership matches reality at each timestamp
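One way to implement a "should be zero" check is a truncation test: compute a feature on the full history, recompute it on history cut off at some bar, and compare the overlapping values. A causal feature gives a difference of exactly zero. The two feature functions and the `lookahead_gap` helper are hypothetical examples, and this is only one possible reading of the check.

```python
import numpy as np
import pandas as pd


def causal_feature(close: pd.Series) -> pd.Series:
    # Trailing window: uses only the current and past bars.
    return close.rolling(5).mean()


def leaky_feature(close: pd.Series) -> pd.Series:
    # Centered window: peeks two bars into the future.
    return close.rolling(5, center=True).mean()


def lookahead_gap(feature_fn, close: pd.Series, cut: int) -> float:
    """Max absolute change in the feature at bars <= cut when later bars are removed."""
    full = feature_fn(close).iloc[: cut + 1]
    truncated = feature_fn(close.iloc[: cut + 1])
    either = full.notna() | truncated.notna()
    diff = (full[either] - truncated[either]).abs()
    # A value present in one run but missing in the other counts as a leak.
    return float(diff.fillna(np.inf).max())


rng = np.random.default_rng(1)
close = pd.Series(100 + rng.normal(0, 1, 60).cumsum())

causal_gap = lookahead_gap(causal_feature, close, cut=30)  # zero
leaky_gap = lookahead_gap(leaky_feature, close, cut=30)    # non-zero
```

Running this over several cut points for every feature in the pipeline is cheap and catches centered windows, negative shifts, and full-sample normalizations.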
Summary
Free data can be the most expensive data you'll ever use. Survivorship bias, bad ticks, and look-ahead leakage quietly inflate backtest results. Invest in clean, point-in-time data and a rigorous cleaning pipeline before you trust a single strategy result, because no signal survives garbage data.