Trading Data: Sources and Cleaning

Garbage in, garbage out. Every backtest is only as trustworthy as the data underneath it.

Before any signal, any model, any strategy — there is data. And market data, by default, is dirty: gaps, splits, bad ticks, timezone errors, and survivorship bias all conspire to make your backtest lie. Cleaning the data is half the work.

Sources of trading data

Source	Best for	Cost
Yahoo Finance / Stooq	Daily US equities, prototyping	Free
Polygon.io, Alpha Vantage	Intraday equities, fundamentals	Freemium
Binance / Coinbase / Kraken APIs	Crypto OHLCV and trades	Free
Interactive Brokers API	Multi-asset live + historical	Broker fee
Tiingo, EOD Historical	End-of-day global	Subscription
Databento, Algoseek	Tick-level, futures, options	Paid

Match granularity to your strategy: tick data for microstructure, 1-minute for intraday, daily for swing. Don't pay for ticks you'll never use.
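Coarser bars can always be rebuilt from finer data, but not the other way round. The sketch below aggregates a handful of made-up trades into 1-minute OHLCV bars with pandas; the `trades` frame and its `price` and `size` columns are illustrative assumptions, not any vendor's schema.

```python
import pandas as pd

# Hypothetical trade prints; column names are assumptions for illustration.
trades = pd.DataFrame(
    {"price": [100.0, 100.5, 99.8, 100.2, 100.9], "size": [10, 5, 8, 3, 7]},
    index=pd.to_datetime(
        [
            "2024-01-02 14:30:05",
            "2024-01-02 14:30:40",
            "2024-01-02 14:30:59",
            "2024-01-02 14:31:10",
            "2024-01-02 14:31:45",
        ],
        utc=True,
    ),
)

# Each bar is labeled with the start of its minute (the pandas default for '1min').
bars = trades["price"].resample("1min").ohlc()
bars["volume"] = trades["size"].resample("1min").sum()
print(bars)
```

Because the bars here are stamped at the start of the minute, the close of the 14:30 bar is only known at 14:31, which matters for the look-ahead checks later on.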

The dirty data checklist

Missing bars: exchange holidays, feed outages. Forward-fill only when justified, and be wary of interpolation, which uses the next observed value and so leaks future data into a backtest
Bad ticks: spikes to zero or absurd values. Filter with a rolling z-score threshold (e.g., reject |z| > 10), computed from a window that excludes the bar being tested
Stock splits: a 2-for-1 split shows up in raw prices as a fake 50% drop. Always use split-adjusted prices
Dividends: total return series differ from price series. Decide which you need
Timezones: mix UTC and local and your joins silently break. Standardize on UTC
Look-ahead bias: a bar stamped at its start time contains prices up to its end, so treating that stamp as the moment the close was known leaks the future. Confirm whether your source stamps each bar at its start or its end
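To make the split item concrete, here is a minimal sketch of back-adjusting a raw close series. The function name `back_adjust_for_splits` and the input layout (a series of split ratios indexed by the first post-split trading date) are assumptions made for this example; real vendors publish splits in varying formats, and many already supply adjusted prices.

```python
import pandas as pd

def back_adjust_for_splits(close: pd.Series, split_ratios: pd.Series) -> pd.Series:
    """Divide pre-split prices by the cumulative ratio of all later splits.

    close: raw close prices indexed by date.
    split_ratios: new shares per old share (2.0 for a 2-for-1 split), indexed
        by the first date that trades at the post-split price. Split dates
        missing from close.index are ignored in this sketch.
    """
    ratios = split_ratios.reindex(close.index).fillna(1.0)
    # Reverse cumulative product gives the product of ratios dated on or after
    # each bar; shifting by one keeps only splits strictly after the bar.
    later_splits = ratios.iloc[::-1].cumprod().iloc[::-1].shift(-1).fillna(1.0)
    return close / later_splits

raw = pd.Series(
    [100.0, 100.0, 50.0, 51.0],
    index=pd.date_range("2024-01-01", periods=4, freq="D"),
)
splits = pd.Series([2.0], index=[pd.Timestamp("2024-01-03")])
adjusted = back_adjust_for_splits(raw, splits)
print(adjusted)
```

In the raw series the third bar looks like a 50% crash; after adjustment the earlier prices are halved and the series is continuous across the split date.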

Survivorship bias

If your historical universe is "today's S&P 500 constituents," your backtest will exclude every company that went bankrupt, was delisted, or was dropped from the index along the way. Returns look artificially high because the losers vanished.

Fix: use point-in-time constituent lists when available, or add a proxy for delisting returns (an assumed final return applied when a symbol disappears from the data). Free data almost never includes either, which is a hidden cost of using it.
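A point-in-time lookup can be sketched as a membership table with one row per stint in the index. The table layout, the symbols, and the helper `universe_on` are all hypothetical; a real constituent history would come from a data vendor.

```python
import pandas as pd

# Hypothetical membership history. 'end' is NaT while the symbol is still a member.
membership = pd.DataFrame(
    {
        "symbol": ["AAA", "BBB", "CCC"],
        "start": pd.to_datetime(["2010-01-04", "2010-01-04", "2015-06-01"]),
        "end": pd.to_datetime(["2014-12-31", None, None]),
    }
)

def universe_on(date, table: pd.DataFrame) -> list:
    """Return the symbols that were members on the given date."""
    ts = pd.Timestamp(date)
    started = table["start"] <= ts
    not_ended = table["end"].isna() | (table["end"] >= ts)
    return sorted(table.loc[started & not_ended, "symbol"])

print(universe_on("2012-03-01", membership))
print(universe_on("2020-01-02", membership))
```

A backtest that asks for the universe at each rebalance date still trades AAA in 2012, even though AAA is absent from the latest list.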

Look-ahead and synchronization

When joining multiple symbols, alignment matters. A common bug: using the daily close of symbol A to enter on the same day's close of symbol B — impossible in real trading.

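Rule: a decision can only use information that was available before the price it trades at. If a signal is computed from the close of bar t, the earliest realistic fill is on bar t+1, so lag any close-derived feature with .shift(1) before aligning it with returns.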
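The size of the leak is easy to see on synthetic data. In this sketch, symbol B's returns are simulated as symbol A's returns plus noise (an assumption chosen to make the effect obvious), and the signal is simply the sign of A's return.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
ret_a = rng.normal(0, 0.01, 500)
ret_b = ret_a + rng.normal(0, 0.005, 500)  # B moves with A on the same day

closes = pd.DataFrame(
    {
        "A": 100 * np.exp(np.cumsum(ret_a)),
        "B": 50 * np.exp(np.cumsum(ret_b)),
    }
)
log_ret = np.log(closes).diff()

# Signal from A's close on day t: +1 if A rose, -1 if it fell.
signal = np.sign(log_ret["A"])

# Leaky: earns B's day-t return with a signal only known at day t's close.
leaky_pnl = (signal * log_ret["B"]).sum()

# Honest: the day-t signal is applied to B's return on day t+1.
honest_pnl = (signal.shift(1) * log_ret["B"]).sum()

print(f"leaky: {leaky_pnl:.3f}  honest: {honest_pnl:.3f}")
```

The leaky version looks like a strong strategy purely because it trades on information it could not have had; once the signal is lagged, the apparent edge largely disappears.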

Cleaning pipeline

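import numpy as np
import pandas as pd
# Assumes a DatetimeIndex and a 'close' column; naive timestamps are treated as UTC.
df.index = pd.to_datetime(df.index, utc=True)
df = df[~df.index.duplicated(keep='first')].sort_index()
# Score each bar against the 100 bars before it; a window that includes the bar caps |z| at 9.9.
prior = df['close'].shift(1).rolling(100)
z = (df['close'] - prior.mean()) / prior.std()
df = df[z.isna() | (z.abs() < 10)]  # keep warm-up rows, where z is NaN
# Remove bad ticks before filling, so ffill cannot copy a spike into a gap.
df = df.resample('1min').last().ffill()
df['ret'] = np.log(df['close']).diff()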
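Whether the rolling window includes the bar under test matters more than it looks. With a sample standard deviation over n points, no single point can score above (n - 1) / sqrt(n), which is 9.9 for n = 100, so a threshold of 10 can never trigger if the bar is part of its own window. The sketch below checks this on synthetic prices with one injected zero print; the data and variable names are illustrative only.

```python
import numpy as np
import pandas as pd

n = 100
rng = np.random.default_rng(1)
close = pd.Series(100 + rng.normal(0, 0.1, 300))
close.iloc[200] = 0.0  # injected bad tick: price prints as zero

# Window includes the bar being tested.
roll_in = close.rolling(n)
z_in = (close - roll_in.mean()) / roll_in.std()

# Window covers only the n bars before the bar being tested.
roll_ex = close.shift(1).rolling(n)
z_ex = (close - roll_ex.mean()) / roll_ex.std()

bound = (n - 1) / np.sqrt(n)
print("largest |z|, bar included:", z_in.abs().max(), "bound:", bound)
print("|z| at the bad tick, bar excluded:", abs(z_ex.iloc[200]))
```

With the bar included, even a print of zero stays under the threshold; with the bar excluded, the same print scores far beyond it.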

Sanity checks before backtest

Plot the cleaned series — does it look like the chart you know?
Check annualized return of buy-and-hold — does it match published figures?
Verify no future-looking features slipped in (run a "should be zero" lag check)
Confirm universe membership matches reality at each timestamp
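Two of these checks are easy to automate. The sketch below computes a buy-and-hold annualized return to compare against published figures, and implements one possible reading of the "should be zero" check: recompute a feature with all later data removed and count the bars whose values change. Both helper names are hypothetical, and the truncation test is an assumption about what the lag check could look like, not a standard routine.

```python
import numpy as np
import pandas as pd

def annualized_return(close: pd.Series) -> float:
    """Compound annual growth rate of buy-and-hold over the series' calendar span."""
    years = (close.index[-1] - close.index[0]).days / 365.25
    return (close.iloc[-1] / close.iloc[0]) ** (1 / years) - 1

def count_lookahead_mismatches(close: pd.Series, feature_fn, cut: int) -> int:
    """Count bars up to position `cut` whose feature changes when later bars are removed.

    A feature that only looks backward gives 0; anything else used future data.
    """
    full = feature_fn(close).iloc[: cut + 1]
    truncated = feature_fn(close.iloc[: cut + 1])
    same = np.isclose(full.to_numpy(), truncated.to_numpy(), equal_nan=True)
    return int((~same).sum())

prices = pd.Series(np.linspace(100, 120, 50))

def backward_feature(c):
    return c.pct_change().rolling(5).mean()  # past data only

def forward_feature(c):
    return c.shift(-1) / c - 1  # peeks at the next bar

print(count_lookahead_mismatches(prices, backward_feature, cut=30))
print(count_lookahead_mismatches(prices, forward_feature, cut=30))
```

Running the truncation test at several cut points gives more confidence than a single one, since a leak may only affect bars near the end of the available data.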

Summary

Free data is the most expensive data you'll ever use. Survivorship bias, bad ticks, and look-ahead leakage quietly inflate every backtest. Invest in clean, point-in-time data and a rigorous cleaning pipeline before you trust a single strategy result — because no signal survives garbage data.