blog · ~6 min read

Trading Data: Sources and Cleaning

A strategy is only as good as the data behind it. Learn where to get trading data, what makes it dirty, and how to clean it before a single backtest.

By tradernewbie · Curated for beginners
#algorithmic #quant-trading

Garbage in, garbage out. Every backtest is only as trustworthy as the data underneath it.

Before any signal, any model, any strategy — there is data. And market data, by default, is dirty: gaps, splits, bad ticks, timezone errors, and survivorship bias all conspire to make your backtest lie. Cleaning the data is half the work.

Sources of trading data

Source                           | Best for                        | Cost
Yahoo Finance / Stooq            | Daily US equities, prototyping  | Free
Polygon.io, Alpha Vantage        | Intraday equities, fundamentals | Freemium
Binance / Coinbase / Kraken APIs | Crypto OHLCV and trades         | Free
Interactive Brokers API          | Multi-asset live + historical   | Broker fee
Tiingo, EOD Historical           | End-of-day global               | Subscription
Databento, Algoseek              | Tick-level, futures, options    | Paid

Match granularity to your strategy: tick data for microstructure, 1-minute for intraday, daily for swing. Don't pay for ticks you'll never use.
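As a rough illustration of moving between granularities, the sketch below aggregates a handful of made-up trades into 1-minute OHLC bars with pandas. The `ticks` frame and its `price` and `size` columns are hypothetical; real feeds name and shape these fields differently.

```python
import pandas as pd

# Hypothetical tick data: one trade per row, timestamps in UTC.
ticks = pd.DataFrame(
    {
        "price": [100.0, 100.5, 99.8, 100.2, 101.0],
        "size": [10, 5, 20, 15, 30],
    },
    index=pd.to_datetime(
        [
            "2024-01-02 14:30:05",
            "2024-01-02 14:30:40",
            "2024-01-02 14:31:10",
            "2024-01-02 14:31:55",
            "2024-01-02 14:33:20",
        ],
        utc=True,
    ),
)

# Aggregate trades into 1-minute OHLC bars plus traded volume.
# By default each bar is labeled with the start of its minute.
bars = ticks["price"].resample("1min").ohlc()
bars["volume"] = ticks["size"].resample("1min").sum()

# Minutes with no trades come back with NaN prices and zero volume.
empty_minutes = bars[bars["volume"] == 0]
```

Going from ticks to bars is always possible; going the other way is not, which is why the finest granularity you actually need is the one worth paying for.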

The dirty data checklist

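  1. Missing bars: exchange holidays, feed outages. Forward-fill or interpolate only when justified; interpolation in particular uses the next observed price, which leaks future information into the gap
  2. Bad ticks: spikes to zero or absurd values. Filter with a rolling z-score threshold (e.g., reject |z| > 10), computing the mean and standard deviation from prior bars only so a spike cannot inflate its own baseline
  3. Stock splits: a 2-for-1 split makes raw prices show a fake 50% drop. Always use split-adjusted prices
  4. Dividends: total-return series differ from price series. Decide which you need
  5. Timezones: mix UTC and local time and your joins silently break. Standardize on UTC
  6. Look-ahead bias: treating a bar's close as known before the bar has finished. This happens when a bar is stamped with the start of its interval but is read as if the close were available at that moment. Confirm whether each bar's timestamp marks the start or the end of the interval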
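To make the stock-split item concrete, here is a minimal sketch of back-adjusting raw closes for a split. The prices, the split date, and the `back_adjust` helper are all hypothetical; most vendors ship an adjusted series, and their exact adjustment conventions may differ from this one.

```python
import pandas as pd

# Hypothetical raw closes around a 2-for-1 split effective 2024-01-04.
raw = pd.Series(
    [200.0, 202.0, 101.5, 102.0],
    index=pd.to_datetime(["2024-01-02", "2024-01-03", "2024-01-04", "2024-01-05"]),
    name="close",
)
# Split ratios keyed by effective date (2.0 means 2 new shares per old share).
splits = pd.Series([2.0], index=pd.to_datetime(["2024-01-04"]))


def back_adjust(close: pd.Series, splits: pd.Series) -> pd.Series:
    """Divide every price before each split date by that split's ratio."""
    factor = pd.Series(1.0, index=close.index)
    for date, ratio in splits.items():
        factor[close.index < date] *= ratio
    return close / factor


adjusted = back_adjust(raw, splits)
raw_ret = raw.pct_change()       # shows a fake drop of about 50% on the split date
adj_ret = adjusted.pct_change()  # shows the small real move instead
```

Comparing `raw_ret` with `adj_ret` on the split date is a quick way to see whether a series you downloaded has already been adjusted.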

Survivorship bias

If your historical universe is "today's S&P 500 constituents," your backtest will exclude every company that went bankrupt or was delisted. Returns look artificially high because the losers vanished.

Fix: use point-in-time constituent lists when available, or add a delisted return proxy. Free data almost never includes this — a hidden cost of free data.
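A point-in-time universe can be sketched as a membership table with start and end dates, filtered by the backtest date. The symbols, dates, and the `universe_on` helper below are invented for illustration; real constituent histories come from a data vendor.

```python
import pandas as pd

# Hypothetical membership table: one row per stint in the index.
# A missing end date means the symbol is still a member.
membership = pd.DataFrame(
    {
        "symbol": ["AAA", "BBB", "CCC"],
        "start": pd.to_datetime(["2010-01-01", "2010-01-01", "2015-06-01"]),
        "end": pd.to_datetime(["2014-03-31", None, None]),
    }
)


def universe_on(date: str, membership: pd.DataFrame) -> list[str]:
    """Return the symbols that were members on the given date."""
    ts = pd.Timestamp(date)
    active = (membership["start"] <= ts) & (
        membership["end"].isna() | (membership["end"] >= ts)
    )
    return sorted(membership.loc[active, "symbol"])


print(universe_on("2012-01-01", membership))  # ['AAA', 'BBB']
print(universe_on("2020-01-01", membership))  # ['BBB', 'CCC']
```

A backtest that only knows the 2020 list never trades AAA, even though AAA was tradable, and possibly losing money, in 2012.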

Look-ahead and synchronization

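When joining multiple symbols, alignment matters. A common bug: using the daily close of symbol A to enter at the same day's close of symbol B. In live trading you cannot see A's final close and still fill at B's close in the same instant, and if the two markets close at different times the bars do not even cover the same period.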

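Rule: a decision made on a bar can only use information available when that bar closes, and the resulting position earns the next bar's return, not the current one. Use .shift(1) on any feature derived from the close price so it lines up with the return it could actually have captured.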
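The effect of the shift is easy to see on a toy example. The prices and the moving-average signal below are made up purely to show the mechanics; they are not a strategy.

```python
import pandas as pd

# Hypothetical daily closes.
close = pd.Series(
    [100.0, 101.0, 103.0, 102.0, 105.0, 104.0],
    index=pd.date_range("2024-01-01", periods=6, freq="D"),
)
ret = close.pct_change()

# Toy signal: long when today's close is above the 3-day mean.
signal = (close > close.rolling(3).mean()).astype(int)

# Leaky: multiplies today's return by a signal that already saw today's close.
leaky_pnl = (signal * ret).sum()

# Honest: the signal from bar t is only applied to the return of bar t+1.
honest_pnl = (signal.shift(1) * ret).sum()
```

On this data the leaky version comes out positive and the shifted version negative, which is the typical direction of the error: look-ahead makes a signal appear to work.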

Cleaning pipeline

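import numpy as np
# Assumes a DatetimeIndex and a 'close' column; naive timestamps are taken to be UTC already
df = df.tz_localize('UTC') if df.index.tz is None else df.tz_convert('UTC')
df = df[~df.index.duplicated(keep='first')].sort_index()
# Score each close against the previous 100 bars only, so a spike cannot hide inside its own window
prev = df['close'].shift(1).rolling(100)
z = (df['close'] - prev.mean()) / prev.std()
df = df[(z.abs() < 10) | z.isna()]  # keep warm-up rows where z is undefined
# Fill short gaps only (the 5-bar limit is an arbitrary example); longer outages stay NaN
df = df.resample('1min').last().ffill(limit=5)
df['ret'] = np.log(df['close']).diff()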

Sanity checks before backtest

  1. Plot the cleaned series — does it look like the chart you know?
  2. Check annualized return of buy-and-hold — does it match published figures?
  3. Verify no future-looking features slipped in (run a "should be zero" lag check)
  4. Confirm universe membership matches reality at each timestamp
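Two of these checks can be sketched in a few lines. The helpers below are hypothetical: `annualized_return` gives a buy-and-hold figure to compare against published numbers, and `is_point_in_time` is one possible reading of the "should be zero" check, namely that a feature's past values should not change at all when later data is removed.

```python
import numpy as np
import pandas as pd


def annualized_return(close: pd.Series, periods_per_year: int = 252) -> float:
    """Geometric annualized return of buy-and-hold over the series."""
    total = close.iloc[-1] / close.iloc[0]
    years = (len(close) - 1) / periods_per_year
    return total ** (1 / years) - 1


def is_point_in_time(close: pd.Series, feature_fn, cut: int) -> bool:
    """True if the first `cut` feature values are unchanged when later data is dropped."""
    full = feature_fn(close).iloc[:cut]
    truncated = feature_fn(close.iloc[:cut])
    return full.equals(truncated)


def backward_mean(s: pd.Series) -> pd.Series:
    return s.rolling(5).mean()  # uses past bars only


def next_close(s: pd.Series) -> pd.Series:
    return s.shift(-1)  # peeks at the next bar


# Synthetic prices, for illustration only.
rng = np.random.default_rng(0)
close = pd.Series(100 * np.exp(np.cumsum(rng.normal(0, 0.01, 300))))

clean = is_point_in_time(close, backward_mean, cut=200)
leaky = is_point_in_time(close, next_close, cut=200)
```

Here `clean` comes out True and `leaky` comes out False. The truncation test only catches features that read future rows; it does not catch a wrong universe or a misaligned join, so the other checks still apply.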

Summary

Free data is the most expensive data you'll ever use. Survivorship bias, bad ticks, and look-ahead leakage quietly inflate every backtest. Invest in clean, point-in-time data and a rigorous cleaning pipeline before you trust a single strategy result — because no signal survives garbage data.

Educational content · Not financial advice · Trade at your own risk