Hypothesis Testing and Sample Significance

A backtest that made money is not proof of edge. It's a hypothesis. The question is whether the result is significant.

Almost any strategy can be tuned until its backtest looks profitable. The job of statistics is to tell you whether that profit could plausibly have come from pure chance. Hypothesis testing is the formal tool for this judgment.

Setting up the test

Null hypothesis (H0): the strategy has zero edge — observed profit is random noise
Alternative (H1): the strategy has a real positive edge
Pick a significance level α (commonly 0.05) — the false-positive rate you'll tolerate

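The test statistic is compared against a t-distribution with n − 1 degrees of freedom, which is the right reference when σ is unknown and estimated from the sample (the difference from the normal distribution matters most when n is small):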

t = (x̄ − μ0) ÷ (s ÷ √n)

Here x̄ is the sample mean return per trade, μ0 is the hypothesized mean (often 0), s is the sample standard deviation, and n is the number of trades.
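As a minimal sketch of the formula above, the t-statistic can be computed with the Python standard library alone. The function name `t_statistic` and the sample returns are hypothetical, chosen only for illustration.

```python
import math
import statistics

def t_statistic(returns, mu0=0.0):
    """One-sample t-statistic for the mean per-trade return against mu0."""
    n = len(returns)
    if n < 2:
        raise ValueError("need at least two trades to estimate s")
    mean = statistics.fmean(returns)
    s = statistics.stdev(returns)  # sample standard deviation (n - 1 denominator)
    if s == 0:
        raise ValueError("returns have zero variance; t is undefined")
    return (mean - mu0) / (s / math.sqrt(n))

# Hypothetical per-trade returns in percent, for illustration only.
trades = [1.2, -0.8, 0.5, 2.1, -1.5, 0.9, -0.3, 1.7, -0.6, 0.8]
t = t_statistic(trades)
print(f"n = {len(trades)}, t = {t:.3f}")
```

Note that the mean of these trades is positive, yet the t-statistic is modest because ten trades give a wide standard error.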

The p-value

The p-value is the probability of seeing a result at least this extreme if H0 were true.

p < α (here 0.05) → reject H0: the data are hard to explain by chance alone
p ≥ α → fail to reject H0: the result could be noise
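To make the decision rule concrete, here is one possible way to turn a t-statistic into a one-sided p-value (matching H1: positive edge) without third-party libraries. It integrates the t-density numerically with Simpson's rule, which is a stand-in chosen for self-containment; with SciPy installed, `scipy.stats.t.sf(t, df)` would be the usual route. The function names and the example inputs are hypothetical.

```python
import math

def t_pdf(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))

def one_sided_p_value(t, df, steps=2000):
    """P(T >= t) under H0, via Simpson's rule on the density from 0 to |t|."""
    a = abs(t)
    h = a / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3      # P(0 <= T <= |t|)
    upper_tail = 0.5 - area   # P(T >= |t|), using symmetry around 0
    return upper_tail if t >= 0 else 1.0 - upper_tail

# Hypothetical inputs: t = 1.09 from 10 trades, so df = n - 1 = 9.
alpha = 0.05
p = one_sided_p_value(1.09, df=9)
print(f"p = {p:.3f} ->", "reject H0" if p < alpha else "fail to reject H0")
```

The test is one-sided because the alternative is specifically a positive edge; a two-sided test would double the tail probability.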

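But beware: a low p-value is not a guarantee. Test 20 strategies with no edge (or one such strategy on 20 independent markets) at α = 0.05 and you should expect about one to appear "significant" purely by chance; the probability that at least one does is 1 − 0.95^20 ≈ 64%.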
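The arithmetic behind this warning is short enough to check directly. The sketch below assumes the tests are independent, which real markets often are not, and the function name is hypothetical.

```python
def prob_at_least_one_false_positive(num_tests, alpha=0.05):
    """Chance that at least one of num_tests independent tests of a true H0
    comes out 'significant' at level alpha."""
    return 1.0 - (1.0 - alpha) ** num_tests

for m in (1, 5, 20, 50):
    expected = m * 0.05  # expected number of false positives
    chance = prob_at_least_one_false_positive(m)
    print(f"{m:>2} tests: expect {expected:.2f} false positives, "
          f"P(at least one) = {chance:.1%}")
```

With correlated markets the exact numbers change, but the direction does not: the more tests you run, the more likely a spurious winner shows up.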

Sample size matters most

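With a small sample, even a strong edge can fail to reach significance because the standard error s ÷ √n is large. With a huge sample, the standard error shrinks so much that an edge too small to cover costs can still appear "significant": statistical significance is not the same as economic significance. For trading: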

Fewer than 30 trades → almost meaningless; treat with deep skepticism
100–300 trades → reasonable sample for most setups
Thousands of trades → watch out for tiny effects disguised as significance
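The thresholds above are rules of thumb. A rough way to see where they come from is to ask how many trades a one-sided test needs to detect a given edge. The sketch below uses a normal approximation rather than an exact t-based power calculation, and the 80% power target, the function name, and the example edge sizes are all assumptions made for illustration.

```python
import math
from statistics import NormalDist

def trades_needed(edge, stdev, alpha=0.05, power=0.80):
    """Approximate number of trades for a one-sided test at level alpha to
    detect a true mean per-trade return of `edge` with the given power.
    Normal approximation: n = ((z_alpha + z_power) * stdev / edge) ** 2."""
    z_alpha = NormalDist().inv_cdf(1 - alpha)
    z_power = NormalDist().inv_cdf(power)
    return math.ceil(((z_alpha + z_power) * stdev / edge) ** 2)

# Edge expressed as a fraction of the per-trade standard deviation.
for ratio in (0.5, 0.2, 0.1, 0.02):
    n = trades_needed(edge=ratio, stdev=1.0)
    print(f"edge = {ratio:.2f} x stdev -> about {n} trades")
```

Read in reverse, the same relationship explains the last line of the list: with thousands of trades, even an edge that is a few hundredths of a standard deviation can clear the significance bar, so check whether it survives costs before trusting it.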

Common trading mistakes

Multiple comparisons: testing 50 variations and reporting the best one as if you'd tested one; correct with Bonferroni or false-discovery-rate adjustments
Survivorship bias in the test: only testing symbols that still trade today
Data snooping: the more parameters you tune, the more likely significance is an illusion
Ignoring dependence: overlapping returns are positively correlated, which violates the independence assumption, understates the standard error, and inflates the t-statistic
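For the multiple-comparisons item, the two corrections named above can be sketched in a few lines. This is an illustrative implementation with hypothetical function names and made-up p-values; Benjamini-Hochberg is used here as one common false-discovery-rate procedure, and it assumes independent (or positively dependent) tests.

```python
def bonferroni(p_values, alpha=0.05):
    """Flag p-values that survive the Bonferroni threshold alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

def benjamini_hochberg(p_values, alpha=0.05):
    """Flag p-values that survive the Benjamini-Hochberg FDR procedure."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    # Find the largest rank k whose p-value is at most alpha * k / m.
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= alpha * rank / m:
            cutoff_rank = rank
    keep = [False] * m
    # Everything ranked at or below the cutoff is declared significant.
    for rank, i in enumerate(order, start=1):
        if rank <= cutoff_rank:
            keep[i] = True
    return keep

# Made-up p-values from ten variations of one strategy.
p_vals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205, 0.212, 0.216]
print("uncorrected:", sum(p < 0.05 for p in p_vals))
print("Bonferroni: ", sum(bonferroni(p_vals)))
print("BH (FDR):   ", sum(benjamini_hochberg(p_vals)))
```

Bonferroni controls the chance of any false positive and is the stricter of the two; the FDR approach tolerates a controlled share of false discoveries in exchange for more power. Neither helps unless you count every variation you actually tried.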

Practical takeaway

A profitable backtest is a hypothesis, not a conclusion. Compute the t-statistic and p-value. Require p < 0.05 after correcting for the number of variations you tested. Then — and only then — trade the strategy with small size and watch whether live results track the backtest.

Summary

Hypothesis testing gives you a disciplined way to ask, "Could this be luck?" It's not perfect, but it's far better than trusting a backtest on faith. Treat every strategy as a hypothesis to be tested, and let statistics — not hope — decide when an edge is real enough to trade.