Hypothesis Testing and Sample Size for Trading

A backtest showing a 55% win rate over 30 trades proves nothing. The same rate over 1,000 trades is strong evidence. Hypothesis testing and sample size determine whether an observed edge is real or noise.

Setting Up the Test

Frame every edge question as a null hypothesis:

H0: the system's true win rate equals 50% (no edge).
H1: the win rate differs from 50% (edge exists).

Compute the z-statistic:

z = (p̂ − p0) / sqrt(p0(1−p0)/n)

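where p̂ is the observed win rate, p0 is 0.5, and n is the trade count. Because H1 is two-sided, an absolute value |z| above 1.96 rejects H0 at the 5% level. The formula relies on the normal approximation to the binomial, which is reasonable once n is at least a few dozen trades.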

A 55% win rate over 30 trades gives z ≈ 0.55 (not significant); over 400 trades, z ≈ 2.0 (significant). Trade count is the lever.
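The arithmetic above can be reproduced with a short sketch. The function name win_rate_z_test is illustrative, not from any library, and the code assumes independent trades and the normal approximation to the binomial.

```python
from math import sqrt
from statistics import NormalDist


def win_rate_z_test(p_hat, n, p0=0.5):
    """Two-sided one-sample z-test for a win rate (normal approximation).

    p_hat: observed win rate, n: number of trades, p0: win rate under H0.
    Returns (z, two-sided p-value).
    """
    standard_error = sqrt(p0 * (1 - p0) / n)
    z = (p_hat - p0) / standard_error
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Same 55% win rate, different trade counts.
for trades in (30, 400, 1000):
    z, p = win_rate_z_test(0.55, trades)
    verdict = "reject H0" if p < 0.05 else "cannot reject H0"
    print(f"n={trades:5d}  z={z:5.2f}  p={p:.4f}  {verdict}")
```

The observed win rate is identical in every row; only n changes, and with it the standard error in the denominator.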

Minimum Sample Size

Before collecting data, compute the sample size needed to detect a meaningful edge. Use power analysis with:

Significance level (α): 0.05, the false positive rate you accept.
Power (1−β): 0.80, the probability of detecting a real edge. Below 0.80 you miss real edges too often.
Effect size: the smallest edge worth detecting. For win rate, an effect of 0.05 (55% vs 50%) is a reasonable floor.

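For detecting a 55% win rate against a 50% null at α = 0.05 (two-sided) and power = 0.80, the required sample is roughly 780 trades. A smaller sample can still reach significance on a lucky run, but it does so unreliably: at 400 trades, a system with a true 55% win rate clears z = 1.96 only about half the time.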
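A minimal sketch of that calculation, using the standard normal-approximation formula for a one-sample proportion test. The function name required_trades is hypothetical, and the formula is an assumption about how the figure was derived, although it lands on roughly the same number.

```python
from math import ceil, sqrt
from statistics import NormalDist


def required_trades(p1, p0=0.5, alpha=0.05, power=0.80):
    """Trades needed to detect a true win rate p1 against a null of p0.

    Two-sided test, normal approximation:
    n = ((z_alpha * sqrt(p0*q0) + z_beta * sqrt(p1*q1)) / (p1 - p0)) ** 2
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # about 0.84 for power = 0.80
    numerator = z_alpha * sqrt(p0 * (1 - p0)) + z_beta * sqrt(p1 * (1 - p1))
    return ceil((numerator / (p1 - p0)) ** 2)


print(required_trades(0.55))  # the 55% vs 50% case from the text
print(required_trades(0.60))  # a larger edge needs far fewer trades
```

The required count scales with the inverse square of the effect size, so halving the edge you want to detect roughly quadruples the trades you need.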

Type I and Type II Errors in Trading

Type I (false positive): concluding an edge exists when it does not, trading a worthless system and losing to costs. Controlled by α.
Type II (false negative): missing a real edge and discarding a profitable system. Controlled by power.

Retail traders often over-weight Type I errors (fear of false edges) and abandon real edges too early. Set α and power explicitly rather than by feel; a common balance is α = 0.05, power = 0.80, effect size = 0.05.
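To see how the Type II risk depends on trade count, the sketch below approximates power for a given n. The name approx_power is illustrative. The formula uses the normal approximation and ignores the negligible chance of rejecting in the wrong tail.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(p1, n, p0=0.5, alpha=0.05):
    """Approximate probability that a two-sided z-test rejects H0
    when the true win rate is p1 and n trades are observed."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    # Distance between the expected signal and the rejection cutoff,
    # measured in units of the sampling spread under the true win rate.
    shift = abs(p1 - p0) * sqrt(n) - z_alpha * sqrt(p0 * (1 - p0))
    return NormalDist().cdf(shift / sqrt(p1 * (1 - p1)))


for trades in (30, 100, 400, 783):
    power = approx_power(0.55, trades)
    print(f"n={trades:4d}  power={power:.2f}  Type II rate={1 - power:.2f}")
```

At small trade counts the Type II rate dominates: a real 55% system is usually indistinguishable from a coin flip, which is the statistical reason real edges get discarded early.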

Multiple Testing Correction

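Every parameter variation you test inflates the chance of a false positive. If none of the variations has a real edge, testing 20 of them at α = 0.05 produces, on average, one false positive by chance, and with independent tests the probability of at least one is about 64%. Apply a Bonferroni correction: divide α by the number of tests, so 20 tests require p below 0.0025.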

This is why wide grid optimization produces "significant" results that fail live: the significance is a multiple-testing artifact.
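The inflation and the correction can both be sketched in a few lines. The function names are illustrative, the family-wise rate assumes the tests are independent, and the p-values in the usage example are made up purely to show the filtering.

```python
def family_wise_error_rate(alpha, n_tests):
    """Chance of at least one false positive across n independent tests
    when every null hypothesis is true."""
    return 1 - (1 - alpha) ** n_tests


def bonferroni_survivors(p_values, n_tests, alpha=0.05):
    """Indices of p-values that stay significant after dividing alpha
    by the total number of variants tried (not just those reported)."""
    threshold = alpha / n_tests
    return [i for i, p in enumerate(p_values) if p < threshold]


print(f"Uncorrected FWER for 20 tests: {family_wise_error_rate(0.05, 20):.2f}")
print(f"Bonferroni threshold for 20 tests: {0.05 / 20:.4f}")

# Hypothetical p-values from five of 20 tested variants (illustration only).
example_p_values = [0.030, 0.0004, 0.20, 0.049, 0.0021]
print("Survivors:", bonferroni_survivors(example_p_values, n_tests=20))
```

Note that n_tests should count every variant you tried, including the ones you discarded before looking at p-values. Bonferroni is conservative when variants are highly correlated, as neighbouring grid points usually are.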

Practical Workflow

State the effect size you care about before testing.
Compute the required sample size for that effect.
Collect at least that many trades (backtest or forward).
Run the test with multiple-testing correction if you tried many variants.
Only declare an edge real if it survives the sample size and correction.
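The five steps can be strung together as one check. This is a sketch under the same assumptions as the rest of the article (win rate as the edge metric, independent trades, normal approximation, Bonferroni correction); the function name evaluate_edge and its return strings are hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist

NORMAL = NormalDist()


def evaluate_edge(wins, trades, n_variants, min_edge=0.05,
                  p0=0.5, alpha=0.05, power=0.80):
    """Apply the workflow: effect size -> sample size -> test -> correction."""
    # Steps 1-2: smallest edge worth detecting and the trades it requires.
    p1 = p0 + min_edge
    numerator = (NORMAL.inv_cdf(1 - alpha / 2) * sqrt(p0 * (1 - p0))
                 + NORMAL.inv_cdf(power) * sqrt(p1 * (1 - p1)))
    needed = ceil((numerator / min_edge) ** 2)

    # Step 3: refuse to judge an undersized sample.
    if trades < needed:
        return f"inconclusive: need {needed} trades, have {trades}"

    # Step 4: two-sided z-test against a Bonferroni-corrected threshold.
    z = (wins / trades - p0) / sqrt(p0 * (1 - p0) / trades)
    p_value = 2 * (1 - NORMAL.cdf(abs(z)))
    threshold = alpha / n_variants

    # Step 5: an edge must be significant and in the profitable direction.
    if p_value < threshold and z > 0:
        return "edge survives"
    return "no evidence of an edge"


print(evaluate_edge(wins=165, trades=300, n_variants=1))
print(evaluate_edge(wins=550, trades=1000, n_variants=20))
print(evaluate_edge(wins=550, trades=1000, n_variants=100))
```

The three calls use the same 55% win rate. What changes the verdict is the sample size and the number of variants that were tried to find it.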

The Honest Outcome

Most "edges" found in 50-100 trades fail this test, and rightly so: most apparent edges are noise. Systems that survive proper hypothesis testing are rare but far more likely to persist, and trading one beats trading an edge that merely looked good on a chart.