System Robustness Testing with Monte Carlo

A single backtest gives one drawdown number from one historical trade sequence. Monte Carlo reveals the distribution of possible outcomes. Robustness testing is distinct from risk estimation: it asks whether the system survives, not just how bad a bad run gets.

Three Robustness Tests

1. Trade-Order Shuffle

Shuffle the order of the historical trades 1,000 times, keeping the same set of trades each time, and recompute drawdown for each shuffled sequence.

What it tests: whether your backtest drawdown is a lucky trade ordering or representative.

Threshold: if the 95th-percentile shuffled drawdown exceeds 2x your backtest drawdown, the system is fragile to sequencing. The historical order flattered you.
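
The shuffle test above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not a definitive implementation: it assumes trades are expressed as R-multiples that add (fixed risk per trade, no compounding), and the names max_drawdown, percentile, and shuffle_test are hypothetical helpers introduced here.

```python
import math
import random

def max_drawdown(r_multiples):
    """Largest peak-to-trough drop of the cumulative R curve, in R units."""
    equity = peak = worst = 0.0
    for r in r_multiples:
        equity += r
        peak = max(peak, equity)
        worst = max(worst, peak - equity)
    return worst

def percentile(values, pct):
    """Nearest-rank percentile, pct in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def shuffle_test(r_multiples, n_iter=1000, seed=0):
    """Compare backtest drawdown with the 95th-percentile shuffled drawdown."""
    rng = random.Random(seed)
    trades = list(r_multiples)
    backtest_dd = max_drawdown(trades)  # drawdown in the historical order
    shuffled = []
    for _ in range(n_iter):
        rng.shuffle(trades)  # same trades, new order
        shuffled.append(max_drawdown(trades))
    p95_dd = percentile(shuffled, 95)
    return {
        "backtest_dd": backtest_dd,
        "p95_dd": p95_dd,
        # Threshold from the text: fragile if p95 exceeds 2x the backtest drawdown.
        "fragile": p95_dd > 2.0 * backtest_dd,
    }
```

Calling shuffle_test(trades) on a list of R-multiples returns both drawdown figures and the fragility flag. If position size compounds with equity, the drawdown would need to be computed on a multiplicative equity curve instead.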

2. Return Resampling (Bootstrap)

Sample trades with replacement from the historical distribution to construct 1,000 synthetic equity curves of equal length. Recompute Sharpe and final equity for each.

What it tests: the stability of the edge given the observed trade distribution.

Threshold: if more than 5% of resampled curves end with a net loss (final equity below the starting level), the edge is too thin to trust. The system is one bad sample away from ruin.
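
A rough sketch of the bootstrap test follows, again using only the standard library. It assumes additive R-multiples, reads "below zero" as a negative sum of R over the resampled curve, and uses a per-trade, unannualized Sharpe (mean divided by sample standard deviation). The function name bootstrap_test and its return keys are illustrative, not taken from any library.

```python
import random
import statistics

def bootstrap_test(r_multiples, n_iter=1000, seed=0):
    """Resample trades with replacement and summarize the synthetic curves."""
    trades = list(r_multiples)
    n = len(trades)
    if n < 2:
        raise ValueError("need at least two trades to resample")
    rng = random.Random(seed)
    finals, sharpes = [], []
    for _ in range(n_iter):
        sample = rng.choices(trades, k=n)  # sampling with replacement, equal length
        finals.append(sum(sample))         # final equity in R, starting from zero
        sd = statistics.stdev(sample)
        if sd > 0:                         # skip degenerate samples with no variation
            sharpes.append(statistics.mean(sample) / sd)
    frac_negative = sum(f < 0 for f in finals) / n_iter
    return {
        "frac_negative": frac_negative,
        "median_sharpe": statistics.median(sharpes) if sharpes else None,
        # Threshold from the text: more than 5% of curves ending negative.
        "thin_edge": frac_negative > 0.05,
    }
```

The fraction of negative curves is the figure compared against the 5% threshold; the median Sharpe is included only as a convenient summary of the resampled Sharpe distribution.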

3. Parameter Perturbation

Shift every parameter by 10% and 20% in both directions. Recompute Sharpe for each perturbation.

What it tests: whether the edge depends on exact parameter values.

Threshold: if a 10% perturbation drops Sharpe by more than 25%, the parameters are fit to noise. Reject or simplify.
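
One way to automate the perturbation sweep is sketched below. The text does not say whether parameters are shifted one at a time or all together; this sketch assumes one at a time. The callable run_backtest is a hypothetical stand-in for your own testing framework: it is assumed to take a dict of numeric parameters and return a Sharpe ratio.

```python
def perturbation_test(run_backtest, params, shifts=(0.10, 0.20),
                      check_shift=0.10, max_drop=0.25):
    """Shift each parameter up and down by each fraction and record Sharpe.

    run_backtest: hypothetical callable, dict of params -> Sharpe ratio.
    Returns (base_sharpe, rows, fragile), where each row is
    (name, signed_shift, sharpe, relative_drop).
    """
    base = run_backtest(params)
    if base <= 0:
        raise ValueError("baseline Sharpe must be positive to measure a drop")
    rows = []
    fragile = False
    for name, value in params.items():
        for shift in shifts:
            for sign in (-1, 1):
                trial = dict(params)
                # Integer parameters (e.g. lookbacks) may need rounding here.
                trial[name] = value * (1 + sign * shift)
                sharpe = run_backtest(trial)
                drop = (base - sharpe) / base
                rows.append((name, sign * shift, sharpe, drop))
                # Threshold from the text: a 10% shift costing more than 25% of Sharpe.
                if shift <= check_shift and drop > max_drop:
                    fragile = True
    return base, rows, fragile
```

The rows list doubles as a small sensitivity table: sorting it by relative drop shows which parameter the edge leans on most.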

How to Run Them

Trade-order shuffle and bootstrap require only your trade list (entry, exit, R-multiple). Any spreadsheet or Python script handles 1,000 iterations in seconds.
Parameter perturbation requires re-running the backtest, so automate it in your testing framework.
Always run all three. Passing one proves nothing; robustness is a conjunction.

Interpreting Results Together

A robust system clears all three with margin, not merely at the rejection thresholds given for each test:

95th-percentile shuffled drawdown within 1.5x of backtest drawdown.
Fewer than 2% of bootstrapped curves negative.
Sharpe stable under 20% parameter perturbation.

A system that passes shuffle and bootstrap but fails perturbation is order-robust but parameter-fragile: likely a real edge with overfit parameters. Re-optimize toward a flatter region of the parameter surface, where neighboring values perform similarly. A system that fails shuffle is not a system; it is a lucky sequence.
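
The combined reading can be written down as a small decision function. This is only a sketch of the logic in this section: the function name classify and its labels are illustrative, and because the text does not quantify "stable" under a 20% perturbation, the 25% maximum Sharpe drop used here is an assumption borrowed from the perturbation threshold.

```python
def classify(dd_ratio, frac_negative, sharpe_drop_20pct,
             max_dd_ratio=1.5, max_frac_negative=0.02, max_sharpe_drop=0.25):
    """Combine the three test results into one verdict.

    dd_ratio: 95th-percentile shuffled drawdown / backtest drawdown.
    frac_negative: fraction of bootstrapped curves ending negative.
    sharpe_drop_20pct: worst relative Sharpe drop under a 20% shift.
    max_sharpe_drop is an assumed cutoff, not one stated in the text.
    """
    shuffle_ok = dd_ratio <= max_dd_ratio
    bootstrap_ok = frac_negative < max_frac_negative
    perturb_ok = sharpe_drop_20pct <= max_sharpe_drop

    if not shuffle_ok:
        return "lucky sequence"      # fails shuffle: not a system
    if bootstrap_ok and perturb_ok:
        return "robust"              # passes all three
    if bootstrap_ok:
        return "parameter-fragile"   # order-robust, overfit parameters
    return "thin edge"               # too many bootstrapped curves negative
```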

Common Mistakes

Running a single simulated path, or quoting only the mean of many. Run the full set of iterations and use the 95th percentile, not the mean, for risk decisions.
Resampling individual trades as if they were independent when they are not (e.g., a trend system with serial correlation). Shuffling or resampling single trades breaks that dependence and can understate losing streaks. For correlated trade streams, block-bootstrap with a block length matching the autocorrelation decay.
Treating Monte Carlo as a pass/fail for deployment. It is one input alongside walk-forward and live forward testing.
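
For the correlated case mentioned in the list above, a simple moving-block bootstrap might look like the following. It is a sketch under stated assumptions: block_len is a value you choose yourself (for example from where the trade autocorrelation decays toward zero), and block_bootstrap_sample is a hypothetical helper name.

```python
import random

def block_bootstrap_sample(trades, block_len, rng):
    """Build one synthetic sequence from randomly chosen contiguous blocks.

    Trades inside a block keep their historical order, so short-range
    serial correlation is preserved; dependence across block joins is not.
    """
    n = len(trades)
    if not 1 <= block_len <= n:
        raise ValueError("block_len must be between 1 and len(trades)")
    sample = []
    while len(sample) < n:
        start = rng.randrange(n - block_len + 1)  # random block start
        sample.extend(trades[start:start + block_len])
    return sample[:n]  # trim the final block to the original length

# Example: 1,000 synthetic sequences with blocks of 5 trades.
rng = random.Random(0)
history = [1.5, 2.0, -1.0, -1.0, -1.0, 0.5, 3.0, -1.0, 1.0, 2.5]
samples = [block_bootstrap_sample(history, 5, rng) for _ in range(1000)]
```

Each synthetic sequence can then be passed to the same drawdown or Sharpe calculation used for the plain bootstrap. With block_len set to 1 this reduces to ordinary resampling with replacement.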

The Honest Output

Monte Carlo does not predict the future. It widens your understanding of what the past could have looked like. A system whose past could have been terrible under reordering is a system whose future can be terrible. Use the tests to reject fragile systems before they cost you capital.