Algo Strategy Monitoring and Exception Handling in Production

A live algo is judged by what it does when something breaks, not by what it does when everything works. Monitoring is not a dashboard. It is a set of automatic responses to specific failure conditions, each with a threshold and a defined action.

The four monitoring layers

1. Infrastructure. CPU, memory, disk, network, clock drift. Alert if NTP drift exceeds 50ms, memory usage exceeds 80%, or the heartbeat from any stage stops for 30 seconds. Action: page the operator and freeze new orders.

2. Connectivity. Broker websocket, REST latency, order ack time. Alert if the websocket is silent for 10 seconds, REST round-trip time exceeds 2× the rolling median, or order acknowledgment takes longer than 5 seconds. Action: cancel resting orders and halt submissions.

3. Strategy behavior. Rolling Sharpe, slippage vs expected, fill rate, rejection rate. Alert if the rolling 50-trade Sharpe drops below 0, realized slippage exceeds 2× the backtested figure, or the rejection rate exceeds 5%. Action: reduce position size by 50% pending review.

4. PnL. Realized daily PnL and drawdown from peak. Alert if daily loss hits 2% (warning), 3% (halt new entries), or 5% (flatten all). The 3% and 5% levels are hard limits that fire automatically, not soft warnings left to operator judgment.
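
The thresholds above reduce to small, testable checks. What follows is a minimal sketch in Python, not a definitive implementation. It assumes daily PnL is expressed in percent with losses negative (the base, for example account equity, is left to the caller), that REST latencies are kept in a caller-maintained rolling window, and that Sharpe is computed per trade as mean over sample standard deviation with no annualization. The names Action, pnl_action, rest_latency_breached, and rolling_sharpe are illustrative.

```python
from enum import Enum
from statistics import mean, median, stdev
from typing import List, Optional


class Action(Enum):
    NONE = "none"
    WARN = "warn"
    HALT_NEW_ENTRIES = "halt_new_entries"
    FLATTEN_ALL = "flatten_all"


def pnl_action(daily_pnl_pct: float) -> Action:
    """Map the day's PnL in percent (losses negative) to the PnL-layer action."""
    loss = -daily_pnl_pct
    if loss >= 5.0:
        return Action.FLATTEN_ALL
    if loss >= 3.0:
        return Action.HALT_NEW_ENTRIES
    if loss >= 2.0:
        return Action.WARN
    return Action.NONE


def rest_latency_breached(latest_ms: float, recent_ms: List[float]) -> bool:
    """True if the latest REST round-trip exceeds 2x the rolling median."""
    if not recent_ms:
        return False  # no baseline yet, so nothing to compare against
    return latest_ms > 2.0 * median(recent_ms)


def rolling_sharpe(trade_returns: List[float], window: int = 50) -> Optional[float]:
    """Per-trade Sharpe over the last `window` trades (assumed definition:
    mean / sample stdev, not annualized). None until the window is full
    or if the returns have zero variance."""
    recent = trade_returns[-window:]
    if len(recent) < window:
        return None
    sd = stdev(recent)
    if sd == 0:
        return None
    return mean(recent) / sd
```

Checking the most severe level first keeps the ladder unambiguous: a 5.2% loss returns FLATTEN_ALL and never falls through to a lesser action.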

Error classification

Every exception should map to one of four classes:

Transient (network blip, rate limit): retry with backoff, max 3 attempts in 30 seconds.
Data quality (gap, stale quote, bad tick): skip the signal, log, continue. Never trade on corrupt data.
Order rejection (insufficient margin, precision): log the broker message, fix the order, do not retry blindly.
Unknown: halt. An error you have not classified is an error you cannot safely recover from.
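
The four classes and the transient retry rule can be sketched as follows. This is an illustration under stated assumptions: StaleQuoteError and OrderRejectedError are hypothetical exception types (a real broker SDK defines its own), the mapping from built-in exceptions to classes is an example, and the exponential backoff starting at 1 second is one possible schedule that fits 3 attempts inside 30 seconds.

```python
import time
from enum import Enum


class ErrorClass(Enum):
    TRANSIENT = "transient"
    DATA_QUALITY = "data_quality"
    ORDER_REJECTION = "order_rejection"
    UNKNOWN = "unknown"


# Hypothetical exception types, defined here only for illustration.
class StaleQuoteError(Exception):
    pass


class OrderRejectedError(Exception):
    pass


def classify(exc: BaseException) -> ErrorClass:
    """Map an exception to one of the four classes; anything else is UNKNOWN."""
    if isinstance(exc, (ConnectionError, TimeoutError)):
        return ErrorClass.TRANSIENT
    if isinstance(exc, StaleQuoteError):
        return ErrorClass.DATA_QUALITY
    if isinstance(exc, OrderRejectedError):
        return ErrorClass.ORDER_REJECTION
    return ErrorClass.UNKNOWN


def retry_transient(fn, max_attempts=3, budget_s=30.0, base_delay_s=1.0,
                    sleep=time.sleep, clock=time.monotonic):
    """Call fn(), retrying only transient errors with exponential backoff.

    Gives up (re-raising the last error) after max_attempts, or when the
    next wait would push total elapsed time past budget_s.
    """
    start = clock()
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception as exc:
            if classify(exc) is not ErrorClass.TRANSIENT:
                raise  # non-transient errors are never retried here
            delay = base_delay_s * 2 ** (attempt - 1)
            if attempt == max_attempts or clock() - start + delay > budget_s:
                raise
            sleep(delay)
```

The caller decides what to do with the other classes: skip the signal on DATA_QUALITY, log and repair on ORDER_REJECTION, and halt on UNKNOWN.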

The kill switch hierarchy

Three levels, increasing severity:

Soft kill: stop new entries, let existing positions run to target. Use for strategy-level anomalies.
Hard kill: cancel all open orders, do not flatten. Use for connectivity or data problems.
Flat-all: cancel orders and market-out every position. Use for infrastructure failure or PnL breach. Bind this to a watchdog that fires if the main process is unresponsive for 30 seconds. The watchdog must run outside the main process, since a frozen process cannot trip its own timer. A frozen bot is more dangerous than no bot.
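
One way to express the three levels and the 30-second watchdog is sketched below. FakeBroker is a hypothetical stand-in that only records calls; its method names cancel_all_orders and flatten_all_positions are assumptions, not any real broker API. The sketch shows only the escalation and timing logic; in production the watchdog check would run in a separate process fed by heartbeats from the main one.

```python
import time
from enum import IntEnum


class KillLevel(IntEnum):
    SOFT = 1      # stop new entries, keep orders and positions
    HARD = 2      # also cancel all open orders, keep positions
    FLAT_ALL = 3  # also market-out every position


class FakeBroker:
    """Hypothetical broker client that records what it was asked to do."""

    def __init__(self):
        self.calls = []

    def cancel_all_orders(self):
        self.calls.append("cancel_all_orders")

    def flatten_all_positions(self):
        self.calls.append("flatten_all_positions")


class KillSwitch:
    def __init__(self, broker):
        self.broker = broker
        self.level = None
        self.entries_enabled = True

    def trigger(self, level: KillLevel) -> None:
        # Only escalate; a lower or equal level never undoes a higher one.
        if self.level is not None and level <= self.level:
            return
        self.level = level
        self.entries_enabled = False
        if level >= KillLevel.HARD:
            self.broker.cancel_all_orders()
        if level is KillLevel.FLAT_ALL:
            self.broker.flatten_all_positions()


class Watchdog:
    """Fires FLAT_ALL if no heartbeat arrives within timeout_s."""

    def __init__(self, kill_switch, timeout_s=30.0, clock=time.monotonic):
        self.kill_switch = kill_switch
        self.timeout_s = timeout_s
        self.clock = clock
        self.last_beat = clock()

    def beat(self) -> None:
        self.last_beat = self.clock()

    def check(self) -> bool:
        if self.clock() - self.last_beat > self.timeout_s:
            self.kill_switch.trigger(KillLevel.FLAT_ALL)
            return True
        return False
```

Making the levels an ordered enum means that resuming trading requires an explicit reset by the operator, which matches the no-auto-resume rule.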

Recovery procedure

After any halt, do not auto-resume. Run a manual checklist:

Reconcile broker positions against internal state.
Confirm the cause is fixed, not just gone.
Restart in paper mode for 30 minutes.
Scale back to 25% size for the first session.
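
The first checklist item, reconciling broker positions against internal state, is the easiest to automate. A minimal sketch, assuming both sides can be reduced to a mapping from symbol to signed quantity (the function name, the tolerance parameter, and the symbols in the usage example are illustrative):

```python
from typing import Dict, Tuple


def reconcile(broker_positions: Dict[str, float],
              internal_positions: Dict[str, float],
              tolerance: float = 1e-9) -> Dict[str, Tuple[float, float]]:
    """Return {symbol: (broker_qty, internal_qty)} for every disagreement.

    A symbol missing on one side is treated as a quantity of 0 there.
    An empty result means the two views match.
    """
    mismatches = {}
    for symbol in sorted(set(broker_positions) | set(internal_positions)):
        broker_qty = broker_positions.get(symbol, 0.0)
        internal_qty = internal_positions.get(symbol, 0.0)
        if abs(broker_qty - internal_qty) > tolerance:
            mismatches[symbol] = (broker_qty, internal_qty)
    return mismatches


# Example: the broker holds a short the bot does not know about,
# and the bot believes in a long the broker does not show.
diff = reconcile({"AAA": 100, "BBB": -50}, {"AAA": 100, "CCC": 10})
print(diff)
```

Any non-empty result should block the restart until a human has explained the difference.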

What to log

Every order: timestamp, intended action, broker ack, fill price, slippage vs signal price. Every exception: type, context, action taken, resolution. Without this log, post-mortems are guesswork and the same bug recurs next month.
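
The fields listed above fit naturally into one structured record per event. The sketch below writes each record as a JSON line; the class names, the side field, and the sign convention for slippage (positive means a worse fill than the signal price) are assumptions made for illustration.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


def utc_now_iso() -> str:
    """Current UTC time as an ISO 8601 string."""
    return datetime.now(timezone.utc).isoformat()


@dataclass
class OrderLogRecord:
    timestamp: str
    intended_action: str
    broker_ack: str
    side: str            # "buy" or "sell" (assumed convention)
    signal_price: float
    fill_price: float

    @property
    def slippage(self) -> float:
        """Signed slippage vs signal price; positive means a worse fill."""
        diff = self.fill_price - self.signal_price
        return diff if self.side == "buy" else -diff

    def to_json(self) -> str:
        row = asdict(self)
        row["slippage"] = self.slippage
        return json.dumps(row, sort_keys=True)


@dataclass
class ExceptionLogRecord:
    timestamp: str
    type: str
    context: str
    action_taken: str
    resolution: str = ""  # filled in once the cause is confirmed fixed

    def to_json(self) -> str:
        return json.dumps(asdict(self), sort_keys=True)


record = OrderLogRecord(
    timestamp=utc_now_iso(),
    intended_action="enter_long",
    broker_ack="accepted",
    side="buy",
    signal_price=100.0,
    fill_price=100.05,
)
print(record.to_json())
```

One JSON object per line keeps the log greppable and easy to load into a dataframe during a post-mortem.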

A monitored algo fails safely. An unmonitored algo fails catastrophically. The 50 lines of alerting code are the cheapest insurance in the system.