Algo Strategy Monitoring and Exception Handling in Production
Production algo monitoring and exception handling covers alert thresholds, error classification, kill switches, and recovery procedures for live systems.
A live algo is judged by what it does when something breaks, not by what it does when everything works. Monitoring is not a dashboard — it is a set of automatic responses to specific failure conditions, each with a threshold and a defined action.
The four monitoring layers
1. Infrastructure. CPU, memory, disk, network, clock drift. Alert if NTP drift exceeds 50ms, memory usage exceeds 80%, or the heartbeat from any stage stops for 30 seconds. Action: page the operator and freeze new orders.
2. Connectivity. Broker websocket, REST latency, order ack time. Alert if websocket silent for 10 seconds, REST round-trip exceeds 2× the rolling median, or order acknowledgment exceeds 5 seconds. Action: cancel resting orders and halt submissions.
3. Strategy behavior. Rolling Sharpe, slippage vs expected, fill rate, rejection rate. Alert if rolling 50-trade Sharpe drops below 0, realized slippage exceeds 2× backtested, or rejection rate exceeds 5%. Action: reduce position size by 50% pending review.
4. PnL. Realized daily PnL and drawdown from peak. Alert if daily loss hits 2% (warning), 3% (halt new entries), or 5% (flatten all). The 3% and 5% levels are hard limits enforced automatically; only the 2% level is advisory.
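The four layers above can be expressed as one threshold check that maps a metrics sample to actions. The following is a minimal sketch, not a definitive implementation: the `Snapshot` fields, the `evaluate` function, and the action names are all hypothetical, and only the numeric thresholds come from the text.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Snapshot:
    """One metrics sample. All field names are illustrative assumptions."""
    ntp_drift_ms: float
    memory_pct: float
    heartbeat_age_s: float      # seconds since the oldest stage heartbeat
    ws_silence_s: float         # seconds since the last websocket message
    rest_rtt_ms: float
    rest_rtt_median_ms: float   # rolling median of REST round-trip time
    order_ack_s: float
    rolling_sharpe_50: float    # Sharpe over the last 50 trades
    slippage_ratio: float       # realized slippage / backtested slippage
    rejection_rate: float       # fraction of orders rejected, 0..1
    daily_pnl_pct: float        # negative values are losses


def evaluate(s: Snapshot) -> List[str]:
    """Return the actions triggered by this sample, one per layer at most."""
    actions = []
    # Layer 1: infrastructure
    if s.ntp_drift_ms > 50 or s.memory_pct > 80 or s.heartbeat_age_s >= 30:
        actions.append("page_operator_and_freeze_new_orders")
    # Layer 2: connectivity
    if (s.ws_silence_s >= 10
            or s.rest_rtt_ms > 2 * s.rest_rtt_median_ms
            or s.order_ack_s > 5):
        actions.append("cancel_resting_orders_and_halt_submissions")
    # Layer 3: strategy behavior
    if s.rolling_sharpe_50 < 0 or s.slippage_ratio > 2 or s.rejection_rate > 0.05:
        actions.append("halve_position_size_pending_review")
    # Layer 4: PnL tiers, most severe first so only one tier fires
    loss_pct = -s.daily_pnl_pct
    if loss_pct >= 5:
        actions.append("flatten_all_positions")
    elif loss_pct >= 3:
        actions.append("halt_new_entries")
    elif loss_pct >= 2:
        actions.append("warn_operator")
    return actions
```

Keeping the thresholds in one pure function makes them easy to unit-test against recorded metrics before they guard a live account.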
Error classification
Every exception should map to one of four classes:
- Transient (network blip, rate limit): retry with backoff, max 3 attempts in 30 seconds.
- Data quality (gap, stale quote, bad tick): skip the signal, log, continue. Never trade on corrupt data.
- Order rejection (insufficient margin, precision): log the broker message, fix the order, do not retry blindly.
- Unknown: halt. An error you have not classified is an error you cannot safely recover from.
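One way to encode the four classes is an enum plus a classifier that defaults to "unknown". The sketch below is an assumption-laden illustration: `DataQualityError` and `OrderRejected` are hypothetical exception types, the action names are placeholders, and the exponential delay schedule (1s, 2s) is one possible backoff that stays inside the 3-attempt, 30-second budget from the text.

```python
import time
from enum import Enum


class ErrorClass(Enum):
    TRANSIENT = "transient"
    DATA_QUALITY = "data_quality"
    ORDER_REJECTION = "order_rejection"
    UNKNOWN = "unknown"


class DataQualityError(Exception):
    """Hypothetical: raised on a gap, stale quote, or bad tick."""


class OrderRejected(Exception):
    """Hypothetical: raised when the broker rejects an order."""


# Placeholder action names, one per class.
ACTIONS = {
    ErrorClass.TRANSIENT: "retry_with_backoff",
    ErrorClass.DATA_QUALITY: "skip_signal_and_log",
    ErrorClass.ORDER_REJECTION: "log_and_fix_order",
    ErrorClass.UNKNOWN: "halt",
}


def classify(exc: BaseException) -> ErrorClass:
    """Map an exception to a class; anything unrecognized is UNKNOWN."""
    if isinstance(exc, (ConnectionError, TimeoutError)):
        return ErrorClass.TRANSIENT
    if isinstance(exc, DataQualityError):
        return ErrorClass.DATA_QUALITY
    if isinstance(exc, OrderRejected):
        return ErrorClass.ORDER_REJECTION
    return ErrorClass.UNKNOWN


def retry_transient(fn, max_attempts=3, window_s=30.0, base_delay_s=1.0,
                    sleep=time.sleep, clock=time.monotonic):
    """Call fn, retrying only transient errors within the attempt/time budget."""
    start = clock()
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception as exc:
            if classify(exc) is not ErrorClass.TRANSIENT:
                raise  # never retry blindly on non-transient errors
            delay = base_delay_s * 2 ** (attempt - 1)
            if attempt == max_attempts or clock() - start + delay > window_s:
                raise  # budget exhausted: let the caller escalate
            sleep(delay)
```

The `sleep` and `clock` parameters exist only so the retry logic can be tested without real waiting.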
The kill switch hierarchy
Three levels, increasing severity:
- Soft kill: stop new entries, let existing positions run to target. Use for strategy-level anomalies.
- Hard kill: cancel all open orders but leave existing positions in place (do not flatten). Use for connectivity or data problems.
- Flat-all: cancel orders and market-out every position. Use for infrastructure failure or PnL breach. Bind this to a watchdog that fires if the main process is unresponsive for 30 seconds — a frozen bot is more dangerous than no bot.
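The three levels and the watchdog can be sketched as follows. This is a simplified illustration under stated assumptions: the `broker` object and its `cancel_all_orders` and `flatten_all_positions` methods are hypothetical, and a real watchdog would likely run in a separate process so that it survives a frozen main loop.

```python
import time
from enum import IntEnum


class KillLevel(IntEnum):
    NONE = 0
    SOFT = 1      # stop new entries, existing positions run to target
    HARD = 2      # also cancel all open orders, do not flatten
    FLAT_ALL = 3  # also market-out every position


class KillSwitch:
    def __init__(self, broker):
        self.broker = broker  # hypothetical broker adapter
        self.level = KillLevel.NONE

    @property
    def entries_allowed(self) -> bool:
        return self.level == KillLevel.NONE

    def trigger(self, level: KillLevel) -> None:
        """Escalate to the given level; never de-escalate automatically."""
        if level <= self.level:
            return
        self.level = level
        if level >= KillLevel.HARD:
            self.broker.cancel_all_orders()
        if level >= KillLevel.FLAT_ALL:
            self.broker.flatten_all_positions()


class Watchdog:
    """Fires FLAT_ALL if no heartbeat arrives within timeout_s seconds."""

    def __init__(self, kill_switch, timeout_s=30.0, clock=time.monotonic):
        self.kill_switch = kill_switch
        self.timeout_s = timeout_s
        self.clock = clock
        self.last_beat = clock()

    def beat(self) -> None:
        # Called by the main process on every loop iteration.
        self.last_beat = self.clock()

    def check(self) -> bool:
        # Called periodically by the watchdog; True means it fired.
        if self.clock() - self.last_beat >= self.timeout_s:
            self.kill_switch.trigger(KillLevel.FLAT_ALL)
            return True
        return False
```

Because `trigger` only escalates, lowering the level again has to be a deliberate manual step, which matches the no-auto-resume rule in the recovery procedure.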
Recovery procedure
After any halt, do not auto-resume. Run a manual checklist:
- Reconcile broker positions against internal state.
- Confirm the cause is fixed, not just gone.
- Restart in paper mode for 30 minutes.
- Scale back to 25% size for the first session.
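The first checklist item, reconciliation, is mechanical enough to script even though the decision to resume stays manual. A minimal sketch, assuming positions are available as symbol-to-quantity mappings; the `reconcile` function and its `tolerance` parameter are hypothetical.

```python
from typing import Dict, Tuple


def reconcile(broker_positions: Dict[str, float],
              internal_positions: Dict[str, float],
              tolerance: float = 1e-9) -> Dict[str, Tuple[float, float]]:
    """Return {symbol: (broker_qty, internal_qty)} for every mismatch.

    A symbol missing from one side is treated as a quantity of 0 there.
    """
    mismatches = {}
    for symbol in sorted(set(broker_positions) | set(internal_positions)):
        broker_qty = broker_positions.get(symbol, 0.0)
        internal_qty = internal_positions.get(symbol, 0.0)
        if abs(broker_qty - internal_qty) > tolerance:
            mismatches[symbol] = (broker_qty, internal_qty)
    return mismatches
```

An empty result is a precondition for moving to the next checklist item, not a signal to resume trading on its own.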
What to log
Every order: timestamp, intended action, broker ack, fill price, slippage vs signal price. Every exception: type, context, action taken, resolution. Without this log, post-mortems are guesswork and the same bug recurs next month.
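The order fields listed above fit naturally into one structured record per order, written as a JSON line. The sketch below is illustrative only: the `OrderRecord` class and its field names are hypothetical, and the sign convention (positive slippage means a worse fill than the signal price) is an assumption, not something the text specifies.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class OrderRecord:
    timestamp: str               # e.g. an ISO 8601 UTC string
    intended_action: str         # what the strategy meant to do
    side: str                    # "buy" or "sell"
    signal_price: float
    broker_ack: Optional[str]    # broker acknowledgment id, None if never acked
    fill_price: Optional[float]  # None if the order did not fill

    def slippage(self) -> Optional[float]:
        """Slippage vs signal price; positive = adverse (assumed convention)."""
        if self.fill_price is None:
            return None
        diff = self.fill_price - self.signal_price
        return diff if self.side == "buy" else -diff

    def to_json(self) -> str:
        """Serialize the record, including derived slippage, as one JSON line."""
        row = asdict(self)
        row["slippage"] = self.slippage()
        return json.dumps(row, sort_keys=True)
```

Writing one self-contained line per order keeps the log greppable and lets a post-mortem load it directly into a dataframe or database.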
A monitored algo fails safely. An unmonitored algo fails catastrophically. The 50 lines of alerting code are the cheapest insurance in the system.