
From Isolation Forest to XGBoost: Deploying ML Anomaly Detection in Production

Cloud Resources Engineering · February 28, 2026 · 12 min read

Most enterprise monitoring is backward-looking. Something breaks, an alert fires, and a human scrambles to figure out what happened. In storage environments where outages cost $100K-$500K per hour, this reactive approach is financially irresponsible.

We built a production ML pipeline that flips this model. Our system detects anomalies, predicts capacity exhaustion, and scores risk — before problems become incidents. Here's the technical deep dive.

The Three-Model Architecture

We deploy three complementary ML models, each purpose-built for a different prediction task. Together, they provide a comprehensive risk and anomaly picture of the entire SAN environment.

Model 1: Isolation Forest for Anomaly Detection

Isolation Forest is elegant in its simplicity: anomalies are “few and different,” so they're easier to isolate. We train on 2M+ historical data points, extracting statistical features (mean, max, standard deviation) for each entity type.

For ports, we track utilization, BB credit zeros, CRC errors, and discards. For arrays, we monitor CPU, latency, IOPS, and cache hit rates. For pools, we watch utilization, subscription ratios, and days-to-full.

The output is an anomaly score from 0 to 100 for every entity. A port with a score of 95 is behaving radically differently from its peers — and that difference warrants immediate investigation.
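The core idea can be sketched with scikit-learn's IsolationForest. This is a minimal illustration, not our production pipeline — the feature values are synthetic, and the 0-100 rescaling of the decision function is one reasonable convention among several:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)

# Synthetic per-port features: utilization %, BB credit zeros/s, CRC errors/s
normal = rng.normal(loc=[40.0, 2.0, 0.1], scale=[10.0, 1.0, 0.05], size=(500, 3))
anomalous = np.array([[98.0, 50.0, 4.0]])   # one port behaving very differently
X = np.vstack([normal, anomalous])

model = IsolationForest(n_estimators=200, contamination="auto", random_state=0)
model.fit(X)

# decision_function: higher = more normal; rescale so 100 = most anomalous
raw = model.decision_function(X)
scores = 100.0 * (raw.max() - raw) / (raw.max() - raw.min())

print(f"anomalous port score: {scores[-1]:.0f}")   # should land near 100
print(f"median normal score:  {np.median(scores[:-1]):.0f}")
```

Because anomalies are "few and different," the outlier port requires far fewer random splits to isolate, which is exactly what pushes its rescaled score toward the top of the range.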

Model 2: Holt-Winters for Capacity Forecasting

Linear regression for capacity forecasting is a rookie mistake. Real storage environments have seasonality — backup windows spike utilization nightly, month-end processing creates monthly peaks, and quarterly archiving creates longer cycles.

Holt-Winters Exponential Smoothing captures all three components: level, trend, and seasonality. We produce 30/60/90-day forecasts with 95% confidence intervals, giving capacity planners both the prediction and the uncertainty band.

When a pool is forecast to hit 85% utilization in 42 days with 95% confidence, that's not a guess — it's a statistically grounded procurement trigger.
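The mechanics can be sketched with a bare-bones additive Holt-Winters implementation. This is a teaching sketch, not production code: the smoothing parameters and the synthetic utilization series are illustrative, and a real deployment would fit the parameters and compute confidence intervals rather than hard-coding them:

```python
def holt_winters_forecast(y, m, horizon, alpha=0.3, beta=0.05, gamma=0.2):
    """Additive Holt-Winters: level + trend + seasonality of period m."""
    # Initialize level, trend, and seasonal indices from the first two cycles
    season1, season2 = y[:m], y[m:2 * m]
    level = sum(season1) / m
    trend = (sum(season2) - sum(season1)) / (m * m)
    seasonal = [v - level for v in season1]

    for t, value in enumerate(y):
        s = seasonal[t % m]
        last_level = level
        level = alpha * (value - s) + (1 - alpha) * (level + trend)
        trend = beta * (level - last_level) + (1 - beta) * trend
        seasonal[t % m] = gamma * (value - level) + (1 - gamma) * s

    # h-step-ahead forecast: extrapolate trend, reuse the seasonal index
    return [level + (h + 1) * trend + seasonal[(len(y) + h) % m]
            for h in range(horizon)]

# Synthetic pool utilization: slow upward trend plus a weekly backup pattern
pattern = [0, 1, 2, 5, 8, 3, 1]                       # nightly backup peaks
y = [60 + 0.1 * t + pattern[t % 7] for t in range(112)]
fc = holt_winters_forecast(y, m=7, horizon=30)
print(f"30-day forecast range: {min(fc):.1f}% to {max(fc):.1f}%")
```

The forecast rises with the trend while preserving the weekly shape — which is why the peak-day forecasts, not a smoothed average, are what should drive a procurement trigger.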

Model 3: XGBoost for Risk Scoring

Rule-based risk scoring assigns static weights: “slow drain = high risk,” “BB credit zeros = medium risk.” But risk interactions are non-linear. A slow drain issue on a port that's also showing BB credit starvation on a critical ISL is exponentially more dangerous than either alone.

XGBoost learns these non-linear interactions from 150 labeled fault events. The gradient boosted decision trees capture complex feature interactions that no human-written rule set could match. And the feature importance output tells us why each entity is risky.
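The interaction-learning behavior can be demonstrated with a small sketch. Note the hedges: this uses scikit-learn's GradientBoostingClassifier as a stand-in for XGBoost (same gradient-boosted-trees family, simpler dependency), and the features and labels are synthetic stand-ins for our fault data:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(7)
n = 600
# Hypothetical per-port features, normalized to [0, 1]
slow_drain = rng.uniform(0, 1, n)   # slow-drain severity
bb_zero = rng.uniform(0, 1, n)      # BB credit zero rate
util = rng.uniform(0, 1, n)         # utilization (irrelevant to the label)
X = np.column_stack([slow_drain, bb_zero, util])

# Non-linear label: faults cluster where slow drain AND credit starvation co-occur
y = ((slow_drain * bb_zero) > 0.5).astype(int)

clf = GradientBoostingClassifier(n_estimators=100, max_depth=3, random_state=0)
clf.fit(X, y)

# Risk score 0-100 from predicted fault probability
risky = clf.predict_proba([[0.9, 0.9, 0.5]])[0, 1] * 100   # both factors high
benign = clf.predict_proba([[0.9, 0.1, 0.5]])[0, 1] * 100  # only one factor high
print(f"both factors high: {risky:.0f}, one factor high: {benign:.0f}")

for name, imp in zip(["slow_drain", "bb_credit_zeros", "utilization"],
                     clf.feature_importances_):
    print(f"{name}: importance {imp:.2f}")
```

A static rule that sums per-feature weights cannot represent this multiplicative interaction; the trees learn it directly, and the feature importances explain which signals drove each score.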

The Training Pipeline

Our models train automatically on startup, processing the full historical dataset in approximately 53 seconds:

  • Isolation Forest: ~28 seconds (2M+ data points, statistical feature extraction)
  • Holt-Winters: ~5 seconds (per-pool time series fitting)
  • XGBoost: ~20 seconds (150 labeled faults + metric features)
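The startup sequence itself is straightforward to sketch. The trainer callables below are hypothetical placeholders (our real jobs load the historical dataset); the point is the pattern — train each model on boot and record wall-clock timings:

```python
import time

def train_all(trainers):
    """Run each model's training job at startup and log wall-clock time."""
    timings = {}
    for name, train_fn in trainers.items():
        start = time.perf_counter()
        train_fn()
        timings[name] = time.perf_counter() - start
        print(f"{name}: trained in {timings[name]:.2f}s")
    return timings

# Hypothetical placeholders standing in for the three real training jobs
timings = train_all({
    "isolation_forest": lambda: time.sleep(0.01),
    "holt_winters": lambda: time.sleep(0.01),
    "xgboost": lambda: time.sleep(0.01),
})
```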

For our trading systems, we go further with nightly retraining — models update every 24 hours with the latest market data, using Bayesian weight updating to balance historical patterns with recent signals.

ML vs. Rule-Based: The Comparison

After running both approaches in parallel, the differences were stark:

  • Anomaly detection: ML caught 47 anomalies that threshold-based rules missed entirely (5.1% anomaly rate across 919 entities)
  • Capacity forecasting: ML predictions were 3x more accurate than linear regression, especially around seasonal transitions
  • Risk scoring: XGBoost identified 23% more critical risks than static rules, with zero false positives on the top-10 list

Lessons from Production

  • Start with rule-based, augment with ML. Rules are explainable and fast to deploy. ML adds the pattern recognition that rules can't match.
  • Feature engineering is 80% of the work. The ML models are simple; the canonical metrics and statistical features are where the value lives.
  • Anomaly scores need context. A score of 95 on a test port is noise. A score of 95 on a production ISL is a potential major incident. Combine ML scores with business context.
  • Retrain or decay. Models that don't retrain become stale. Build retraining into the pipeline from day one.
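The "context" lesson reduces to a simple pattern: weight the raw ML score by business criticality before alerting. The role names and weights here are hypothetical illustrations, not our actual configuration:

```python
# Hypothetical criticality weights by entity role
CRITICALITY = {"test_port": 0.2, "edge_port": 0.6, "production_isl": 1.0}

def contextual_risk(anomaly_score, role):
    """Blend a 0-100 ML anomaly score with business criticality."""
    return anomaly_score * CRITICALITY.get(role, 0.5)

print(contextual_risk(95, "test_port"))       # 19.0 — noise, log and move on
print(contextual_risk(95, "production_isl"))  # 95.0 — page someone
```

The same model output yields very different operational responses, which keeps the ML layer generic while the alerting layer encodes what the business actually cares about.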

Production ML isn't about complex algorithms. It's about the right model for the right problem, trained on the right data, deployed with the right context. When you get that combination right, you go from reactive firefighting to predictive intelligence — and that transformation is worth millions.