Mock Pineapple · Step 4 · Maintenance Loop

The System Maintains Itself — Drift-Triggered Re-Tuning

Re-tuning on a calendar is wasteful. Re-tuning on every single-week MAPE spike is over-eager. The right answer is drift-triggered re-tuning, decided per pair and per model, and the rules differ from pair to pair. This is the unsexy MLOps work that's actually load-bearing.
−71%: compute vs periodic re-tune
1.5×: drift threshold
2 weeks: consecutive trigger required
+62%: SGD/SARIMA improvement
−150%: JPY/SARIMA hurt by re-tune
Walk-forward MAPE — the source of truth

Drift detection requires a stable measurement. Single-window backtests are too noisy — a model that randomly hit a calm month looks great; a model that hit a rate-decision week looks terrible. Walk-forward across 10+ cutoff dates averages out the noise. This is the only number the drift detector trusts.

# run_weekly_retrain.py — Step 1

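import numpy as np
import pandas as pd

def walk_forward_mape(pair, model, n_cutoffs=10, horizon_days=30):
    # `data` is the price history for `pair` (columns: date, actual), loaded elsewhere.
    # sample_cutoffs is expected to return pandas Timestamps.
    cutoffs = sample_cutoffs(start='2023-01-01', end='today', n=n_cutoffs)
    mapes = []
    for cutoff in cutoffs:
        window_end = cutoff + pd.Timedelta(days=horizon_days)
        train_df = data[data.date < cutoff]
        test_df  = data[(data.date >= cutoff) & (data.date < window_end)]
        model.fit(train_df)
        # forecast is assumed to align row-for-row with test_df
        forecast = np.asarray(model.predict(horizon=horizon_days))
        actual = test_df.actual.to_numpy()
        mapes.append(np.mean(np.abs(forecast - actual) / np.abs(actual)))
    return float(np.median(mapes))   # median, not mean: robust to a single bad cutoff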
Why median, not mean: one bad cutoff (say, a regime shift week) shouldn't make the system re-tune a model that's been stable for 9 other cutoffs. Median is robust; mean is trigger-happy.
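To make the median-versus-mean point concrete, here is a small sketch with made-up numbers: nine calm cutoffs and one regime-shift outlier. The MAPE values and the baseline are illustrative assumptions, not results from the pipeline.

```python
from statistics import mean, median

# Hypothetical per-cutoff MAPEs: nine calm cutoffs and one outlier.
mapes = [0.003] * 9 + [0.030]
baseline = 0.003      # assumed tuned baseline MAPE
threshold = 1.5       # drift threshold used in this write-up

mean_ratio = mean(mapes) / baseline
median_ratio = median(mapes) / baseline

print(f"mean-based ratio:   {mean_ratio:.2f}")
print(f"median-based ratio: {median_ratio:.2f}")
print("mean would flag drift:  ", mean_ratio >= threshold)
print("median would flag drift:", median_ratio >= threshold)
```

With these numbers the mean-based ratio lands at roughly 1.9 and crosses the threshold, while the median-based ratio stays at 1.0, so a single bad cutoff alone does not flag the model.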
Drift detection — 1.5× threshold, 2 consecutive weeks

A single-week MAPE spike isn't drift, it's noise. Drift is persistent. The detector compares this week's walk-forward MAPE to the baseline tuned MAPE; if the ratio is at least 1.5× for two consecutive weeks, it triggers an Optuna re-tune. Baselines live in tuned_hyperparameters.json; thresholds in config.json.
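The two-consecutive-weeks rule can be sketched as a pure function over a list of weekly ratios. This is a minimal illustration, not the production detector: the function name `drift_flags` and the sample ratios are hypothetical.

```python
def drift_flags(ratios, threshold=1.5, consecutive=2):
    """Return one boolean per week.

    A week is flagged only when it and the preceding weeks form a run of
    at least `consecutive` ratios at or above `threshold`.
    """
    flags = []
    streak = 0
    for ratio in ratios:
        streak = streak + 1 if ratio >= threshold else 0
        flags.append(streak >= consecutive)
    return flags


# Hypothetical weekly ratios (current MAPE / baseline MAPE).
weekly_ratios = [1.1, 2.4, 1.0, 1.6, 1.8, 1.2]
flags = drift_flags(weekly_ratios)
print(flags)
```

The isolated 2.4 spike in the second week is ignored because the following week falls back under the threshold; only the 1.6 then 1.8 run produces a trigger.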

# source.monitoring.drift_retune

from collections import defaultdict

history = defaultdict(list)   # (model, pair) -> weekly "over threshold" flags

def check_drift(pair, model, threshold=1.5):
    current  = walk_forward_mape(pair, model)
    baseline = tuned_hyperparameters[(model, pair)]['mape']

    ratio = current / baseline
    flags = history[(model, pair)]      # separate history per (model, pair)
    flags.append(ratio >= threshold)    # only the last 2 weeks are inspected

    if len(flags) >= 2 and all(flags[-2:]):
        # Persistent drift: re-tune
        log_warning(f"DRIFT DETECTED: {model}/{pair} — "
                    f"MAPE {current:.4f} vs baseline {baseline:.4f} "
                    f"({ratio:.1f}x) for 2 consecutive weeks → RE-TUNE TRIGGERED")
        return True
    return False

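def retune(pair, model):
    # 30 Optuna trials: a narrower search than the initial 100.
    # make_objective is a hypothetical helper returning a function that maps
    # an Optuna trial to the walk-forward MAPE for this (model, pair).
    study = optuna.create_study(direction='minimize')
    study.optimize(make_objective(model, pair), n_trials=30)
    # Entry layout is assumed; check_drift only needs the 'mape' key.
    tuned_hyperparameters[(model, pair)] = {'params': study.best_params,
                                            'mape': study.best_value}
    save(tuned_hyperparameters)         # atomic JSON write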
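The "atomic JSON write" mentioned in the listing is commonly done by writing to a temporary file in the same directory and then renaming it over the target. The sketch below shows that pattern under a few assumptions: the helper names `atomic_write_json` and `to_json_keys` are hypothetical, and the `"model/pair"` string key is just one way to flatten tuple keys, since JSON object keys must be strings.

```python
import json
import os
import tempfile


def to_json_keys(tuned):
    """Flatten (model, pair) tuple keys into 'model/pair' strings for JSON."""
    return {f"{model}/{pair}": entry for (model, pair), entry in tuned.items()}


def atomic_write_json(obj, path):
    """Write obj as JSON so readers never observe a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(obj, f, indent=2)
            f.flush()
            os.fsync(f.fileno())      # push bytes to disk before the rename
        os.replace(tmp_path, path)    # atomic when both paths share a filesystem
    except BaseException:
        # Remove the temp file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

A call such as `atomic_write_json(to_json_keys(tuned_hyperparameters), "tuned_hyperparameters.json")` would then either leave the old file untouched or replace it with the complete new one.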
Per-pair, per-model rules — what 6 years of simulation taught

Before deploying drift-triggered re-tuning, I simulated 6 years of weekly tunes against the no-tune baseline, per (model, pair). The result was strikingly pair-specific — the same drift detector that helps SGD/SARIMA actively hurts JPY/SARIMA. The production rules below are encoded directly in drift_retune.py.

SARIMA / JPY
RULE — NEVER RE-TUNE
Re-tuning SARIMA on JPY hurts MAPE by 150% in the 6-year simulation. JPY's autoregressive structure is unusually stable — once tuned, it stays good. Each re-tune introduces noise from finite-sample Optuna trials.

Encoded as: SKIP_RETUNE = {('sarima', 'JPY')} hardcoded skip set.
SARIMA / SGD
RULE — DRIFT-TRIGGERED
With drift triggers, SGD/SARIMA improves MAPE by 62% versus the no-retune baseline. Drift triggers fired 3 times in 6 years; periodic re-tuning would have fired 7 times, for the same outcome with less compute.

SGD's monetary policy regime shifts more than JPY's, so the model legitimately needs occasional updates.
LightGBM / JPY
RULE — DRIFT-TRIGGERED
LightGBM/JPY improves by 76% with only 2 re-tunes in 6 years. Tree models are more sensitive to feature distribution shifts than autoregressive models: the macro feature set evolves with the world (VIX regime, yield-curve shape).

The drift trigger catches it without re-tuning every Monday.
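The three rules above boil down to one skip set applied on top of the drift check. Here is a minimal sketch of that filtering step; `build_retune_queue` and the sample drift results are hypothetical, and only `SKIP_RETUNE` and its single entry come from the rules above.

```python
# Combinations the 6-year simulation showed are hurt by re-tuning.
SKIP_RETUNE = {("sarima", "JPY")}


def build_retune_queue(drift_results, skip=SKIP_RETUNE):
    """Keep only drifted (model, pair) keys that are not in the skip set.

    drift_results maps (model, pair) -> bool, as returned by the drift check.
    """
    return [key for key, drifted in drift_results.items()
            if drifted and key not in skip]


# Hypothetical drift-check output for one Monday run.
drift_results = {
    ("sarima", "JPY"): True,      # drifted, but the rule says never re-tune
    ("sarima", "SGD"): True,
    ("lightgbm", "JPY"): False,   # no drift, nothing to do
    ("lightgbm", "SGD"): True,
}
queue = build_retune_queue(drift_results)
print(queue)
```

Keeping the exceptions in a plain set means a new pair-specific rule is a one-line change rather than a new branch in the detector.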
−71% compute vs periodic re-tuning

The weekly retrain script trains all 24 model-pair combinations (3 models × 8 pairs) every Monday. That part is non-negotiable, because it is how drift gets measured. Re-tuning hyperparameters via Optuna is the expensive step: 30 trials × ~80 seconds each = ~40 minutes of compute per re-tune. Doing this for every model-pair combination every week is wasteful.

Strategy | Re-tunes per year | Compute | Best vs no-retune
Periodic (every Monday) | 52 × 24 = 1,248 | ~830 hrs/year | marginal
Drift-triggered (1.5× for 2 weeks) | ~10/year (in production) | ~6.6 hrs/year (−71%) | +62% / +76% on the right pairs
Per-pair rules (no JPY/SARIMA) | ~7/year | ~4.7 hrs/year (−80%) | strictly better: never hits the regression cases
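The hours in the compute column follow from the per-re-tune cost quoted above (30 trials at about 80 seconds each). The sketch below reproduces only that arithmetic; the re-tune counts are the approximate figures from the table, and the variable names are illustrative.

```python
TRIALS_PER_RETUNE = 30
SECONDS_PER_TRIAL = 80      # "~80 seconds each" from the text

minutes_per_retune = TRIALS_PER_RETUNE * SECONDS_PER_TRIAL / 60

retunes_per_year = {
    "periodic": 52 * 24,        # every Monday, all 24 model-pair combinations
    "drift-triggered": 10,      # approximate production count
    "per-pair rules": 7,        # approximate count without JPY/SARIMA
}

hours_per_year = {
    name: count * minutes_per_retune / 60
    for name, count in retunes_per_year.items()
}

for name, hours in hours_per_year.items():
    print(f"{name:16s} {retunes_per_year[name]:5d} re-tunes  {hours:7.1f} hrs/year")
```

This gives about 832, 6.7 and 4.7 hours per year, in line with the rounded figures in the table.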
The principle: "stationary" isn't a property of FX markets — it's a property of this pair × this model × this feature set. The drift detector and per-pair rules together encode that, so I don't have to remember it.
Mar 30 incident — drift caught SGD before I did

The most recent successful retrain logged a real drift event. This is what the system was built to do.

SGD models drifted 7.9× and 1.8× — system auto-recovered overnight

DRIFT DETECTED: sarima/SGD — MAPE 0.0220 vs baseline 0.0028 (7.9x) for 2 consecutive weeks → RE-TUNE TRIGGERED

DRIFT DETECTED: lightgbm/SGD — MAPE 0.0041 vs baseline 0.0023 (1.8x) for 2 consecutive weeks → RE-TUNE TRIGGERED

Both models re-tuned in the same retrain cycle (~80 seconds each, 30 Optuna trials, walk-forward validated). New baselines written to tuned_hyperparameters.json. Dashboards regenerated. Total human time involved: zero. I noticed it the next morning when I looked at the regen log.

This is what "MLOps" should mean: the system flags its own failures, fixes what it can, and only escalates when the rules don't apply.

Full weekly retrain — five sequential steps
# Mondays 9:30 AM via LaunchAgent — com.mockpineapple.weekly-retrain

bash projects/mock_pineapple/run_pipeline.sh --retrain

  STEP 1 · walk-forward MAPE per (model × pair) over 10 cutoffs
        24 fits × ~1-2s each ≈ 30s

  STEP 2 · drift_retune.py
        for each (model, pair):
            check_drift() → returns True/False
            if True and (model, pair) ∉ SKIP_RETUNE: queue_retune()

  STEP 3 · Optuna re-tune queued items (sequential, ~40-60s each)
        Mar 30 example: 3 items queued, 2 actually re-tuned (SARIMA/SGD, LightGBM/SGD)
        save tuned_hyperparameters.json (atomic write)

  STEP 4 · refit production models with new hyperparameters
        write data/trained_models/YYYY-WXX/manifest.json
        sarima.pkl, prophet.pkl, lgbm.pkl per (pair, model)

  STEP 5 · regenerate dashboards
        dashboard/generate_analyst_dashboard.py
        → mape_dashboard.html (model quality, ensemble weights)
        → analyst_report.html  (7-finding narrative)
        → bucketed_mape.html   (per-horizon bucket detail)

  Total wall time: ~25-40 min (depends on how many re-tunes triggered)
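Step 4 writes into a `YYYY-WXX` directory. One way to derive that name is from the ISO calendar week of the run date, sketched below. Whether the pipeline uses ISO weeks is an assumption here, and `week_dir` is a hypothetical helper.

```python
from datetime import date
from pathlib import Path


def week_dir(run_date, root="data/trained_models"):
    """Return root/YYYY-WXX for the ISO week containing run_date.

    The ISO year is used instead of run_date.year so that days around
    New Year land in the week they actually belong to.
    """
    iso_year, iso_week, _ = run_date.isocalendar()
    return Path(root) / f"{iso_year}-W{iso_week:02d}"


print(week_dir(date(2024, 1, 1)).as_posix())   # a Monday, first ISO week of 2024
print(week_dir(date(2023, 1, 1)).as_posix())   # a Sunday, still in ISO year 2022
```

Because the job runs on Mondays, each run starts a fresh ISO week, so consecutive runs would never share a directory under this naming.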
What this loop taught me

Three rules that came out of this work

1. Re-tune is not a default, it's a triggered action. Default re-tuning destroys stable models. Treating "Monday morning" as a re-tune signal is a bug, not a feature.

2. Per-pair rules beat global rules. The same drift threshold (1.5×) is correct for SGD and wrong for JPY/SARIMA. Encoding pair-specific behavior in SKIP_RETUNE took 30 minutes and prevented a class of regressions I'd otherwise have to babysit.

3. Dashboard regeneration is part of the loop. If the system re-tunes overnight but the dashboard still shows last week's MAPE, the human (me) doesn't trust the numbers. Regenerating analyst_report.html every Monday closes the loop.

What the maintenance loop delivers

71% less compute than periodic re-tuning. Per-pair rules prevent the regression cases (JPY/SARIMA gets worse with re-tuning, so the system never tries). Dashboards regenerate automatically so the numbers Diane sees are always current. The Mar 30 incident shows the loop working in production: SGD drift detected, two models re-tuned, no human in the loop. This is what "MLOps on a laptop" looks like.