Three Models, One Ensemble — Building the FX Forecaster
38 features, three forecasting models, walk-forward validated, Optuna-tuned, ensemble-weighted by horizon. The biggest accuracy lift came from a question Diane asked mid-sprint — not from the planned roadmap.
38 features · 3 models per pair · −90% PHP MAPE after Optuna · −55% SGD MAPE after exogenous · 10+ walk-forward cutoffs
Feature inventory — 38 features in 4 classes
Naively dumping every available signal into LightGBM hurts more than it helps. Every feature below has been A/B-tested under walk-forward validation. That rule exists because three feature modules were once shipped without ablation, and two of them (SARIMA residuals, CEEMDAN) turned out to be dead weight. The script run_feature_ablation.py is now a permanent gate.
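The keep-or-drop decision behind an ablation gate can be sketched as follows. This is not the contents of run_feature_ablation.py, which is not shown here: the function name, the min_gain threshold, and the MAPE values are hypothetical and only illustrate comparing walk-forward MAPE with and without a feature module.

```python
from statistics import mean

def ablation_verdict(baseline_mapes, candidate_mapes, min_gain=0.0):
    """Decide whether a feature module earns its place.

    baseline_mapes:  MAPE (%) per walk-forward cutoff WITHOUT the module.
    candidate_mapes: MAPE (%) at the same cutoffs WITH the module.
    Returns True when the average improvement exceeds min_gain (hypothetical knob).
    """
    if len(baseline_mapes) != len(candidate_mapes):
        raise ValueError("both runs must use the same walk-forward cutoffs")
    gain = mean(baseline_mapes) - mean(candidate_mapes)  # positive = module helps
    return gain > min_gain

# Illustrative numbers only, not measured results.
keep = ablation_verdict([0.60, 0.58, 0.62], [0.55, 0.54, 0.57])
drop = ablation_verdict([0.60, 0.58, 0.62], [0.61, 0.60, 0.62])
```

Averaging across cutoffs is one simple choice; a stricter gate could also require the module to win on a majority of individual cutoffs.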
Lagged price
5 features
Lags t-1, t-2, t-5, t-10, t-22 (1 month). Captures momentum and mean-reversion at multiple time scales. Always-on for SARIMA, gated by feature-importance for LightGBM.
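A minimal sketch of the five lags, in plain Python so it runs without dependencies. The function name and the choice to emit None where a lag reaches before the start of the series are assumptions, not the pipeline's actual behaviour.

```python
LAGS = (1, 2, 5, 10, 22)  # lag offsets from the list above, in observations

def lag_features(prices, lags=LAGS):
    """Return one feature dict per day; unavailable lags are None."""
    rows = []
    for t in range(len(prices)):
        rows.append({
            f"lag_{k}": prices[t - k] if t - k >= 0 else None
            for k in lags
        })
    return rows

prices = [100.0 + i for i in range(30)]   # toy series: 100, 101, ..., 129
row = lag_features(prices)[25]            # features available on day 25
```

On the toy series, day 25 sees `lag_1 = 124.0` and `lag_22 = 103.0`; rows with None would normally be dropped before training.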
Rolling statistics
9 features
Mean, std, skew over 5/22/63 windows. Volatility regime signal. The 22-day std turned out to be the strongest single LightGBM feature on JPY (carry-related).
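The 3 statistics × 3 windows layout can be sketched like this. Two assumptions are baked in and may differ from the real feature code: the window ends at t-1 (so the current day is excluded to avoid look-ahead), and the population forms of std and skew are used.

```python
from statistics import mean, pstdev

WINDOWS = (5, 22, 63)  # roughly one week, one month, one quarter of trading days

def skewness(xs):
    """Population skewness: mean of cubed z-scores (0.0 for a constant window)."""
    m, s = mean(xs), pstdev(xs)
    if s == 0:
        return 0.0
    return mean([((x - m) / s) ** 3 for x in xs])

def rolling_stats(series, t, windows=WINDOWS):
    """Mean/std/skew over the w observations ending at index t-1."""
    feats = {}
    for w in windows:
        if t < w:
            continue  # not enough history for this window yet
        window = series[t - w:t]
        feats[f"mean_{w}"] = mean(window)
        feats[f"std_{w}"] = pstdev(window)
        feats[f"skew_{w}"] = skewness(window)
    return feats

series = [float(i) for i in range(100)]
feats = rolling_stats(series, 70)   # all three windows are available at t=70
```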
Macro exogenous (Diane's question)
12 features
DXY, VIX, US 10Y Treasury yield: levels + 1d/7d/30d deltas. VIX regime flag (≥20 = stress). Gold/oil ratio. Cut LightGBM MAPE 17-28% on EUR, SGD, and JPY before re-tuning. Originated mid-sprint, not planned.
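A sketch of the level-plus-deltas pattern and the VIX regime flag. The function names are hypothetical, and the deltas are taken over observation offsets here; whether the real features use calendar days or trading days is not stated.

```python
def macro_features(name, levels, t, deltas=(1, 7, 30)):
    """Level at t plus simple differences over each delta (None if no history)."""
    feats = {f"{name}_level": levels[t]}
    for d in deltas:
        feats[f"{name}_delta_{d}d"] = (
            levels[t] - levels[t - d] if t - d >= 0 else None
        )
    return feats

def vix_stress_flag(vix_level, threshold=20.0):
    """1 when VIX is at or above the stress threshold (>= 20 in the text), else 0."""
    return 1 if vix_level >= threshold else 0

vix = [10 + i for i in range(40)]        # toy series rising by 1 per observation
f = macro_features("vix", vix, t=35)
flag = vix_stress_flag(f["vix_level"])
```

Each series contributes one level and three deltas, so three series give the 12 numeric columns; the flag and ratio are separate columns in this sketch.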
Yield spreads (carry)
5 features
Pair-specific interest-rate differentials (USD-JPY, USD-SGD). Captures carry-trade pressure. Best on SGD: pulled MAPE from 0.58% to 0.42% (−28%).
Calendar / cyclical
4 features
Day-of-week, month-end flag, quarter-end flag, sin/cos of day-of-year. Small but consistent contribution at long horizons (days 15-30).
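The calendar features can be sketched with the standard library alone. One assumption to note: month-end here means the last calendar day of the month, whereas an FX pipeline may well use the last trading day instead.

```python
import math
from datetime import date, timedelta

def calendar_features(d):
    """Day-of-week, month/quarter-end flags, and cyclical day-of-year encoding."""
    month_end = (d + timedelta(days=1)).month != d.month
    quarter_end = month_end and d.month in (3, 6, 9, 12)
    # Map day-of-year onto a circle so 31 Dec and 1 Jan end up adjacent.
    angle = 2 * math.pi * d.timetuple().tm_yday / 365.25
    return {
        "dow": d.weekday(),            # Monday = 0
        "month_end": int(month_end),
        "quarter_end": int(quarter_end),
        "doy_sin": math.sin(angle),
        "doy_cos": math.cos(angle),
    }

april = calendar_features(date(2024, 4, 30))
march = calendar_features(date(2024, 3, 31))
```

The sin/cos pair is what lets a tree model or linear term treat late December and early January as neighbours rather than as opposite ends of a 1-366 scale.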
Direct-horizon features
3 buckets × LightGBM
Per-bucket trained predictors: short (h=1-3), medium (h=4-14), long (h=15-30). Each bucket picks its own optimal feature set via Optuna. Avoids recursive-prediction error compounding.
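To illustrate why the direct approach avoids compounding, here is a sketch of the bucket lookup and of how a direct target is built: each training row at day t is paired with the value actually observed h steps later, so no forecast is ever fed back in as an input. The names are hypothetical.

```python
BUCKETS = {
    "short": range(1, 4),     # h = 1-3
    "medium": range(4, 15),   # h = 4-14
    "long": range(15, 31),    # h = 15-30
}

def bucket_for(h):
    """Return the name of the bucket whose predictor serves horizon h."""
    for name, horizons in BUCKETS.items():
        if h in horizons:
            return name
    raise ValueError(f"horizon {h} is outside 1-30")

def direct_pairs(prices, h):
    """(t, target) pairs where the target is the observed price h steps after t."""
    return [(t, prices[t + h]) for t in range(len(prices) - h)]

pairs = direct_pairs([1, 2, 3, 4, 5], h=2)
```

A recursive forecaster would instead predict h=1, append that prediction to its inputs, and repeat, carrying each step's error into the next.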
Three forecasting models — and where each one wins
No single model is best at every horizon. SARIMA dominates the short horizon, LightGBM the long, and Prophet covers the middle. The ensemble's job is to know that, per pair and per horizon bucket. Libraries: statsmodels for SARIMA, facebook/prophet for Prophet, and lightgbm (GPU with CPU fallback) for the gradient-boosted forecaster.
SARIMA
BEST AT: days 1-3
Per-pair (p,d,q)(P,D,Q,s) frozen at weekly retrain. Optuna over autoregressive orders. Strongest at short horizon (MAPE ~0.5% on JPY days 1-3). Degrades quickly past day 14.
Gotcha: non-stationary AR starting params on some pairs → falls back to zeros, occasional convergence warnings. Caught by walk-forward; not a deployment blocker.
Prophet
BEST AT: days 4-14
Trend + weekly seasonality. Filtered out where MAPE > 2.0× best — auto-excluded from EUR ensemble (5.3% MAPE) and per-bucket capped on others. ~1-2 second fit per pair.
Why it stays: stable variance estimates feed the calibrated CIs (S3 KPI: 92-98% coverage).
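The exclusion rule (drop a model whose MAPE exceeds 2.0× the best model's) can be sketched as below. Applying it uniformly to every model is my simplification, and apart from Prophet's 5.3% on EUR the input numbers are illustrative.

```python
def eligible_models(mape_by_model, max_ratio=2.0):
    """Keep models whose MAPE is at most max_ratio times the best MAPE."""
    best = min(mape_by_model.values())
    return {m: v for m, v in mape_by_model.items() if v <= max_ratio * best}

# Prophet's 5.3% is quoted above; the other two values are placeholders.
eur = {"sarima": 0.57, "lightgbm": 0.38, "prophet": 5.3}
kept = eligible_models(eur)
```

With a best MAPE of 0.38% the cutoff sits at 0.76%, so Prophet is excluded and the other two remain.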
LightGBM (Direct, 3-bucket)
BEST AT: days 15-30
Per-pair Optuna params, 38 features. Three buckets trained independently, at h=2, h=9, and h=23 (one horizon inside each of the short, medium, and long buckets). Eats exogenous macro features. Single biggest mover in the system.
Dominant ensemble weight on EUR (56%) and SGD (41%) after exogenous features were added.
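How the ensemble weights are derived is not spelled out here. One common scheme, shown purely as an assumption, is inverse-MAPE weighting: each model's weight is proportional to 1/MAPE within a pair and horizon bucket. The SGD inputs are the per-model figures from the tuning table; the forecast values are made up.

```python
def inverse_mape_weights(mape_by_model):
    """Weights proportional to 1/MAPE, normalised to sum to 1."""
    inverse = {m: 1.0 / v for m, v in mape_by_model.items()}
    total = sum(inverse.values())
    return {m: x / total for m, x in inverse.items()}

def blend(forecasts, weights):
    """Weighted average of per-model point forecasts."""
    return sum(weights[m] * forecasts[m] for m in weights)

sgd_mape = {"sarima": 0.28, "lightgbm": 0.26, "prophet": 0.50}
weights = inverse_mape_weights(sgd_mape)

# Hypothetical point forecasts for one day, for illustration only.
point = blend({"sarima": 1.3400, "lightgbm": 1.3420, "prophet": 1.3380}, weights)
```

With these inputs the LightGBM weight comes out near 0.41, which is in line with the 41% quoted for SGD, though that agreement does not confirm the production scheme.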
Optuna tuning — single largest accuracy driver
The first version was a default-hyperparameter ensemble, and it did not get below 1% MAPE on any pair. Optuna with proper walk-forward validation cut MAPE by 31-90% on the four pairs tabulated below, and tuning is now driven automatically by the weekly retrain (with drift gating, see the drift & auto-retune page).
| Pair | Before tuning | After tuning | Reduction | Method |
| --- | --- | --- | --- | --- |
| PHP | 2.72% | 0.28% | −90% | Optuna 100 trials, walk-forward |
| JPY | 2.12% | 0.59% | −72% | Optuna 100 trials, walk-forward |
| EUR | 1.21% | 0.57% | −53% | Optuna 100 trials, walk-forward |
| GBP | 1.45% | 1.00% | −31% | Optuna 100 trials, walk-forward |
| SGD | — | SARIMA 0.28% · LightGBM 0.26% · Prophet 0.50% | tuned in | added in sprint 2 |
What it bought: S1 KPI (validated accuracy, <5% MAPE per pair) went from 0/5 to 5/5 in two sprints. S2 (beats baseline) went from no ensemble to PHP −64% vs single-model. The compute cost is paid once per retrain: ~40 minutes per pair × 5 pairs (about 3.3 hours) is amortized over a week of daily forecasts.
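For readers unfamiliar with walk-forward validation, here is a dependency-free sketch of the scoring loop a tuner would call once per trial. The expanding-window layout, the default sizes, and the naive last-value forecaster are all assumptions chosen to keep the example short; the real objective would fit SARIMA, Prophet, or LightGBM at each cutoff.

```python
def mape(actual, predicted):
    """Mean absolute percentage error, in percent."""
    return 100.0 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)

def walk_forward_splits(n_obs, n_cutoffs=10, horizon=30, min_train=250):
    """Expanding-window (train_end, test_end) index pairs, evenly spaced."""
    if n_cutoffs < 2:
        raise ValueError("need at least two cutoffs")
    last_cutoff = n_obs - horizon
    step = (last_cutoff - min_train) // (n_cutoffs - 1)
    if step < 1:
        raise ValueError("series too short for the requested splits")
    return [(min_train + i * step, min_train + i * step + horizon)
            for i in range(n_cutoffs)]

def naive_last_value(train, steps):
    """Placeholder forecaster: repeat the last observed value."""
    return [train[-1]] * steps

def evaluate(series, forecast_fn, **split_kwargs):
    """Score a forecaster at every cutoff using only data before that cutoff."""
    scores = []
    for train_end, test_end in walk_forward_splits(len(series), **split_kwargs):
        train, test = series[:train_end], series[train_end:test_end]
        scores.append(mape(test, forecast_fn(train, len(test))))
    return scores

series = [100.0 + 0.01 * i for i in range(1000)]   # toy upward drift
scores = evaluate(series, naive_last_value)
splits = walk_forward_splits(len(series))
```

The key property is that every score comes from data the model never saw during fitting, which is exactly what a single fit on the full history cannot provide.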
Macro features — the mid-sprint addition that moved the needle
Three days into the second sprint I added 12 exogenous macro features to LightGBM: DXY, VIX, and the US 10Y Treasury yield (levels and 1d/7d/30d deltas), a gold/oil ratio, and a VIX-regime flag. On the three pairs tabulated below they cut LightGBM MAPE by 17-28% on their own. After re-tuning the hyperparameters on the new feature set, the cumulative reduction ranged from 17% to 55%.
| Pair | LightGBM MAPE before macro | After macro | After re-tune | Total reduction |
| --- | --- | --- | --- | --- |
| EUR | 0.56% | 0.46% | 0.38% | −32% |
| SGD | 0.58% | 0.42% | 0.26% | −55% |
| JPY | 0.71% | 0.59% | 0.59% | −17% |
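The total-reduction column is the relative change from the pre-macro MAPE to the re-tuned MAPE. A short check of that arithmetic, using only the figures from the table:

```python
def pct_change(before, after):
    """Relative change in percent; negative means MAPE went down."""
    return 100.0 * (after - before) / before

# (before macro, after re-tune), copied from the table above
table = {"EUR": (0.56, 0.38), "SGD": (0.58, 0.26), "JPY": (0.71, 0.59)}

totals = {pair: round(pct_change(before, after))
          for pair, (before, after) in table.items()}
print(totals)  # {'EUR': -32, 'SGD': -55, 'JPY': -17}
```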
Why these features: FX rates aren't isolated time series — they reflect rate-differential expectations (US 10Y), risk-on/risk-off regimes (VIX), and dollar strength (DXY). Adding macro pulled signal from outside each pair's own history. The deltas matter more than the levels: a 30-day VIX change tells you the regime shifted; the level alone doesn't.
Process note: macro features were near the bottom of the planned milestone list (M59) — the plan was to ship pure FX features first, validate, then layer in exogenous data. I jumped the queue partway through sprint 2 because the model was plateauing on within-pair features alone. Worth doing earlier next time.
What didn't work — the WHEAT counter-example
Not every experiment ships. The honest record matters more than the success rate.
NEGATIVE RESULT · WHY THE GATE EXISTS
WHEAT was deployed overfit and had to be reverted
Instrument expansion (M68-M73) added WHEAT as a candidate for daily forecasting. Optuna found a 2.79% MAPE — looked acceptable on paper. But it had been trained on the full 2017-2024 history with no walk-forward validation. The first out-of-sample period showed 7-9% errors.
Pulled from production same day. The walk-forward validation gate (validate_pair_has_walkforward()) was added immediately afterward — every new instrument now requires ≥10 walk-forward cutoffs before it can ship to the daily pipeline. WHEAT remains in the codebase as a fixture for testing the gate.
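A sketch of what such a gate might look like. Only the function name and the ≥10 threshold come from the text; the signature, the exception type, and the shape of the results mapping are hypothetical.

```python
MIN_CUTOFFS = 10  # threshold stated above

class WalkForwardGateError(RuntimeError):
    """Raised when an instrument lacks enough walk-forward evidence to ship."""

def validate_pair_has_walkforward(pair, cutoff_results, min_cutoffs=MIN_CUTOFFS):
    """Return the number of cutoffs for `pair`, or raise if there are too few.

    cutoff_results: hypothetical mapping of instrument -> list of per-cutoff MAPEs.
    """
    n = len(cutoff_results.get(pair, []))
    if n < min_cutoffs:
        raise WalkForwardGateError(
            f"{pair}: {n} walk-forward cutoffs, need at least {min_cutoffs}"
        )
    return n

# Illustrative results store: JPY has ten cutoffs, WHEAT was tuned with none.
results = {"JPY": [0.6] * 10, "WHEAT": []}
jpy_cutoffs = validate_pair_has_walkforward("JPY", results)
try:
    validate_pair_has_walkforward("WHEAT", results)
    wheat_blocked = False
except WalkForwardGateError:
    wheat_blocked = True
```

Raising rather than returning False means a deploy script cannot silently ignore the result, which suits a gate whose purpose is to stop a repeat of the WHEAT incident.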
Counterfactual cost: ~2 hours of round-trip work, but the gate it forced now blocks an entire class of "looked good in tuning, fails on holdout" failures.
Where the ML core lands
Best validated MAPE per pair after Optuna tuning and macro features. The 1.5% line is the original accuracy target; all five pairs sit well under it. Two pairs (JPY, SGD) feed the live paper-trade pipeline; the other three (EUR, GBP, PHP) run as forecasts only — EUR specifically because the vol gate at 0.008 rules it out for trading, even though its MAPE is the second-best in the system.