Mock Pineapple · Step 1 · ML Core

Three Models, One Ensemble — Building the FX Forecaster

38 features, three forecasting models, walk-forward validated, Optuna-tuned, ensemble-weighted by horizon. The biggest accuracy lift came from a question Diane asked mid-sprint — not from the planned roadmap.

38 features

3 models per pair

−90% PHP MAPE after Optuna

−55% SGD MAPE after exogenous features

10+ walk-forward cutoffs

Feature inventory — 38 features in 4 classes

Dumping every available signal into LightGBM hurts more than it helps. Every feature below passed a walk-forward A/B test. That discipline came late: three feature modules shipped without ablation, and two of them (SARIMA residuals, CEEMDAN) turned out to be dead weight. The script run_feature_ablation.py is now a permanent gate.

Lagged price

5 features

Lags t-1, t-2, t-5, t-10, t-22 (1 month). Captures momentum and mean-reversion at multiple time scales. Always-on for SARIMA, gated by feature-importance for LightGBM.
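A minimal sketch of the lag construction described above, in plain Python. The function name and the lag_k column names are illustrative, not taken from the project code.

```python
from typing import Optional

LAGS = (1, 2, 5, 10, 22)  # trading days; 22 is roughly one month


def lag_features(
    prices: list[float], lags: tuple[int, ...] = LAGS
) -> list[dict[str, Optional[float]]]:
    """Return one feature row per day; a lag stays None until enough history exists."""
    rows = []
    for t in range(len(prices)):
        rows.append({f"lag_{k}": prices[t - k] if t >= k else None for k in lags})
    return rows


prices = [float(100 + i) for i in range(30)]
rows = lag_features(prices)
print(rows[25])  # lags of day 25 look back to days 24, 23, 20, 15 and 3
```

Rows with a None lag would normally be dropped before training, so the first 22 days of history produce no usable rows.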

Rolling statistics

9 features

Mean, std, skew over 5/22/63 windows. Volatility regime signal. The 22-day std turned out to be the strongest single LightGBM feature on JPY (carry-related).
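As a rough illustration, three statistics over three windows give the nine columns. This sketch assumes each window ends the day before the row it describes (to avoid look-ahead) and uses the biased skewness estimator; the source specifies neither, and the column names are made up for the example.

```python
import statistics

WINDOWS = (5, 22, 63)


def skewness(xs):
    """Population (biased) skewness; 0.0 when the window has no variance."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    if m2 == 0:
        return 0.0
    m3 = sum((x - mean) ** 3 for x in xs) / n
    return m3 / m2 ** 1.5


def rolling_features(prices, windows=WINDOWS):
    """One row per day with mean/std/skew over each trailing window."""
    rows = []
    for t in range(len(prices)):
        row = {}
        for w in windows:
            if t < w:  # not enough history for this window yet
                row.update({f"mean_{w}": None, f"std_{w}": None, f"skew_{w}": None})
                continue
            window = prices[t - w:t]  # the w days strictly before day t
            row[f"mean_{w}"] = statistics.fmean(window)
            row[f"std_{w}"] = statistics.stdev(window)  # sample standard deviation
            row[f"skew_{w}"] = skewness(window)
        rows.append(row)
    return rows


features = rolling_features([float(i) for i in range(1, 71)])
```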

Macro exogenous (Diane's question)

12 features

DXY, VIX, US 10Y Treasury yield — levels + 1d/7d/30d deltas. VIX regime flag (≥20 = stress). Gold/oil ratio. Cut LightGBM MAPE 14-16% across all pairs. Originated mid-sprint, not planned.
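The level-plus-delta pattern and the VIX stress flag might look like the sketch below. It assumes one observation per day, so "7d" means seven rows back, and it omits the gold/oil ratio. The source does not list which 12 columns make up the module, so the column names and the exact set here are assumptions.

```python
def delta(series, t, k):
    """Change over the last k observations, or None without enough history."""
    return series[t] - series[t - k] if t >= k else None


def macro_row(dxy, vix, us10y, t, stress_level=20.0):
    """Levels and 1/7/30-observation deltas for each series, plus a VIX stress flag."""
    row = {}
    for name, series in (("dxy", dxy), ("vix", vix), ("us10y", us10y)):
        row[f"{name}_level"] = series[t]
        for k in (1, 7, 30):
            row[f"{name}_d{k}"] = delta(series, t, k)
    row["vix_stress"] = int(vix[t] >= stress_level)  # 1 when VIX is at or above 20
    return row


# Synthetic inputs, for illustration only
dxy = list(range(100, 140))
vix = [15] * 39 + [25]
us10y = [4.0] * 40
latest = macro_row(dxy, vix, us10y, t=39)
```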

Yield spreads (carry)

5 features

Pair-specific interest-rate differentials (USD-JPY, USD-SGD). Captures carry-trade pressure. Best on SGD: pulled MAPE from 0.58% to 0.42% (−28%).

Calendar / cyclical

4 features

Day-of-week, month-end flag, quarter-end flag, sin/cos of day-of-year. Small but consistent contribution at long horizons (days 15-30).
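A sketch of these calendar features using only the standard library. It treats "month-end" as the last calendar day of the month (a last-business-day definition is equally plausible and not specified in the source), and it emits the sin/cos encoding as two columns.

```python
import calendar
import math
from datetime import date


def calendar_features(d: date) -> dict:
    """Day-of-week, month/quarter-end flags, and a cyclical day-of-year encoding."""
    days_in_year = 366 if calendar.isleap(d.year) else 365
    angle = 2 * math.pi * d.timetuple().tm_yday / days_in_year
    is_month_end = d.day == calendar.monthrange(d.year, d.month)[1]
    return {
        "day_of_week": d.weekday(),  # Monday = 0
        "is_month_end": int(is_month_end),
        "is_quarter_end": int(is_month_end and d.month in (3, 6, 9, 12)),
        "doy_sin": math.sin(angle),
        "doy_cos": math.cos(angle),
    }


april_end = calendar_features(date(2024, 4, 30))
```

The sin/cos pair keeps 31 December and 1 January adjacent in feature space, which a raw day-of-year integer would not.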

Direct-horizon features

3 buckets × LightGBM

Per-bucket trained predictors: short (h=1-3), medium (h=4-14), long (h=15-30). Each bucket picks its own optimal feature set via Optuna. Avoids recursive-prediction error compounding.
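To make the direct-horizon idea concrete: each model is fit on pairs of (features at day t, price at day t+h), so it never consumes its own predictions. The sketch below shows the bucket lookup and the pair construction; the names are illustrative and the actual training code is not shown in the source.

```python
BUCKETS = {
    "short": range(1, 4),    # h = 1-3
    "medium": range(4, 15),  # h = 4-14
    "long": range(15, 31),   # h = 15-30
}


def bucket_for_horizon(h: int) -> str:
    """Map a forecast horizon in days to its bucket name."""
    for name, horizons in BUCKETS.items():
        if h in horizons:
            return name
    raise ValueError(f"horizon {h} is outside 1-30")


def direct_pairs(features, prices, h):
    """Pair the feature row at day t with the price h days later."""
    return [(features[t], prices[t + h]) for t in range(len(prices) - h)]


pairs = direct_pairs([[0], [1], [2], [3]], [10, 11, 12, 13], h=2)
```

A recursive forecaster would instead predict day t+1, feed that prediction back in as a lag, and repeat, which is where the compounding error comes from.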

Three forecasting models — and where each one wins

No single model is best at every horizon. SARIMA dominates at short horizons, Prophet covers the middle, and LightGBM dominates at long horizons. The ensemble's job is to know that, per pair and per horizon bucket. The stack is statsmodels for SARIMA, facebook/prophet for Prophet, and lightgbm (GPU with CPU fallback) for the gradient-boosted forecaster.

SARIMA

BEST AT: days 1-3

Per-pair (p,d,q)(P,D,Q,s) frozen at weekly retrain. Optuna over autoregressive orders. Strongest at short horizon (MAPE ~0.5% on JPY days 1-3). Degrades quickly past day 14.

Gotcha: on some pairs the fit starts from non-stationary AR parameters and falls back to zeros, with occasional convergence warnings. Caught by walk-forward validation; not a deployment blocker.

Prophet

BEST AT: days 4-14

Trend plus weekly seasonality. Filtered out wherever its MAPE exceeds 2.0× the best model's: auto-excluded from the EUR ensemble (5.3% MAPE) and capped per bucket on the other pairs. Fits in ~1-2 seconds per pair.

Why it stays: stable variance estimates feed the calibrated CIs (S3 KPI: 92-98% coverage).
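The 2.0× exclusion rule can be sketched as a small filter over per-model MAPE. The SGD figures are the ones reported in the tuning table. The EUR dictionary pairs Prophet's 5.3% with the 0.57% post-tuning figure purely for illustration, since the source does not say which model that number belongs to. The function name is hypothetical.

```python
def filter_models(mape_by_model: dict[str, float], max_ratio: float = 2.0) -> dict[str, float]:
    """Keep only models whose MAPE is within max_ratio times the best model's MAPE."""
    best = min(mape_by_model.values())
    return {m: v for m, v in mape_by_model.items() if v <= max_ratio * best}


sgd = {"sarima": 0.28, "lightgbm": 0.26, "prophet": 0.50}
eur = {"lightgbm": 0.57, "prophet": 5.3}  # illustrative pairing, see note above

sgd_kept = filter_models(sgd)  # 0.50 is under 2 x 0.26, so Prophet stays
eur_kept = filter_models(eur)  # 5.3 is far above 2 x 0.57, so Prophet is dropped
```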

LightGBM (Direct, 3-bucket)

BEST AT: days 15-30

Per-pair Optuna params, 38 features. Three models trained independently, one per bucket, at h=2, h=9, and h=23. Consumes the exogenous macro features. Single biggest mover in the system.

Dominant ensemble weight on EUR (56%) and SGD (41%) after exogenous features were added.

Optuna tuning — single largest accuracy driver

The first version was an ensemble with default hyperparameters. It didn't get below 1% MAPE on any pair. Optuna with proper walk-forward validation cut MAPE by 31-90% on the four pairs tuned (table below) and now runs automatically as part of the weekly retrain (with drift gating; see the drift & auto-retune page).

Pair	Before tuning	After tuning	Reduction	Method
PHP	2.72%	0.28%	−90%	Optuna 100 trials, walk-forward
JPY	2.12%	0.59%	−72%	Optuna 100 trials, walk-forward
EUR	1.21%	0.57%	−53%	Optuna 100 trials, walk-forward
GBP	1.45%	1.00%	−31%	Optuna 100 trials, walk-forward
SGD	—	SARIMA 0.28% · LightGBM 0.26% · Prophet 0.50%	tuned in	added in sprint 2
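The walk-forward scoring behind these numbers can be sketched as follows: pick several cutoffs, train on everything before each cutoff, score on the window after it, and average. This is an expanding-window variant with evenly spaced cutoffs. The window sizes, the spacing, and the naive forecaster are all assumptions for illustration, not the project's settings.

```python
def mape(actual, predicted):
    """Mean absolute percentage error, in percent."""
    return 100 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)


def walk_forward_cutoffs(n_obs, n_cutoffs=10, horizon=30, min_train=250):
    """Evenly spaced cutoff indices; each leaves at least `horizon` unseen points after it."""
    last = n_obs - horizon  # latest cutoff that still leaves a full test window
    if n_cutoffs < 2 or last - min_train < n_cutoffs - 1:
        raise ValueError("not enough history for the requested cutoffs")
    step = (last - min_train) // (n_cutoffs - 1)
    return [min_train + i * step for i in range(n_cutoffs)]


def walk_forward_mape(series, fit_predict, n_cutoffs=10, horizon=30, min_train=250):
    """Average out-of-sample MAPE over all cutoffs."""
    scores = []
    for c in walk_forward_cutoffs(len(series), n_cutoffs, horizon, min_train):
        train, test = series[:c], series[c:c + horizon]
        scores.append(mape(test, fit_predict(train, horizon)))
    return sum(scores) / len(scores)


def naive_last_value(train, horizon):
    """Placeholder forecaster: repeat the last observed value."""
    return [train[-1]] * horizon


cutoffs = walk_forward_cutoffs(400)
flat_score = walk_forward_mape([100.0] * 400, naive_last_value)
```

In a tuning loop, the objective for each trial would build a model from the sampled hyperparameters and return this averaged score, so a configuration is only rewarded for accuracy on data it never saw.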

What it bought: the S1 KPI (validated accuracy, <5% MAPE per pair) went from 0/5 to 5/5 in two sprints. S2 (beats baseline) went from having no ensemble to PHP −64% vs single-model. The compute cost is paid once per retrain: ~40 minutes per pair × 5 pairs (about 3 hours 20 minutes), amortized over a week of daily forecasts.

Macro features — the mid-sprint addition that moved the needle

Three days into the second sprint I added 12 exogenous macro features to LightGBM: DXY, VIX, US 10Y Treasury yield (levels and 1d/7d/30d deltas), gold/oil ratio, and a VIX-regime flag. They cut LightGBM MAPE by 14-16% across every pair. Re-tuning the hyperparameters on the new feature set widened the cumulative reduction to 32% on EUR and 55% on SGD; JPY saw no further gain.

Pair	LightGBM MAPE — before macro	After macro	After re-tune	Total reduction
EUR	0.56%	0.46%	0.38%	−32%
SGD	0.58%	0.42%	0.26%	−55%
JPY	0.71%	0.59%	0.59%	−17%

Why these features: FX rates aren't isolated time series — they reflect rate-differential expectations (US 10Y), risk-on/risk-off regimes (VIX), and dollar strength (DXY). Adding macro pulled signal from outside each pair's own history. The deltas matter more than the levels: a 30-day VIX change tells you the regime shifted; the level alone doesn't.

Process note: macro features were near the bottom of the planned milestone list (M59) — the plan was to ship pure FX features first, validate, then layer in exogenous data. I jumped the queue partway through sprint 2 because the model was plateauing on within-pair features alone. Worth doing earlier next time.

What didn't work — the WHEAT counter-example

Not every experiment ships. The honest record matters more than the success rate.

NEGATIVE RESULT · WHY THE GATE EXISTS

WHEAT was deployed overfit and had to be reverted

Instrument expansion (M68-M73) added WHEAT as a candidate for daily forecasting. Optuna found a 2.79% MAPE — looked acceptable on paper. But it had been trained on the full 2017-2024 history with no walk-forward validation. The first out-of-sample period showed 7-9% errors.

Pulled from production same day. The walk-forward validation gate (validate_pair_has_walkforward()) was added immediately afterward — every new instrument now requires ≥10 walk-forward cutoffs before it can ship to the daily pipeline. WHEAT remains in the codebase as a fixture for testing the gate.

Cost: ~2 hours of round-trip work. The gate it forced now blocks an entire class of "looked good in tuning, fails on holdout" failures.
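A gate of this kind can be very small. The sketch below reuses the function name validate_pair_has_walkforward() from the text, but the signature, the results dictionary, and the exception type are hypothetical; the real implementation is not shown in the source.

```python
MIN_CUTOFFS = 10


class WalkForwardGateError(Exception):
    """Raised when an instrument lacks enough walk-forward evidence to ship."""


def validate_pair_has_walkforward(
    pair: str,
    results: dict[str, list[float]],
    min_cutoffs: int = MIN_CUTOFFS,
) -> None:
    """Raise unless `pair` has at least `min_cutoffs` walk-forward scores on record."""
    scores = results.get(pair, [])
    if len(scores) < min_cutoffs:
        raise WalkForwardGateError(
            f"{pair}: {len(scores)} walk-forward cutoffs recorded, {min_cutoffs} required"
        )


# Placeholder scores, one per cutoff; WHEAT has none on record
recorded = {"JPY": [0.5] * 10}
validate_pair_has_walkforward("JPY", recorded)  # passes silently
```

Calling the same function for an instrument with no recorded cutoffs raises, which is the behavior a WHEAT-style test fixture would assert.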

Where the ML core lands

Best validated MAPE per pair after Optuna tuning and macro features. The 1.5% line is the original accuracy target; all five pairs sit well under it. Two pairs (JPY, SGD) feed the live paper-trade pipeline; the other three (EUR, GBP, PHP) run as forecasts only — EUR specifically because the vol gate at 0.008 rules it out for trading, even though its MAPE is the second-best in the system.
