Learn to Read · ML / Audio · Model Building

From Cosine Similarity to 97.4% — Building the Phoneme Classifier

Six iterations. One domain insight. The +7.9% breakthrough that shipped to production. Walk through how I evolved from a hand-weighted cosine baseline into a hybrid neural + classical hierarchical classifier running on-device for 3-year-olds.

§1See it running

The L2R Android app prompts a sound, listens for the kid's response, runs both classifiers on-device, and decides feedback in under 50 ms.

All inference runs on-device. The model is 15.4 KB Track A + WavLM ONNX delivered via Play Asset Delivery. p95 latency 48 ms. The audio you hear never leaves the phone — COPPA-compliant by architecture.

§2The architecture that shipped

Hierarchical routing, not a single ensemble. Track B picks the manner; Track A's per-manner classifiers refine to a specific phoneme. Each within-manner classifier is a single model — the family was picked per-task by cross-validation.

audio · 10–500 ms · 16 kHz

Track B — WavLM fine-tuned

manner classifier · 97.4% across-manner

plosive

voicing RF 94.3%

place LR 68.6%

/b p t d k g/

fricative

sibilance ET 70.8%

/s z ʃ ʒ f v θ ð/

nasal

place GB 71.4%

/m n ŋ/

approximant

type LR 85.7%

/l r w y/

vowel

letter ET 54.5%

/a e i o u/

affricate

voicing roadmap —

/tʃ dʒ/

Track B (neural, 1 model) Manner branch (gated by Track B) Track A (classical, per-task model family)

How it actually runs at inference time

Track B fires once on the raw audio — picks one of six manner classes. Only that manner branch fires. No softmax across branches, no cross-manner ensembling. For plosive, two Track A classifiers run in parallel (voicing × place → cross-product yields the phoneme). For fricative / nasal / approximant / vowel, one classifier fires. Affricate currently routes to /tʃ/ by default — voicing classifier for /tʃ/ vs /dʒ/ is on the Track A roadmap.

Why a different model family per Track A task

Each within-manner task has a different feature-importance shape, so a different model wins. Plosive voicing is dominated by one feature (f_vot_ms) with a clear threshold — Random Forest handles it cleanly. Plosive place is monotonic in burst peak frequency — Logistic Regression fits. Sibilance has nonlinear feature interactions across the 4–8 kHz band — ExtraTrees (random splits) captures that. Nasal place is small-data, narrow-feature — Gradient Boosting's bias correction wins. Each family was picked per-task by cross-validation, not by global selection.

Why hierarchical instead of flat 26-way

Different manner classes have fundamentally different acoustic signatures (Stevens 1998 §7.2). A plosive burst is a 5–20 ms transient; a fricative is 50–200 ms of sustained noise. A flat 26-way classifier would have to use one feature recipe for both. Hierarchical lets each branch use its best features and its best model family.

§3Stack

Python (research + training)

WavLM base+ via transformers, fine-tuned in PyTorch (2-phase: head only → full unfreeze)
ONNX export — FP32 / FP16 / INT8 variants benchmarked; FP16 shipped
librosa feature extraction — 4 extractor scripts producing ~300 features per sample
scikit-learn for Track A — Random Forest, Logistic Regression, ExtraTrees, Gradient Boosting, SVC (picked per task by CV)
LightGBM as an interpretability cross-check on feature importance
Domain-informed augmentation: weak-burst plosives, reduced-frication fricatives, vowel-like glides, denasalization

Kotlin (Android production)

NeuralPhonemeClassifier — ONNX Runtime wrapper for Track B
PhonemeClassifier — Track A within-manner heads (cosine + tier system)
DevelopmentalFeedbackPolicy — 3-axis policy: Sander mastery age + confidence threshold + dev-substitution table
VTLNCalibrator — child speech vocal-tract normalization (factor 1.104 from child F0 mean)
NoiseRejector — 4-gate audio quality filter (VAD + SNR + spectral flatness + duration)

3-axis feedback policy (replaces hard age-4 cutoff). 100% of "child errors" in early field tests were developmentally-normal substitutions per Sander 1972 (e.g., /w/ for /r/ at age 3). The policy combines per-phoneme mastery age (/m/→2yo, /r/→7yo) + model confidence ≥ 0.7 + a known-substitution lookup table. Simulation result: 100% accuracy on real feedback, zero false negatives on developmentally-typical pronunciations. A 3-year-old is never marked wrong for a normal sound.

§4The features that mattered most

Each Track A within-manner task picks its own features. Feature importance and CV accuracy decide both the top feature and the model family. The winning feature for each task is grounded in published phonetics literature.

Within-manner task	Top feature	Model	All data	Child only
Plosive voicing b/p, d/t, g/k	`f_vot_ms` voice onset time	RF	91.1%	94.3%
Plosive place bilabial · alveolar · velar	`f_burst_peak_hz` burst spectral peak	LR	71.4%	68.6%
Fricative sibilance s/z vs f/v	`f_sibilance` 4–8 kHz energy ratio	ET	83.7%	70.8%
Nasal place m · n · ŋ	`d_nasal_band` narrowband 250–350 Hz delta	GB	100%	71.4%
Approximant type l · r · w · y	`s_f3_hz` third formant frequency	LR	95.8%	85.7%
Vowel letter	`general features` F1 / F2 / MFCCs	ET	66.7%	54.5%

Each top feature is the one that phonetics literature predicts should distinguish those phonemes. f_vot_ms for voicing (Lisker & Abramson 1964), f_burst_peak_hz for plosive place (Stevens 1998), f_sibilance for sibilant fricatives (Jongman et al. 2000), narrowband nasal formant for nasal place (Stevens 1998), F3 for /r/-vs-/l/ (Espy-Wilson 1992). The classical model learned the phonetics curriculum. The full feature catalog and contrastive-pair visualizations live on the feature engineering page.

Vowel is the bottleneck — currently lacks F1/F2 formants in the 84-feature overlap available at training time. Full feature re-extraction is queued (Track A roadmap).

§5Evolution of accuracy

Six iterations. One domain insight took Track B from 89.5% → 97.4%.

Five iterations on Track B. The +7.9% delta from 89.5% → 97.4% came not from more data or a bigger model — it came from one domain insight about child plosive burst energy ↓.

5.1 Cosine similarity baseline heuristic, no learning

Originally shipping in L2R V1: 5 hand-weighted features — MFCC 70% + spectral centroid 10% + ZCR 10% + duration 5% + RMS 5%. Cosine match against 26 reference profiles, one feature set for every manner class.

Lesson: Plosive bursts and fricative noise need fundamentally different features. A single feature recipe for all manner classes is structurally wrong. Need a learned classifier — and need to validate the data is good enough to learn on first.

5.2 Random Forest on MFCC — proving the data is good 80.6% CV · 57% speaker-stratified

108 verified samples, 284 features (the full classical catalog). Two evaluation regimes: 5-fold CV and speaker-stratified (train = adult + 4yo, test = 2yo child).

The 23-point drop between CV and speaker-stratified is the entire problem. Classical features overfit to speaker characteristics; they don't generalize across age.

Lesson: The data is fine. The features overfit to the speaker. Need representations that generalize across speaker ages → pretrained SSL.

5.3 WavLM embeddings beat MFCC +14% over MFCC

WavLM base+ as a frozen feature extractor, classical head (RF / LR / SVC) on top. Same speaker-stratified split. +14% over the MFCC baseline.

WavLM chosen over wav2vec2 and HuBERT — Blockmedin et al. (Interspeech 2024) showed WavLM is the best-performing SSL model for children's phoneme recognition.

Lesson: Pretrained SSL representations close the speaker generalization gap. Now go further — fine-tune end-to-end.

5.4 Fine-tune WavLM end-to-end 89.5%

Two-phase fine-tune: (1) classification head only, (2) unfreeze top WavLM layers and train full model.

Phase 1 GO/NO-GO gate: 71.4%. Phase 2 final: 89.5%.

Inspecting errors: 4 of 38 wrong, every single one a 2-year-old plosive. Adult and 4yo plosives were fine; 2yo plosives were systematically misclassified.

Lesson: Fine-tuning helps, but the data distribution still doesn't match what 2-year-olds produce. Time to look at the actual audio.

5.5 The +7.9% breakthrough — domain-informed augmentation ★ headline 92.1% → 97.4%

The insight. I went back to the audio and measured the misclassifications. 2-year-olds produce plosives with half the burst energy of 4-year-olds. Their consonant onsets are softer, often vowelized, sometimes prevoiced.

Side-by-side mel spectrograms of /d/ — 2yo child vs 4yo child — burst region highlighted, 2yo at 61% of 4yo's burst RMS energy — Real verified recordings — `phonics_d.wav` from a 2-year-old (left) and a 4-year-old (right), same in-home phone-mic sessions used for training. The first **80 ms** burst window is highlighted on each. RMS energy in that window: **0.0895** at age 2 vs **0.1465** at age 4 — the 2-year-old produces the burst at **≈ 61%** of the older child's energy. Across the corpus the average ratio is closer to ½; `/d/` visualizes the pattern cleanly. Generic noise/pitch augmentation can't recreate this — the structure is qualitatively different, not just quieter.

First attempt — plosive-only augmentation: 92.1%. Synthesized weak-burst, vowelized, prevoiced plosive variants. +2.6% on plosives — but this skewed the training distribution toward plosives and introduced regressions in fricatives and nasals.

Final — balanced augmentation across all classes: 97.4%. Same domain-informed augmentation philosophy applied to every class proportionally:

Plosive: weak-burst, vowelized, prevoiced
Fricative: reduced-frication noise (child /s/ is often weaker)
Approximant: more vowel-like glides (child /r/→/w/)
Nasal: denasalization variants

All 4 child-plosive errors fixed. Zero regressions in other classes.

Lesson: Generic augmentation (pitch / speed / noise) treats child speech like adult speech with noise. Domain-informed augmentation, applied with class balance, beat it. +7.9% from a domain insight — not from more data, more compute, or a bigger model.

What I tested and discarded

Knowledge distillation: trained a tiny student model. Failed at 94 samples — not enough signal for the smaller capacity. Kept full WavLM via Play Asset Delivery.
Volume normalization: RMS / peak normalization changed 0 predictions across the test set. Removed from pipeline.
WavLM layer probing: probed all 12 transformer layers. Layer 10 wins for pretrained probing, layer 7 wins after fine-tune. Informed the fine-tuning unfreeze schedule.

Source: neural_deep_analysis.py. I tested my assumptions — two failed, one informed the architecture.

§6References

Lisker, L. & Abramson, A.S. (1964) · "Cross-language study of voicing in initial stops" — VOT, voicing boundary in plosives
Stevens, K.N. (1998) · Acoustic Phonetics, MIT Press — manner class signatures, plosive place, nasal anti-formants
Sander, E.K. (1972) · "When are speech sounds learned?" JSHD 37(1) — phoneme acquisition timeline
Jongman, A., Wayland, R. & Wong, S. (2000) · "Acoustic characteristics of English fricatives," JASA 108(3) — sibilance band
Espy-Wilson, C.Y. (1992) · "Acoustic measures for semivowels /w j r l/," JASA 92(2) — F3 for /r/ vs /l/
Lee, S., Potamianos, A. & Narayanan, S. (1999) · "Acoustics of children's speech," JASA 105(3) — child formant frequencies, VTLN basis
Howell, P. & Rosen, S. (1983) · "Production and perception of rise time in the affricate-fricative distinction," JASA 73(3)
Smit, A.B. et al. (1990) · "The Iowa Articulation Norms Project" — acquisition ages
Blockmedin et al. (2024) · "Self-supervised phoneme recognition for children's reading," Interspeech 2024 — WavLM base+ selected as best SSL model

← Phoneme classifier overview Feature engineering →