From audio to model inputs
Three constraints made off-the-shelf phoneme corpora and synthetic audio unusable. We had to record real children, in real homes, on real phones — and verify every clip ourselves.
① Children's speech is NOT adult speech with noise
- 2-year-olds produce plosives with ½ the burst energy of 4-year-olds
- Children's formants are 20–50% higher than adult males' (Lee, Potamianos & Narayanan 1999)
- F0 250–500 Hz vs adult 80–200 Hz
② TTS is bad at isolated phonemes
- Generic TTS (GCP Neural2, etc.) is trained on words and sentences, not isolated phonemes
- Synthetic /f/ doesn't have the right turbulence pattern
- Synthetic plosives lack realistic burst variability
③ On-device requirement (COPPA)
- Audio of children cannot go to the cloud — consent overhead, latency, ongoing cost
- Constraint: model + feature extraction must run on a low-end Android device in <100 ms
A single 26.3-second take on a phone mic. I read each letter of the alphabet with a clear pause between sounds. This one file becomes 25 individually labeled phoneme clips after processing.
From a single take to 25 labeled phonemes
A long recording is cheap to make; the labeling is what's expensive. The pipeline does the boring part automatically and routes only the verification step to me.
Energy-based silence detection
Compute RMS in 25 ms frames (10 ms hop). A frame is "silence" if RMS is more than 30 dB below the recording's peak. Speech regions are runs of non-silent frames separated by gaps of at least 150 ms — long enough to be a real word boundary, short enough to keep adjacent letters in different segments.
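The thresholding and gap rule above can be sketched in a few lines of NumPy. This is a minimal illustration, not the production script: the function names are hypothetical, and "peak" is assumed here to mean the loudest frame's RMS rather than the peak sample amplitude.

```python
import numpy as np

def frame_rms(x, sr, frame_ms=25.0, hop_ms=10.0):
    """RMS per frame: frame_ms-long frames starting every hop_ms."""
    frame = int(round(sr * frame_ms / 1000))
    hop = int(round(sr * hop_ms / 1000))
    n_frames = 1 + max(0, (len(x) - frame) // hop)
    return np.array([
        np.sqrt(np.mean(np.square(x[i * hop:i * hop + frame])))
        for i in range(n_frames)
    ])

def speech_mask(rms, floor_db=-30.0):
    """True for frames less than 30 dB below the loudest frame (assumed 'peak')."""
    ref = max(float(rms.max()), 1e-12)
    level_db = 20.0 * np.log10(np.maximum(rms, 1e-12) / ref)
    return level_db > floor_db

def speech_regions(mask, hop_ms=10.0, min_gap_ms=150.0):
    """Speech runs as (start, end) frame indices, end exclusive.

    Runs separated by a silent gap shorter than min_gap_ms are merged.
    """
    min_gap = int(round(min_gap_ms / hop_ms))
    merged = []
    start = None
    # The trailing False closes a run that reaches the end of the recording.
    for i, is_speech in enumerate(list(mask) + [False]):
        if is_speech and start is None:
            start = i
        elif not is_speech and start is not None:
            if merged and start - merged[-1][1] < min_gap:
                merged[-1] = (merged[-1][0], i)
            else:
                merged.append((start, i))
            start = None
    return merged
```

Multiplying a frame index by the 10 ms hop converts a region boundary back to seconds.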
Morphological cleanup
scipy.ndimage.binary_closing + binary_opening close micro-gaps inside a single phoneme (e.g., the stop closure inside /p/) and remove single-frame blips. Each surviving region is padded by 30 ms on both ends so the burst onset and tail of fricatives aren't clipped.
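As a rough sketch of this cleanup on a per-frame boolean mask: the structuring-element sizes below (5 frames for closing, 2 for opening) are illustrative assumptions, since the text does not state them, and the helper names are hypothetical.

```python
import numpy as np
from scipy.ndimage import binary_closing, binary_opening

def clean_mask(mask, close_frames=5, open_frames=2):
    """Fill short gaps, then drop runs shorter than open_frames.

    close_frames and open_frames are assumed values, in 10 ms frames.
    """
    mask = np.asarray(mask, dtype=bool)
    closed = binary_closing(mask, structure=np.ones(close_frames, dtype=bool))
    return binary_opening(closed, structure=np.ones(open_frames, dtype=bool))

def mask_to_runs(mask):
    """Convert a boolean mask to (start, end) frame runs, end exclusive."""
    padded = np.concatenate(([False], mask, [False])).astype(np.int8)
    edges = np.flatnonzero(np.diff(padded))
    return list(zip(edges[::2].tolist(), edges[1::2].tolist()))

def pad_runs(runs, n_frames, hop_ms=10.0, pad_ms=30.0):
    """Extend each run by pad_ms on both ends, clipped to the recording."""
    pad = int(round(pad_ms / hop_ms))
    return [(max(0, s - pad), min(n_frames, e + pad)) for s, e in runs]
```

With these sizes, a two-frame dip inside a phoneme is bridged and an isolated single frame is removed before padding is applied.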
Auto-export to per-segment WAVs
Every region ≥ 200 ms gets written as raw_a_001.wav … raw_a_025.wav. The original 26.3 s recording produces 25 segments; the 26th letter ran into the prior gap and was rejected by the minimum-duration filter — exactly the kind of edge case I want flagged, not silently kept.
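A minimal sketch of the export step, using only NumPy and the standard-library `wave` module. The function name and the `(written, rejected)` return shape are assumptions for illustration; only the 200 ms threshold and the `raw_a_NNN.wav` naming come from the text.

```python
import wave
from pathlib import Path

import numpy as np

def export_segments(x, sr, regions, out_dir, prefix="raw_a", min_dur_s=0.200):
    """Write each region of at least min_dur_s as 16-bit mono WAV.

    x: float samples in [-1, 1]; regions: (start, end) sample indices.
    Returns (written paths, rejected regions) so short regions are
    reported instead of silently dropped.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    min_len = int(round(min_dur_s * sr))
    written, rejected = [], []
    for start, end in regions:
        if end - start < min_len:
            rejected.append((start, end))
            continue
        pcm = (np.clip(x[start:end], -1.0, 1.0) * 32767).astype("<i2")
        path = out_dir / f"{prefix}_{len(written) + 1:03d}.wav"
        with wave.open(str(path), "wb") as w:
            w.setnchannels(1)
            w.setsampwidth(2)   # 2 bytes per sample = 16-bit PCM
            w.setframerate(sr)
            w.writeframes(pcm.tobytes())
        written.append(path)
    return written, rejected
```

Returning the rejected regions alongside the written files keeps edge cases like the dropped 26th letter visible to the reviewer.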
Diane-verified labeling dashboard
The script also generates an HTML review page: each segment with its own audio control, a text input for the letter, and a manner-class dropdown (auto-filled from the letter — a→vowel, b→plosive, m→nasal, etc.). I listen to every clip, type the letter, and only commit the JSON once each clip sounds right. Ground-truth labels never come from a classifier.
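The auto-fill and the "commit only when verified" rule might look roughly like this. The mapping is deliberately partial and lists only letters whose manner class this page states or implies; the record keys (`file`, `letter`, `manner`, `verified`) and function names are hypothetical.

```python
import json

# Partial letter-to-manner table. Letters not listed are left for the
# reviewer to choose by hand.
MANNER_BY_LETTER = {
    "a": "vowel", "b": "plosive", "p": "plosive", "m": "nasal",
    "f": "fricative", "r": "approximant", "l": "approximant",
}

def suggest_manner(letter):
    """Pre-fill value for the manner dropdown, or None if unknown."""
    return MANNER_BY_LETTER.get(letter.strip().lower())

def commit_labels(entries):
    """Serialize labels to JSON only if a human confirmed every clip.

    entries: list of dicts with hypothetical keys
    'file', 'letter', 'verified', and optionally 'manner'.
    """
    pending = [e["file"] for e in entries if not e.get("verified")]
    if pending:
        raise ValueError(f"unverified clips: {pending}")
    return json.dumps(
        [{"file": e["file"], "letter": e["letter"],
          "manner": e.get("manner") or suggest_manner(e["letter"])}
         for e in entries],
        indent=2,
    )
```

The suggestion is only a default for the dropdown; the reviewer's choice, when present, always wins.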
Hear before you analyze. Vowel /æ/ is sustained and tonal. Fricative /f/ is short and noisy.
/æ/ vowel — the letter "a"
/f/ fricative — the letter "f"
Vowel /æ/: periodic, regular oscillation (vocal-fold vibration). Fricative /f/: aperiodic, noise-like (turbulent airflow through the lips and teeth). The difference is visible by eye before any feature is extracted.
Hot colors = more energy. Vowel: stacked horizontal bands (formants F1 ~700 Hz, F2 ~1700 Hz). Fricative: diffuse high-frequency wash (4–8 kHz energy from turbulence). Feature extraction operates on three windows over this image: onset (first 150 ms), steady (middle 20–80%), and full segment.
When I trained a Random Forest on the full ~300-feature catalog, the importance ranking made one thing immediately obvious: the model wasn't reading the snapshot. It was reading the change.
Four of the features are onset-minus-steady deltas: d_flat, d_rms, d_cent, d_zcr. The model learned that change between onset and steady-state matters more than any single-window snapshot. Manner classes are defined by how the airstream is modified over time, not by a single moment.
Top 5 RF features by importance (trained on 97 verified samples, 65 features, 5-fold CV):
| Rank | Feature | Importance | What it captures |
|---|---|---|---|
| 1 | d_flat | 0.0426 | onset flatness − steady flatness |
| 2 | d_rms | 0.0411 | onset RMS − steady RMS (energy profile shape) |
| 3 | o_m4 | 0.0363 | onset MFCC coefficient 4 (burst characteristics) |
| 4 | s_m11 | 0.0348 | steady-state MFCC coefficient 11 (sustained shape) |
| 5 | d_cent | 0.0331 | onset centroid − steady centroid (brightness shift) |
Plosives front-load energy (positive d_rms); vowels are even (≈0); fricatives plateau or rise. That single shape distinction separates manner classes that look similar in any single window.
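To make the delta idea concrete, here is a small sketch that cuts a segment into the three windows and computes two deltas with the sign convention from the table (onset minus steady). The function names are hypothetical and the flatness definition (geometric over arithmetic mean of the Hann-windowed power spectrum) is a common choice assumed here, not necessarily the production extractor's.

```python
import numpy as np

def split_windows(x, sr, onset_ms=150.0):
    """Onset = first 150 ms, steady = middle 20-80%, full = whole segment."""
    n = len(x)
    onset_len = min(n, int(round(sr * onset_ms / 1000)))
    return {
        "onset": x[:onset_len],
        "steady": x[int(0.2 * n):int(0.8 * n)],
        "full": x,
    }

def rms(x):
    """Root-mean-square amplitude of a window."""
    return float(np.sqrt(np.mean(np.square(x))))

def spectral_flatness(x):
    """Geometric mean / arithmetic mean of the power spectrum (0..1)."""
    power = np.abs(np.fft.rfft(x * np.hanning(len(x)))) ** 2 + 1e-12
    return float(np.exp(np.mean(np.log(power))) / np.mean(power))

def delta_features(x, sr):
    """Cross-window deltas, onset minus steady."""
    w = split_windows(x, sr)
    return {
        "d_rms": rms(w["onset"]) - rms(w["steady"]),
        "d_flat": spectral_flatness(w["onset"]) - spectral_flatness(w["steady"]),
    }
```

On a decaying noise burst `d_rms` comes out clearly positive, while a steady tone gives a value near zero. Note that for segments shorter than 750 ms the 20% mark falls before 150 ms, so the onset and steady windows overlap.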
Random Forest importance gave us the shape of the answer (deltas dominate). Acoustic phonetics literature gave us the specific features. Each card below is a feature that won a single within-manner classification task on the child-only test set, paired with a side-by-side visualization of the two phonemes it distinguishes.
These 5 were prioritized from a 60-feature literature catalog (see the phoneme classifier overview). Each won a specific within-manner classification task in Track A's classical pipeline, and each is grounded in published phonetics literature. The model learned the phonetics curriculum.
The 5 task-winners are the visible top of a deeper extraction pipeline. The classifier sees ~300 features per sample, computed by 4 specialized extractors:
| Extractor | Categories | Approx count |
|---|---|---|
| extract_a_mfcc.py | 13 MFCCs + Δ + ΔΔ × 3 windows | ~130 |
| extract_b_spectral.py | RMS, centroid, bandwidth, rolloff, flatness, contrast (7 bands), flux, skewness, kurtosis, slope, entropy, ZCR, 6-band energy ratios | ~108 |
| extract_c_temporal.py | F0, voicing ratio, HNR, jitter, shimmer, onset strength, VOT, burst peak, burst duration, aspiration | ~50 |
| extract_d_formant.py | LPC F1–F4 + bandwidths via Levinson-Durbin | ~10 |
| Total per sample | | ~300 |
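For readers unfamiliar with the formant row, the sketch below shows the textbook autocorrelation-method LPC with a Levinson-Durbin recursion, then reads formant frequencies and bandwidths off the roots of the prediction polynomial. It is a generic illustration, not the contents of extract_d_formant.py; the function names, the Hamming window, and the default order of 12 are assumptions.

```python
import numpy as np

def autocorr(x, max_lag):
    """Autocorrelation r[0..max_lag] of a (windowed) signal."""
    return np.array([np.dot(x[:len(x) - k], x[k:]) for k in range(max_lag + 1)])

def levinson_durbin(r, order):
    """Solve the LPC normal equations for A(z) = 1 + a1 z^-1 + ... + ap z^-p.

    Returns (a, err) with a[0] == 1 and err the final prediction error.
    """
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        # Reflection coefficient from the current prediction residual.
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err
        prev = a.copy()
        a[1:i] = prev[1:i] + k * prev[i - 1:0:-1]
        a[i] = k
        err *= 1.0 - k * k
    return a, err

def lpc_formants(x, sr, order=12):
    """Candidate formant frequencies and bandwidths (Hz), sorted by frequency."""
    xw = x * np.hamming(len(x))
    a, _ = levinson_durbin(autocorr(xw, order), order)
    roots = np.roots(a)
    roots = roots[np.imag(roots) > 0]          # keep one of each conjugate pair
    freqs = np.angle(roots) * sr / (2 * np.pi)
    bws = -np.log(np.abs(roots)) * sr / np.pi
    idx = np.argsort(freqs)
    return freqs[idx], bws[idx]
```

In practice the candidates still need filtering (for example, discarding very wide bandwidths) before the lowest few are labeled F1 to F4.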
- Window structure: every spectral / temporal feature is computed three times — onset (first 150 ms), steady (middle 20–80%), full segment. Each window also yields cross-window deltas (onset − steady).
- Track A classical input: all ~300 features feed LightGBM / RandomForest / LogisticRegression / ExtraTrees / GradientBoost — model family selected per within-manner task.
- Track B neural input: WavLM consumes raw audio waveform directly. No hand-engineered features. The two tracks complement each other in the production hybrid.
The closing-the-loop check: does what the trained model finds important agree with what the linguistics literature says should matter?
Phonetics theory (Stevens 1998 §7.2) predicts that:
- Fricatives are distinguished by spectral centroid, spectral peak, and sibilance band (4–8 kHz turbulence).
- Plosives are distinguished by VOT (voicing) and burst spectral peak (place of articulation).
- Vowels are distinguished by F1 (height) and F2 (front-back).
- Approximants /r/ vs /l/ are distinguished almost entirely by F3.
- Nasals are distinguished by a narrowband formant near 250–350 Hz.
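Two of the fricative cues in this list, spectral centroid and energy in the 4–8 kHz sibilance band, are simple to compute. The sketch below is illustrative only: the function names are hypothetical, and the centroid is weighted by the power spectrum, which is one of several common conventions.

```python
import numpy as np

def power_spectrum(x, sr):
    """Hann-windowed power spectrum and its frequency axis in Hz."""
    power = np.abs(np.fft.rfft(x * np.hanning(len(x)))) ** 2
    freqs = np.fft.rfftfreq(len(x), d=1.0 / sr)
    return freqs, power

def spectral_centroid(x, sr):
    """Power-weighted mean frequency in Hz."""
    freqs, power = power_spectrum(x, sr)
    return float(np.sum(freqs * power) / max(np.sum(power), 1e-12))

def band_energy_ratio(x, sr, lo_hz=4000.0, hi_hz=8000.0):
    """Fraction of total energy inside [lo_hz, hi_hz].

    The sample rate must be at least 2 * hi_hz for the band to be covered.
    """
    freqs, power = power_spectrum(x, sr)
    band = (freqs >= lo_hz) & (freqs <= hi_hz)
    return float(np.sum(power[band]) / max(np.sum(power), 1e-12))
```

A vowel-like tone puts almost no energy in the sibilance band, while high-frequency noise puts most of it there.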
The trained Random Forest's top features match the literature predictions: flatness change at the top, capturing noise vs tone; spectral centroid change in the top five; F3 winning approximants; sibilance band winning fricatives; and the narrowband nasal formant unlocking nasals. Within-manner tasks were solved by the exact features Stevens predicted.
I didn't pick MFCC by default. I triaged 60+ candidate features against acoustic phonetics theory, validated empirically with Random Forest, and shipped the 5 task-winners — all on a 15.4 KB on-device asset.
References
- Stevens, K. N. (1998). Acoustic Phonetics. MIT Press. — Chapters 7 (plosives), 8 (fricatives), 9 (nasals).
- Lisker, L. & Abramson, A. S. (1964). A cross-language study of voicing in initial stops: Acoustical measurements. Word, 20(3).
- Jongman, A., Wayland, R. & Wong, S. (2000). Acoustic characteristics of English fricatives. JASA, 108(3).
- Espy-Wilson, C. Y. (1992). Acoustic measures for linguistic features distinguishing the semivowels /w j r l/ in American English. JASA, 92(2).
- Howell, P. & Rosen, S. (1983). Production and perception of rise time in the voiceless affricate/fricative distinction. JASA, 73(3).
- Lee, S., Potamianos, A. & Narayanan, S. (1999). Acoustics of children's speech: Developmental changes of temporal and spectral parameters. JASA, 105(3).
- Sander, E. K. (1972). When are speech sounds learned? JSHD, 37(1).
- Peterson, G. E. & Barney, H. L. (1952). Control methods used in a study of the vowels. JASA, 24(2).
- Block Medin et al. (2024). Self-supervised phoneme recognition for children's reading. Interspeech 2024.
Feature values plotted on this page are computed from phoneme_qa/phonics_a.wav and phoneme_qa/phonics_f.wav using the production extractor. Top-5 RF importances from vault/2 dev/vp-ml-neural/phoneme-feature-research-2026-04-02.md. Within-manner accuracies from vault/2 dev/vp-ml-new/2026-04-02-track-a-final-handoff.md.