From audio to model inputs

How I went from "kids' phone-mic recordings in noisy homes" to a ~300-feature representation that the classifier could actually learn from: grounded in acoustic phonetics theory, validated empirically with a Random Forest, and narrowed to the 5 features that each won a within-manner task.

Why we collected our own data

Three constraints made off-the-shelf phoneme corpora and synthetic audio unusable. We had to record real children, in real homes, on real phones — and verify every clip ourselves.

① Children's speech is NOT adult speech with noise

2-year-olds produce plosives with ½ the burst energy of 4-year-olds
Children's formants are 20–50% higher than adult males' (Lee, Potamianos & Narayanan 1999)
Children's F0 is 250–500 Hz, versus 80–200 Hz for adults

Implication: adult-trained features need warping (VTLN) before they apply to kids — and even then, plosive burst structure is qualitatively different.

② TTS is bad at isolated phonemes

Generic TTS (GCP Neural2, etc.) is trained on words and sentences, not isolated phonemes
Synthetic /f/ doesn't have the right turbulence pattern
Synthetic plosives lack realistic burst variability

Used as evaluation baseline only. Never as training data.

③ On-device requirement (COPPA)

Audio of children cannot go to the cloud — consent overhead, latency, ongoing cost
Constraint: model + features must fit on a low-end Android device and run in under 100 ms

Forces classical features (15.4 KB asset) as the on-device layer. Drives the engineering all the way back to which features even fit the budget.

Sample collection data: Diane recording all 26 letters

A single 26.3-second take on a phone mic. I read each letter of the alphabet with a clear pause between sounds. This one file becomes 25 individually labeled phoneme clips after processing.

From a single take to 25 labeled phonemes

A long recording is cheap to make; the labeling is what's expensive. The pipeline does the boring part automatically and routes only the verification step to me.

Energy-based silence detection

Compute RMS in 25 ms frames (10 ms hop). A frame counts as silence if its RMS is more than 30 dB below the recording's peak. Speech regions are runs of non-silent frames separated by gaps of at least 150 ms: long enough to be a real boundary between sounds, short enough to keep adjacent letters in different segments.
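The thresholding above can be sketched in a few lines of NumPy. This is a minimal illustration, not the production script: the function names are hypothetical, "peak" is assumed to mean the loudest frame's RMS, and the input is assumed to be a mono float array.

```python
import numpy as np

def frame_rms(signal, sr, frame_ms=25.0, hop_ms=10.0):
    """RMS of each 25 ms frame, stepped every 10 ms."""
    frame = int(sr * frame_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    n_frames = 1 + max(0, (len(signal) - frame) // hop)
    rms = np.empty(n_frames)
    for i in range(n_frames):
        chunk = signal[i * hop : i * hop + frame]
        rms[i] = np.sqrt(np.mean(chunk ** 2))
    return rms

def speech_mask(rms, floor_db=30.0):
    """True where a frame is within floor_db of the loudest frame (assumed 'peak')."""
    peak = max(rms.max(), 1e-12)
    db_below_peak = 20.0 * np.log10(np.maximum(rms, 1e-12) / peak)
    return db_below_peak > -floor_db

def mask_to_regions(mask, hop_ms=10.0, min_gap_ms=150.0):
    """Group speech frames into (start, end) frame ranges split by gaps >= min_gap_ms."""
    min_gap = int(round(min_gap_ms / hop_ms))
    regions, start, last_speech = [], None, None
    for i, is_speech in enumerate(mask):
        if not is_speech:
            continue
        if start is None:
            start = i
        elif i - last_speech - 1 >= min_gap:
            regions.append((start, last_speech + 1))  # end index is exclusive
            start = i
        last_speech = i
    if start is not None:
        regions.append((start, last_speech + 1))
    return regions
```

Calling `mask_to_regions(speech_mask(frame_rms(signal, sr)))` yields one frame range per detected sound; pauses shorter than 150 ms stay inside a single region.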

Morphological cleanup

scipy.ndimage.binary_closing + binary_opening close micro-gaps inside a single phoneme (e.g., the stop closure inside /p/) and remove single-frame blips. Each surviving region is padded by 30 ms on both ends so the burst onset and tail of fricatives aren't clipped.
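A rough sketch of this cleanup on a per-frame boolean mask follows. The structuring-element lengths (5 and 3 frames) and the helper names are illustrative assumptions, not the values used in the real script; only the 30 ms padding and the two scipy.ndimage calls come from the description above.

```python
import numpy as np
from scipy.ndimage import binary_closing, binary_opening

def clean_mask(mask, close_frames=5, open_frames=3):
    """Fill short gaps inside a sound, then drop very short isolated blips."""
    closed = binary_closing(mask, structure=np.ones(close_frames, dtype=bool))
    return binary_opening(closed, structure=np.ones(open_frames, dtype=bool))

def mask_to_padded_spans(mask, sr, n_samples, hop_ms=10.0, frame_ms=25.0, pad_ms=30.0):
    """Convert runs of True frames to sample spans, padded by pad_ms on both ends."""
    hop = int(sr * hop_ms / 1000)
    frame = int(sr * frame_ms / 1000)
    pad = int(sr * pad_ms / 1000)
    edges = np.diff(np.concatenate([[False], mask, [False]]).astype(np.int8))
    starts = np.flatnonzero(edges == 1)   # first True frame of each run
    ends = np.flatnonzero(edges == -1)    # one past the last True frame
    spans = []
    for s, e in zip(starts, ends):
        lo = max(0, int(s) * hop - pad)
        hi = min(n_samples, (int(e) - 1) * hop + frame + pad)
        spans.append((lo, hi))
    return spans
```

With a 10 ms hop, a 5-frame closing element bridges gaps of up to 4 frames (about 40 ms), and a 3-frame opening element removes runs of one or two frames.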

Auto-export to per-segment WAVs

Every region ≥ 200 ms gets written as raw_a_001.wav … raw_a_025.wav. The original 26.3 s recording produces 25 segments; the 26th letter ran into the prior gap and was rejected by the minimum-duration filter — exactly the kind of edge case I want flagged, not silently kept.
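The minimum-duration filter and the file naming could look roughly like this. It is a sketch under assumptions: the function name and the `prefix` parameter are hypothetical, and the actual WAV writing is left out. Rejected spans are returned rather than dropped so they can be flagged for review.

```python
def name_segments(spans, sr, min_ms=200.0, prefix="raw_a"):
    """Split sample spans into named keepers and rejects by minimum duration."""
    kept, rejected = [], []
    for lo, hi in spans:
        duration_ms = 1000.0 * (hi - lo) / sr
        if duration_ms >= min_ms:
            kept.append((lo, hi))
        else:
            rejected.append((lo, hi))
    # Number the surviving segments from 001 in recording order.
    named = [(f"{prefix}_{i:03d}.wav", span) for i, span in enumerate(kept, start=1)]
    return named, rejected
```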

Diane-verified labeling dashboard

The script also generates an HTML review page: each segment with its own audio control, a text input for the letter, and a manner-class dropdown (auto-filled from the letter — a→vowel, b→plosive, m→nasal, etc.). I listen to every clip, type the letter, and only commit the JSON once each clip sounds right. Ground-truth labels never come from a classifier.

The same pipeline runs over the in-home child recordings (2-year-old, 4-year-old, on a phone mic with normal household noise). It scales from one studio take to dozens of children without changing a line of code — only the silence threshold occasionally needs widening for the noisier homes. Total verified training corpus: 97 pooled samples across 6 manner classes. Synthetic TTS (60 clips, GCP Neural2) is held out as evaluation baseline only — never used for training.

Listen to the two phonemes

Hear before you analyze. Vowel /æ/ is sustained and tonal. Fricative /f/ is short and noisy.

/æ/ vowel — the letter "a"

Diane's voice (training baseline)

2-year-old

4-year-old

/f/ fricative — the letter "f"

Diane's voice (training baseline)

2-year-old (adult prompt — /f/ not yet acquired at age 2)

4-year-old

Waveform — time domain

Vowel /æ/: periodic, regular oscillation (vocal-fold vibration). Fricative /f/: aperiodic, noise-like (turbulent airflow through the lips and teeth). The difference is visible by eye before any feature is extracted.

Mel spectrogram — time × frequency × energy

Hot colors = more energy. Vowel: stacked horizontal bands (formants F1 ~700 Hz, F2 ~1700 Hz). Fricative: diffuse high-frequency wash (4–8 kHz energy from turbulence). Feature extraction operates on three windows over this image: onset (first 150 ms), steady (the span from 20% to 80% of the segment), and full segment.
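The three analysis windows can be expressed as plain slices of a segment. The sketch below assumes the windows are cut on the raw samples (the production extractor may cut on frames instead), and the function name is hypothetical.

```python
import numpy as np

def split_windows(segment, sr, onset_ms=150.0, steady_lo_pct=20, steady_hi_pct=80):
    """Return the onset, steady-state, and full windows of one phoneme segment."""
    n = len(segment)
    onset_end = min(n, int(sr * onset_ms / 1000))
    steady_start = n * steady_lo_pct // 100
    steady_end = n * steady_hi_pct // 100
    return {
        "onset": segment[:onset_end],               # first 150 ms
        "steady": segment[steady_start:steady_end], # 20% to 80% of the segment
        "full": segment,
    }
```

For segments shorter than about 190 ms the onset and steady windows overlap heavily, which is one reason a minimum segment duration matters upstream.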

The cross-window split — change beats snapshot

When I trained a Random Forest on the full ~300-feature catalog, the importance ranking made one thing immediately obvious: the model wasn't reading the snapshot. It was reading the change.

The top features in the Random Forest are dominated by cross-window deltas — d_flat, d_rms, d_cent, d_zcr. The model learned that change between onset and steady-state matters more than any single-window snapshot. Manner classes are defined by how the airstream is modified over time — not by a single moment.
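To make the delta features concrete, here is a simplified sketch that computes four snapshot measures per window and subtracts steady from onset. It is an assumption-laden illustration: each measure is computed once over the whole window rather than frame by frame, and the helper names are hypothetical. Only the `d_` naming and the onset − steady direction come from the text.

```python
import numpy as np

def snapshot(x, sr):
    """Single-window measures: RMS, spectral flatness, centroid, zero-crossing rate."""
    power = np.abs(np.fft.rfft(x * np.hanning(len(x)))) ** 2 + 1e-12
    freqs = np.fft.rfftfreq(len(x), d=1.0 / sr)
    return {
        "rms": float(np.sqrt(np.mean(x ** 2))),
        # Flatness: geometric mean / arithmetic mean (near 1 = noisy, near 0 = tonal).
        "flat": float(np.exp(np.mean(np.log(power))) / np.mean(power)),
        "cent": float(np.sum(freqs * power) / np.sum(power)),
        "zcr": float(np.mean(np.abs(np.diff(np.signbit(x).astype(np.int8))))),
    }

def cross_window_deltas(onset, steady, sr):
    """Delta features: onset value minus steady-state value."""
    o, s = snapshot(onset, sr), snapshot(steady, sr)
    return {f"d_{name}": o[name] - s[name] for name in o}
```

A noisy burst followed by a quieter tonal steady state gives positive values for all four deltas, while a sound that does not change between windows gives values near zero.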

Top 5 RF features by importance (trained on 97 verified samples, 65 features, 5-fold CV):

Rank	Feature	Importance	What it captures
1	`d_flat`	0.0426	onset flatness − steady flatness
2	`d_rms`	0.0411	onset RMS − steady RMS (energy profile shape)
3	`o_m4`	0.0363	onset MFCC coefficient 4 (burst characteristics)
4	`s_m11`	0.0348	steady-state MFCC coefficient 11 (sustained shape)
5	`d_cent`	0.0331	onset centroid − steady centroid (brightness shift)

Plosives front-load energy (positive d_rms); vowels are even (≈0); fricatives plateau or rise. That single shape distinction separates manner classes that look similar in any single window.
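An importance ranking like the one in the table can be produced with scikit-learn along these lines. This is a sketch, not the exact training script: the number of trees, the seed, and the function name are assumptions, and only the Random Forest model and 5-fold CV come from the description above.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def rank_features(X, y, names, n_top=5, seed=0):
    """Return the n_top (name, importance) pairs and the mean 5-fold CV accuracy."""
    forest = RandomForestClassifier(n_estimators=200, random_state=seed)
    cv_accuracy = float(cross_val_score(forest, X, y, cv=5).mean())
    forest.fit(X, y)  # refit on all samples to read the importances
    order = np.argsort(forest.feature_importances_)[::-1][:n_top]
    top = [(names[i], float(forest.feature_importances_[i])) for i in order]
    return top, cv_accuracy
```

Impurity-based importances sum to 1 across all features, so with many features in play individual values of only a few percent are expected.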

The 5 winning features — contrastive pairs

Random Forest importance gave us the shape of the answer (deltas dominate). Acoustic phonetics literature gave us the specific features. Each card below is a feature that won a single within-manner classification task on the child-only test set, paired with a side-by-side visualization of the two phonemes it distinguishes.

VOT — Voice Onset Time

milliseconds · /b/ vs /p/

Time between the burst release and the first vocal fold vibration. Voiced /b/ has near-zero VOT; voiceless /p/ has a 50–60 ms gap of pure aspiration before voicing kicks in.

Plosive voicing · 94.3% child-only · Random Forest

Lisker & Abramson (1964)

VOT contrastive pair: /b/ vs /p/ waveforms with burst and voicing-onset markers
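One simple way to approximate the VOT measurement is to find the first energetic frame (the burst) and then the first frame with a strong autocorrelation peak at a plausible pitch lag (voicing onset). The sketch below is my own illustration under assumptions: the thresholds, frame lengths, F0 range, and function name are hypothetical, and the frame-based approach has a resolution of a few milliseconds, so very short VOTs come out slightly too long.

```python
import numpy as np

def estimate_vot_ms(x, sr, burst_frac=0.05, periodicity_thresh=0.5,
                    f0_range=(150.0, 600.0)):
    """Rough VOT: time from the first energetic frame to the first periodic frame."""
    e_len = int(0.002 * sr)   # 2 ms frames for locating the burst
    p_len = int(0.015 * sr)   # 15 ms frames for detecting periodicity
    hop = int(0.001 * sr)     # 1 ms step

    # Burst: first short frame whose energy exceeds a fraction of the maximum.
    e_starts = np.arange(0, len(x) - e_len + 1, hop)
    energy = np.array([np.sum(x[s:s + e_len] ** 2) for s in e_starts])
    burst_start = int(e_starts[np.argmax(energy > burst_frac * energy.max())])

    # Voicing: first longer frame with a high normalized autocorrelation at a pitch lag.
    lags = range(int(sr / f0_range[1]), int(sr / f0_range[0]) + 1)
    for s in range(burst_start, len(x) - p_len + 1, hop):
        frame = x[s:s + p_len]
        best = 0.0
        for lag in lags:
            a, b = frame[:-lag], frame[lag:]
            denom = np.sqrt(np.sum(a * a) * np.sum(b * b))
            if denom > 0:
                best = max(best, float(np.sum(a * b) / denom))
        if best > periodicity_thresh:
            burst_center = burst_start + e_len / 2
            voicing_center = s + p_len / 2
            return 1000.0 * (voicing_center - burst_center) / sr
    return None  # no voicing found after the burst
```

On child speech the thresholds would need tuning against hand-marked clips, since burst energy and F0 both differ from adult values.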

Burst spectral peak

Hz · /p/ vs /t/

Frequency of the loudest spectral bin during the release burst. Bilabial /p/ has diffuse low-frequency energy (peak < 1.5 kHz); alveolar /t/ concentrates its burst energy at high frequencies (3–5 kHz).

Plosive place · 68.6% child-only · Logistic Regression

Stevens (1998), §7

Burst peak contrastive pair: /p/ vs /t/ spectrograms with peak-frequency markers
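The burst peak is the simplest of the five to compute: window the burst, take an FFT, and read off the loudest bin. In this sketch the function name and the `fmin` cutoff (used to ignore DC and low-frequency rumble) are assumptions, and the burst samples are assumed to have been isolated already.

```python
import numpy as np

def burst_peak_hz(burst, sr, fmin=200.0):
    """Frequency (Hz) of the strongest spectral bin in the burst, above fmin."""
    power = np.abs(np.fft.rfft(burst * np.hanning(len(burst)))) ** 2
    freqs = np.fft.rfftfreq(len(burst), d=1.0 / sr)
    valid = freqs >= fmin
    return float(freqs[valid][np.argmax(power[valid])])
```

The frequency resolution is sr / len(burst), so a 10 ms burst at 16 kHz resolves the peak to about 100 Hz.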

Sibilance band — 4–8 kHz energy ratio

ratio · /s/ vs /f/

Fraction of total energy concentrated in the 4–8 kHz band. Sibilants /s, z, ʃ, ʒ/ have 15–25 dB more energy in this range than non-sibilants /f, v, θ, ð/. Often called "the single most reliable acoustic cue" for fricative subclassification.

Fricative sibilance · 70.8% child-only · Extra Trees

Jongman, Wayland & Wong (2000)

Sibilance band contrastive pair: /s/ vs /f/ spectrograms with 4–8 kHz band highlighted
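The band ratio can be sketched as the share of spectral power that falls between two frequencies. The function name is hypothetical, and the computation here uses a single FFT over the whole window, which is an assumption about the production extractor. The sample rate must be at least 16 kHz for the 4–8 kHz band to be fully represented.

```python
import numpy as np

def band_energy_ratio(x, sr, lo_hz=4000.0, hi_hz=8000.0):
    """Fraction of total spectral power between lo_hz and hi_hz (inclusive)."""
    power = np.abs(np.fft.rfft(x * np.hanning(len(x)))) ** 2
    freqs = np.fft.rfftfreq(len(x), d=1.0 / sr)
    in_band = (freqs >= lo_hz) & (freqs <= hi_hz)
    total = power.sum()
    return float(power[in_band].sum() / total) if total > 0 else 0.0
```

The same function with different band edges gives other band-energy measures, for example `band_energy_ratio(x, sr, 250.0, 350.0)` for a narrow low-frequency band.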

Narrowband nasal formant — 250–350 Hz

band energy · /m/ vs /a/

A characteristic resonance at ~250–300 Hz produced by the nasal cavity. Nasals /m, n/ have strong energy in this narrow band; vowels do not. The pipeline's old lf_ratio at 200–500 Hz was too wide to discriminate.

Nasal place · 71.4% child-only · Gradient Boost

Stevens (1998), §9

Nasal formant contrastive pair: /m/ vs /a/ spectrograms with 250–350 Hz band highlighted

F3 — third formant frequency

Hz · /r/ vs /l/

The third formant frequency, extracted via LPC root-finding. /ɹ/ has a strikingly low F3 (1300–1700 Hz); /l/ has a high F3 (> 2200 Hz). This single feature separates the two approximants that humans most often confuse.

Approximant /r/-/l/ · 85.7% child-only · Logistic Regression

Espy-Wilson (1992)

F3 contrastive pair: /r/ vs /l/ spectrograms with F3 marker
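LPC root-finding for formants can be sketched as follows: estimate predictor coefficients from the autocorrelation (scipy's `solve_toeplitz` uses Levinson recursion), find the roots of the prediction polynomial, and convert root angles to frequencies. The LPC order, the bandwidth cutoff, and the function name are illustrative assumptions. Pre-emphasis is omitted for brevity, and children's higher formants may call for different settings.

```python
import numpy as np
from scipy.linalg import solve_toeplitz

def lpc_formants(x, sr, order=12, min_hz=90.0, max_bw_hz=400.0):
    """Candidate formant frequencies (Hz, ascending) from LPC polynomial roots."""
    x = x * np.hamming(len(x))
    # Autocorrelation at lags 0..order.
    r = np.array([np.dot(x[: len(x) - k], x[k:]) for k in range(order + 1)])
    # Solve the Toeplitz normal equations R a = r[1:] for the predictor coefficients.
    a = solve_toeplitz(r[:order], r[1 : order + 1])
    # Prediction polynomial A(z) = 1 - sum_k a_k z^-k; keep one root per conjugate pair.
    roots = np.roots(np.concatenate(([1.0], -a)))
    roots = roots[np.imag(roots) > 0]
    freqs = np.angle(roots) * sr / (2 * np.pi)
    bandwidths = -np.log(np.abs(roots)) * sr / np.pi
    keep = (freqs > min_hz) & (bandwidths < max_bw_hz)
    return np.sort(freqs[keep])
```

When the first three resonances are all found, F3 is the third entry of the returned array; a missed or spurious root shifts the indexing, so the result is worth sanity-checking against the spectrogram.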

These 5 were prioritized from a 60-feature literature catalog (see the phoneme classifier overview). Each won a specific within-manner classification task in Track A's classical pipeline, and each is grounded in published phonetics literature. The model learned the phonetics curriculum.

The full feature catalog

The 5 task-winners are the visible top of a deeper extraction pipeline. The classifier sees ~300 features per sample, computed by 4 specialized extractors:

Extractor	Categories	Approx count
`extract_a_mfcc.py`	13 MFCCs + Δ + ΔΔ × 3 windows	~130
`extract_b_spectral.py`	RMS, centroid, bandwidth, rolloff, flatness, contrast (7 bands), flux, skewness, kurtosis, slope, entropy, ZCR, 6-band energy ratios	~108
`extract_c_temporal.py`	F0, voicing ratio, HNR, jitter, shimmer, onset strength, VOT, burst peak, burst duration, aspiration	~50
`extract_d_formant.py`	LPC F1–F4 + bandwidths via Levinson-Durbin	~10
Total per sample		~300 features

Window structure: every spectral / temporal feature is computed three times: onset (first 150 ms), steady (the span from 20% to 80% of the segment), and full segment. Cross-window deltas (onset − steady) are then computed from the onset and steady values.
Track A classical input: all ~300 features feed LightGBM / RandomForest / LogisticRegression / ExtraTrees / GradientBoost — model family selected per within-manner task.
Track B neural input: WavLM consumes raw audio waveform directly. No hand-engineered features. The two tracks complement each other in the production hybrid.

Back to phonetics theory

The closing-the-loop check: does what the trained model finds important agree with what the linguistics literature says should matter?

Phonetics theory (Stevens 1998 §7.2) predicts that:

Fricatives are distinguished by spectral centroid, spectral peak, and sibilance band (4–8 kHz turbulence).
Plosives are distinguished by VOT (voicing) and burst spectral peak (place of articulation).
Vowels are distinguished by F1 (height) and F2 (front-back).
Approximants /r/ vs /l/ are distinguished almost entirely by F3.
Nasals are distinguished by a narrowband formant near 250–350 Hz.

The trained models' top features match the literature predictions: flatness change at the top of the Random Forest ranking (capturing noise vs tone), spectral centroid change also in the top five, F3 winning approximants, sibilance band winning fricatives, and the narrowband nasal formant unlocking nasals. Within-manner tasks were solved by the features Stevens predicted.

I didn't pick MFCC by default. I triaged 60+ candidate features against acoustic phonetics theory, validated empirically with Random Forest, and shipped the 5 task-winners — all on a 15.4 KB on-device asset.

References

Stevens, K. N. (1998). Acoustic Phonetics. MIT Press. — Chapters 7 (plosives), 8 (fricatives), 9 (nasals).
Lisker, L. & Abramson, A. S. (1964). A cross-language study of voicing in initial stops: Acoustical measurements. Word, 20(3).
Jongman, A., Wayland, R. & Wong, S. (2000). Acoustic characteristics of English fricatives. JASA, 108(3).
Espy-Wilson, C. Y. (1992). Acoustic measures for linguistic features distinguishing the semivowels /w j r l/ in American English. JASA, 92(2).
Howell, P. & Rosen, S. (1983). Production and perception of rise time in the voiceless affricate/fricative distinction. JASA, 73(3).
Lee, S., Potamianos, A. & Narayanan, S. (1999). Acoustics of children's speech: Developmental changes of temporal and spectral parameters. JASA, 105(3).
Sander, E. K. (1972). When are speech sounds learned? JSHD, 37(1).
Peterson, G. E. & Barney, H. L. (1952). Control methods used in a study of the vowels. JASA, 24(2).
Blockmedin et al. (2024). Self-supervised phoneme recognition for children's reading. Interspeech 2024.

Feature values plotted on this page are computed from phoneme_qa/phonics_a.wav and phoneme_qa/phonics_f.wav using the production extractor. Top-5 RF importances from vault/2 dev/vp-ml-neural/phoneme-feature-research-2026-04-02.md. Within-manner accuracies from vault/2 dev/vp-ml-new/2026-04-02-track-a-final-handoff.md.