
From audio to model inputs

How I went from "kids' phone-mic recordings in noisy homes" to a 300-feature representation that the classifier could actually learn from. Grounded in acoustic phonetics theory, validated empirically with Random Forest, narrowed to the 5 features that won their respective within-manner tasks.
Why we collected our own data

Three constraints made off-the-shelf phoneme corpora and synthetic audio unusable. We had to record real children, in real homes, on real phones — and verify every clip ourselves.

① Children's speech is NOT adult speech with noise

  • 2-year-olds produce plosives with ½ the burst energy of 4-year-olds
  • Children's formants are 20–50% higher than those of adult males (Lee, Potamianos & Narayanan 1999)
  • F0 250–500 Hz vs adult 80–200 Hz
Implication: adult-trained features need warping (VTLN) before they apply to kids — and even then, plosive burst structure is qualitatively different.

② TTS is bad at isolated phonemes

  • Generic TTS (GCP Neural2, etc.) is trained on words and sentences, not isolated phonemes
  • Synthetic /f/ doesn't have the right turbulence pattern
  • Synthetic plosives lack realistic burst variability
Used as evaluation baseline only. Never as training data.

③ On-device requirement (COPPA)

  • Audio of children cannot go to the cloud — consent overhead, latency, ongoing cost
  • Constraint: model + features must fit on a low-end Android device and run in <100 ms
Forces classical features (15.4 KB asset) as the on-device layer. Drives the engineering all the way back to which features even fit the budget.
Sample collection data: Diane recording all 26 letters

A single 26.3-second take on a phone mic. I read each letter of the alphabet with a clear pause between sounds. This one file becomes 25 individually labeled phoneme clips after processing.

From a single take to 25 labeled phonemes

A long recording is cheap to make; the labeling is what's expensive. The pipeline does the boring part automatically and routes only the verification step to me.

1. Energy-based silence detection

Compute RMS in 25 ms frames (10 ms hop). A frame is "silence" if RMS is more than 30 dB below the recording's peak. Speech regions are runs of non-silent frames separated by gaps of at least 150 ms — long enough to be a real word boundary, short enough to keep adjacent letters in different segments.
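
The silence detector described above can be sketched roughly as follows. This is a minimal illustration, not the production script: the function names are hypothetical, and "the recording's peak" is assumed here to mean the loudest frame RMS rather than the peak sample amplitude.

```python
import numpy as np

def frame_rms(signal, sample_rate, frame_ms=25.0, hop_ms=10.0):
    """RMS energy per frame (25 ms frames, 10 ms hop by default)."""
    signal = np.asarray(signal, dtype=float)
    frame_len = int(round(sample_rate * frame_ms / 1000))
    hop_len = int(round(sample_rate * hop_ms / 1000))
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop_len)
    rms = np.empty(n_frames)
    for i in range(n_frames):
        frame = signal[i * hop_len : i * hop_len + frame_len]
        rms[i] = np.sqrt(np.mean(frame ** 2))
    return rms

def speech_mask(rms, threshold_db=30.0):
    """True where a frame is less than threshold_db below the loudest frame.

    Assumption: the reference level is the maximum frame RMS.
    """
    ref = max(float(rms.max()), 1e-12)
    level_db = 20.0 * np.log10(np.maximum(rms, 1e-12) / ref)
    return level_db > -threshold_db

def mask_to_regions(mask, hop_ms=10.0, min_gap_ms=150.0):
    """Turn a frame mask into (start, end) frame ranges, end exclusive.

    Runs of speech separated by fewer than min_gap_ms of silence are merged.
    """
    min_gap = int(round(min_gap_ms / hop_ms))
    runs, start = [], None
    for i, is_speech in enumerate(mask):
        if is_speech and start is None:
            start = i
        elif not is_speech and start is not None:
            runs.append([start, i])
            start = None
    if start is not None:
        runs.append([start, len(mask)])
    merged = []
    for run in runs:
        if merged and run[0] - merged[-1][1] < min_gap:
            merged[-1][1] = run[1]  # gap too short: extend the previous region
        else:
            merged.append(run)
    return [tuple(r) for r in merged]
```

Calling `mask_to_regions(speech_mask(frame_rms(audio, sr)))` yields one frame range per candidate letter; multiplying by the hop length converts frames to samples.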

2. Morphological cleanup

scipy.ndimage.binary_closing + binary_opening close micro-gaps inside a single phoneme (e.g., the stop closure inside /p/) and remove single-frame blips. Each surviving region is padded by 30 ms on both ends so the burst onset and tail of fricatives aren't clipped.
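
A sketch of what this cleanup step might look like on a per-frame boolean mask. The structuring-element sizes (`close_frames`, `open_frames`) and the helper names are assumptions for illustration; only the use of `binary_closing`, `binary_opening`, and the 30 ms padding come from the description above.

```python
import numpy as np
from scipy.ndimage import binary_closing, binary_opening

def clean_mask(mask, close_frames=5, open_frames=3):
    """Fill short gaps, then drop short blips, in a 1-D speech mask.

    close_frames and open_frames are illustrative values, not the
    settings used in the real pipeline.
    """
    mask = np.asarray(mask, dtype=bool)
    # Closing (dilate, then erode) fills gaps shorter than the structure.
    closed = binary_closing(mask, structure=np.ones(close_frames, dtype=bool))
    # Opening (erode, then dilate) removes runs shorter than the structure.
    # Note: SciPy treats frames outside the array as False, so speech that
    # touches the very start or end of the mask can lose edge frames.
    return binary_opening(closed, structure=np.ones(open_frames, dtype=bool))

def pad_regions(regions, n_frames, pad_frames=3):
    """Pad each (start, end) region on both sides, clamped to the recording.

    With a 10 ms hop, pad_frames=3 corresponds to 30 ms.
    """
    return [(max(0, s - pad_frames), min(n_frames, e + pad_frames))
            for s, e in regions]
```

Closing before opening matters: the stop closure inside a plosive is bridged first, so the burst is not mistaken for an isolated blip and deleted.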

3. Auto-export to per-segment WAVs

Every region ≥ 200 ms gets written as raw_a_001.wav through raw_a_025.wav. The original 26.3 s recording produces 25 segments; the 26th letter ran into the prior gap and was rejected by the minimum-duration filter, which is exactly the kind of edge case I want flagged, not silently kept.
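
The minimum-duration filter and file naming could be sketched like this. The function name, the `prefix` parameter, and the choice to return rejected regions alongside kept ones are assumptions made so that short segments are surfaced rather than dropped silently; actual WAV writing is left out.

```python
def select_segments(regions, hop_ms=10.0, min_duration_ms=200.0, prefix="raw_a"):
    """Split frame regions into kept (named) and rejected (too short).

    regions: list of (start_frame, end_frame) with end exclusive.
    Returns (kept, rejected) where
      kept     = [(filename, start_frame, end_frame), ...]
      rejected = [(start_frame, end_frame, duration_ms), ...]
    """
    kept, rejected = [], []
    for start, end in regions:
        duration_ms = (end - start) * hop_ms
        if duration_ms >= min_duration_ms:
            # Files are numbered in order of appearance: 001, 002, ...
            name = f"{prefix}_{len(kept) + 1:03d}.wav"
            kept.append((name, start, end))
        else:
            rejected.append((start, end, duration_ms))
    return kept, rejected
```

Printing or logging the `rejected` list after each run is one simple way to get the "flagged, not silently kept" behavior.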

4. Diane-verified labeling dashboard

The script also generates an HTML review page: each segment with its own audio control, a text input for the letter, and a manner-class dropdown (auto-filled from the letter — a→vowel, b→plosive, m→nasal, etc.). I listen to every clip, type the letter, and only commit the JSON once each clip sounds right. Ground-truth labels never come from a classifier.

The same pipeline runs over the in-home child recordings (2-year-old, 4-year-old, on a phone mic with normal household noise). It scales from one studio take to dozens of children without changing a line of code — only the silence threshold occasionally needs widening for the noisier homes. Total verified training corpus: 97 pooled samples across 6 manner classes. Synthetic TTS (60 clips, GCP Neural2) is held out as evaluation baseline only — never used for training.
Listen to the two phonemes

Hear before you analyze. Vowel /æ/ is sustained and tonal. Fricative /f/ is short and noisy.

/æ/ vowel — the letter "a"

Diane's voice (training baseline)
2-year-old
4-year-old

/f/ fricative — the letter "f"

Diane's voice (training baseline)
2-year-old (adult prompt — /f/ not yet acquired at age 2)
4-year-old
Waveform — time domain

Vowel /æ/: periodic, regular oscillation (vocal-fold vibration). Fricative /f/: aperiodic, noise-like (turbulent airflow through the lips and teeth). The difference is visible by eye before any feature is extracted.

Mel spectrogram — time × frequency × energy

Hot colors = more energy. Vowel: stacked horizontal bands (formants F1 ~700 Hz, F2 ~1700 Hz). Fricative: diffuse high-frequency wash (4–8 kHz energy from turbulence). Feature extraction operates on three windows over this image: onset (first 150 ms), steady (middle 20–80%), and full segment.

The cross-window split — change beats snapshot

When I trained a Random Forest on the full ~300-feature catalog, the importance ranking made one thing immediately obvious: the model wasn't reading the snapshot. It was reading the change.

The top features in the Random Forest are dominated by cross-window deltas: d_flat, d_rms, d_cent, d_zcr. The model learned that change between onset and steady-state matters more than any single-window snapshot. Manner classes are defined by how the airstream is modified over time, not by a single moment.

Top 5 RF features by importance (trained on 97 verified samples, 65 features, 5-fold CV):

Rank | Feature | Importance | What it captures
1 | d_flat | 0.0426 | onset flatness − steady flatness
2 | d_rms | 0.0411 | onset RMS − steady RMS (energy profile shape)
3 | o_m4 | 0.0363 | onset MFCC coefficient 4 (burst characteristics)
4 | s_m11 | 0.0348 | steady-state MFCC coefficient 11 (sustained shape)
5 | d_cent | 0.0331 | onset centroid − steady centroid (brightness shift)

Plosives front-load energy (positive d_rms); vowels are even (≈0); fricatives plateau or rise. That single shape distinction separates manner classes that look similar in any single window.
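
To make the sign convention concrete, here is a minimal sketch of how a cross-window delta such as d_rms or d_zcr might be computed. The helper names are hypothetical, and "middle 20–80%" is assumed to mean the span between 20% and 80% of the segment's duration.

```python
import numpy as np

def split_windows(segment, sample_rate, onset_ms=150.0, steady=(0.2, 0.8)):
    """Return the onset, steady-state, and full windows of one segment."""
    segment = np.asarray(segment, dtype=float)
    n = len(segment)
    onset_len = min(n, int(round(sample_rate * onset_ms / 1000)))
    lo, hi = int(n * steady[0]), int(n * steady[1])
    return {"onset": segment[:onset_len], "steady": segment[lo:hi], "full": segment}

def rms(x):
    """Root-mean-square energy of a window."""
    return float(np.sqrt(np.mean(np.square(x))))

def zcr(x):
    """Fraction of adjacent sample pairs whose sign differs."""
    signs = np.signbit(x)
    return float(np.mean(signs[1:] != signs[:-1]))

def cross_window_deltas(segment, sample_rate):
    """Onset-minus-steady deltas; positive d_rms means front-loaded energy."""
    w = split_windows(segment, sample_rate)
    return {
        "d_rms": rms(w["onset"]) - rms(w["steady"]),
        "d_zcr": zcr(w["onset"]) - zcr(w["steady"]),
    }
```

On a decaying noise burst d_rms comes out positive, on a steady tone it is close to zero, and on a signal whose energy ramps up it is negative, which mirrors the plosive / vowel / fricative pattern described above.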

The 5 winning features — contrastive pairs

Random Forest importance gave us the shape of the answer (deltas dominate). Acoustic phonetics literature gave us the specific features. Each card below is a feature that won a single within-manner classification task on the child-only test set, paired with a side-by-side visualization of the two phonemes it distinguishes.

These 5 were prioritized from a 60-feature literature catalog (see the phoneme classifier overview). Each won a specific within-manner classification task in Track A's classical pipeline, and each is grounded in published phonetics literature. The model learned the phonetics curriculum.

The full feature catalog

The 5 task-winners are the visible top of a deeper extraction pipeline. The classifier sees ~300 features per sample, computed by 4 specialized extractors:

Extractor | Categories | Approx. count
extract_a_mfcc.py | 13 MFCCs + Δ + ΔΔ × 3 windows | ~130
extract_b_spectral.py | RMS, centroid, bandwidth, rolloff, flatness, contrast (7 bands), flux, skewness, kurtosis, slope, entropy, ZCR, 6-band energy ratios | ~108
extract_c_temporal.py | F0, voicing ratio, HNR, jitter, shimmer, onset strength, VOT, burst peak, burst duration, aspiration | ~50
extract_d_formant.py | LPC F1–F4 + bandwidths via Levinson-Durbin | ~10
Total per sample | | ~300 features
  • Window structure: every spectral / temporal feature is computed three times — onset (first 150 ms), steady (middle 20–80%), full segment. Each window also yields cross-window deltas (onset − steady).
  • Track A classical input: all ~300 features feed LightGBM / RandomForest / LogisticRegression / ExtraTrees / GradientBoost — model family selected per within-manner task.
  • Track B neural input: WavLM consumes raw audio waveform directly. No hand-engineered features. The two tracks complement each other in the production hybrid.
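
The formant extractor in the table is described as LPC via Levinson-Durbin. A textbook version of that recursion, plus the usual root-finding step to turn LPC coefficients into formant frequencies and bandwidths, might look like the sketch below. It is not the contents of extract_d_formant.py; the function names and any pre-processing (pre-emphasis, windowing, LPC order) are assumptions left to the caller.

```python
import numpy as np

def autocorrelation(frame, max_lag):
    """Autocorrelation r[0..max_lag] of one (already windowed) frame."""
    frame = np.asarray(frame, dtype=float)
    n = len(frame)
    return np.array([np.dot(frame[: n - lag], frame[lag:])
                     for lag in range(max_lag + 1)])

def levinson_durbin(r, order):
    """Solve the LPC normal equations from autocorrelation r.

    Returns (a, err): a[0] == 1 and the prediction-error filter is
    A(z) = sum_k a[k] z^-k; err is the final prediction error power.
    Assumes r[0] > 0 and a non-degenerate signal (err stays positive).
    """
    r = np.asarray(r, dtype=float)
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = float(r[0])
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1 : 0 : -1])
        k = -acc / err                      # reflection coefficient
        prev = a.copy()
        for j in range(1, i):
            a[j] = prev[j] + k * prev[i - j]
        a[i] = k
        err *= 1.0 - k * k
    return a, err

def lpc_to_formants(a, sample_rate):
    """Formant candidates from the roots of A(z), sorted by frequency.

    Keeps one root from each complex-conjugate pair. Returns
    (frequencies_hz, bandwidths_hz).
    """
    roots = np.roots(a)
    roots = roots[np.imag(roots) > 0]
    freqs = np.angle(roots) * sample_rate / (2.0 * np.pi)
    bandwidths = -np.log(np.abs(roots)) * sample_rate / np.pi
    order = np.argsort(freqs)
    return freqs[order], bandwidths[order]
```

In practice the candidates are usually filtered by bandwidth and plausible frequency range before being labeled F1 through F4; that selection logic is not shown here.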
Back to phonetics theory

The closing-the-loop check: does what the trained model finds important agree with what the linguistics literature says should matter?

Phonetics theory (Stevens 1998 §7.2) predicts that:

  • Fricatives are distinguished by spectral centroid, spectral peak, and sibilance band (4–8 kHz turbulence).
  • Plosives are distinguished by VOT (voicing) and burst spectral peak (place of articulation).
  • Vowels are distinguished by F1 (height) and F2 (front-back).
  • Approximants /r/ vs /l/ are distinguished almost entirely by F3.
  • Nasals are distinguished by a narrowband formant near 250–350 Hz.

The trained Random Forest's top features match the literature predictions: flatness change (noise vs tone) ranks first and spectral centroid change is in the top 5. The within-manner winners line up as well: F3 for approximants, the sibilance band for fricatives, and the narrowband nasal formant for nasals. Within-manner tasks were solved by the features Stevens predicted.
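
As an illustration of why flatness separates noise from tone, here is a small sketch using the standard definitions (flatness as geometric mean over arithmetic mean of the power spectrum, centroid as the power-weighted mean frequency). The function names and the Hann window are assumptions; the production extractor may differ in framing and normalization.

```python
import numpy as np

def power_spectrum(x, sample_rate):
    """One-sided power spectrum of a Hann-windowed signal."""
    x = np.asarray(x, dtype=float)
    windowed = x * np.hanning(len(x))
    power = np.abs(np.fft.rfft(windowed)) ** 2
    freqs = np.fft.rfftfreq(len(x), d=1.0 / sample_rate)
    return freqs, power

def spectral_flatness(power, eps=1e-12):
    """Geometric mean / arithmetic mean of the power spectrum.

    Near 0 for tonal signals, much higher for noise-like signals.
    eps avoids log(0) on empty bins.
    """
    power = np.asarray(power, dtype=float) + eps
    return float(np.exp(np.mean(np.log(power))) / np.mean(power))

def spectral_centroid(freqs, power):
    """Power-weighted mean frequency in Hz (the 'brightness' of a window)."""
    return float(np.sum(freqs * power) / np.sum(power))
```

A pure tone scores a flatness very close to 0, while white noise scores far higher, so the onset-minus-steady difference in flatness tracks whether a sound moves from noisy to tonal or the reverse.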

I didn't pick MFCC by default. I triaged 60+ candidate features against acoustic phonetics theory, validated empirically with Random Forest, and shipped the 5 task-winners — all on a 15.4 KB on-device asset.

References

  1. Stevens, K. N. (1998). Acoustic Phonetics. MIT Press. — Chapters 7 (plosives), 8 (fricatives), 9 (nasals).
  2. Lisker, L. & Abramson, A. S. (1964). A cross-language study of voicing in initial stops: Acoustical measurements. Word, 20(3).
  3. Jongman, A., Wayland, R. & Wong, S. (2000). Acoustic characteristics of English fricatives. JASA, 108(3).
  4. Espy-Wilson, C. Y. (1992). Acoustic measures for linguistic features distinguishing the semivowels /w j r l/ in American English. JASA, 92(2).
  5. Howell, P. & Rosen, S. (1983). Production and perception of rise time in the voiceless affricate/fricative distinction. JASA, 73(3).
  6. Lee, S., Potamianos, A. & Narayanan, S. (1999). Acoustics of children's speech: Developmental changes of temporal and spectral parameters. JASA, 105(3).
  7. Sander, E. K. (1972). When are speech sounds learned? JSHD, 37(1).
  8. Peterson, G. E. & Barney, H. L. (1952). Control methods used in a study of the vowels. JASA, 24(2).
  9. Blockmedin et al. (2024). Self-supervised phoneme recognition for children's reading. Interspeech 2024.

Feature values plotted on this page are computed from phoneme_qa/phonics_a.wav and phoneme_qa/phonics_f.wav using the production extractor. Top-5 RF importances from vault/2 dev/vp-ml-neural/phoneme-feature-research-2026-04-02.md. Within-manner accuracies from vault/2 dev/vp-ml-new/2026-04-02-track-a-final-handoff.md.