PROBLEM: A speech model trained on adult corpora misclassifies a 4-year-old's speech. Children have a higher fundamental frequency, shorter vocal tracts, and different formant ratios. The model needs ground-truth child recordings, but COPPA-friendly child speech datasets don't exist for this use case.
WHY IT MATTERS: No anonymous internet "child speech" dataset is COPPA-safe to train on. The pragmatic answer: I record my own children, label every clip myself, and gate everything on parental consent (mine). This dataset backs the VTLN normalization factor, the developmental substitution table, and the 97.4% across-manner classification accuracy of the WavLM fine-tune.
WHAT YOU'RE LISTENING TO: Diane-verified phoneme recordings — adult prompt + child response, captured at age 2 and again at age 4. Each row is a letter; each column is one of the 4 voices. Diane labels which clips passed verification. F0 (pitch) is shown when measured.
STACK: iPhone Voice Memos for capture, librosa.pyin for F0 extraction, JSON manifests for per-clip labels and prompt/response pairing.
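A minimal sketch of what a per-clip manifest with prompt/response pairing could look like. The field names (`letter`, `age`, `adult_clip`, `child_clip`, `verified`, `f0_hz`) and the sample values are assumptions for illustration only; the project's actual schema is not shown here.

```python
import json
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical manifest: field names and values are placeholders,
# not the project's real schema or measurements.
MANIFEST_JSON = """
[
  {"letter": "r", "age": "2yo", "adult_clip": "adult/r.m4a",
   "child_clip": "child/r.m4a", "verified": true, "f0_hz": 350.0},
  {"letter": "s", "age": "2yo", "adult_clip": "adult/s.m4a",
   "child_clip": "child/s.m4a", "verified": false, "f0_hz": null}
]
"""


@dataclass
class ClipPair:
    """One adult prompt paired with the child's response to it."""
    letter: str
    age: str
    adult_clip: str
    child_clip: str
    verified: bool            # True if the clip passed manual verification
    f0_hz: Optional[float]    # None when F0 was not measured


def load_pairs(text: str) -> List[ClipPair]:
    """Parse a JSON manifest into typed prompt/response pairs."""
    return [ClipPair(**row) for row in json.loads(text)]


def verified_only(pairs: List[ClipPair]) -> List[ClipPair]:
    """Keep only the pairs that passed verification."""
    return [p for p in pairs if p.verified]


pairs = load_pairs(MANIFEST_JSON)
usable = verified_only(pairs)
```

Keeping the verification label and the optional F0 value on the same record as the pairing means a downstream script can filter and measure in one pass without joining separate files.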
Child Voice Samples — 2yo and 4yo phoneme dataset
Diane-verified ground-truth recordings · adult prompt → child response · pitch-paired
"If you train a speech model on adult corpora and ship it to a 4-year-old, it fails badly — because a child's F0 is ~345 Hz versus an adult's ~226 Hz, and adult formants warp incorrectly. So I built the simplest possible ground-truth: my own kids, recorded over two years, every clip listened to and labeled. This is the dataset that grounds the VTLN factor and the developmental substitution table."
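To put the two F0 figures quoted above on a common scale, the short calculation below expresses the child/adult gap as a ratio and as a musical interval. It uses only the ~345 Hz and ~226 Hz values from the quote; the semitone conversion is standard arithmetic, not a step taken from the project's code.

```python
import math

child_f0_hz = 345.0  # approximate child F0 from the quote
adult_f0_hz = 226.0  # approximate adult F0 from the quote

# Ratio of the two pitches, and the same gap in semitones (12 per octave).
ratio = child_f0_hz / adult_f0_hz
semitones = 12 * math.log2(ratio)

print(f"ratio={ratio:.2f}, interval={semitones:.1f} semitones")
# prints: ratio=1.53, interval=7.3 semitones
```

A gap of roughly seven semitones is a little more than a musical fifth, which gives a sense of how far a child's voice sits from what an adult-trained model has heard.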
2yo adult prompts (Diane)
4yo adult prompts (Diane)
The 4 voices, side by side
2yo adult — Diane prompts a 2-year-old
2yo child — 2-year-old's response
4yo adult — Diane prompts a 4-year-old
4yo child — 4-year-old's response
What this dataset enabled
- VTLN factor — adult-to-child vocal-tract-length warping derived from per-letter F0 measurement
- Developmental substitution table — at age 2 a child says /w/ for /r/; at 4 it's gone. This pattern is in the data and is gated by Sander 1972 norms in the classifier
- 97.4% across-manner classification on the WavLM fine-tune — using THIS data, not LibriSpeech
- QA loop closure — every shipped audio clip is verified against my child's actual pronunciations, not adult expectations
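The developmental substitution idea in the list above can be sketched as a small lookup that decides whether a mismatch is age-expected or a genuine error. The table name, the function, and the three-way labels are hypothetical. The single entry is the /w/-for-/r/ pattern at age 2 mentioned above, and the Sander 1972 norms themselves are not reproduced here.

```python
from typing import Dict, Set, Tuple

# Hypothetical table: (target phoneme, produced phoneme) -> ages in years
# at which the substitution was observed. Only the /w/-for-/r/ pattern
# at age 2 comes from the text; a real table would hold many more rows.
OBSERVED_SUBSTITUTIONS: Dict[Tuple[str, str], Set[int]] = {
    ("r", "w"): {2},
}


def score_response(target: str, produced: str, age_years: int) -> str:
    """Label a child's response as correct, developmental, or error."""
    if produced == target:
        return "correct"
    ages = OBSERVED_SUBSTITUTIONS.get((target, produced), set())
    if age_years in ages:
        # Age-expected substitution: not counted as a mistake.
        return "developmental"
    return "error"


print(score_response("r", "w", 2))  # developmental
print(score_response("r", "w", 4))  # error
print(score_response("r", "r", 4))  # correct
```

Separating "developmental" from "error" lets the same classifier output be judged differently for a 2-year-old and a 4-year-old without retraining anything.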
Audio source: projects/neural_phonics/data/{2yo,4yo}_reviewed/{adult,child}/
Mapping with F0 + duration per clip: projects/neural_phonics/data/{2yo,4yo}_reviewed/mapping.json
Used by: extract_features_vpdata.py for feature extraction · train_phoneme_lgbm.py for LGBM training · neural_finetune_v2.py for WavLM fine-tune
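As a rough illustration of how per-clip F0 values from a mapping file could feed a VTLN factor, the sketch below takes the median child/adult F0 ratio across letters and applies a piecewise-linear frequency warp. Everything here is an assumption: the function names, the use of the median, using the F0 ratio directly as the warp factor, the 8 kHz Nyquist, and the 0.85 knee. The F0 numbers are placeholders, not measurements from the dataset.

```python
from statistics import median
from typing import Dict


def warp_factor(adult_f0: Dict[str, float], child_f0: Dict[str, float]) -> float:
    """Median child/adult F0 ratio over letters measured for both voices."""
    shared = sorted(set(adult_f0) & set(child_f0))
    if not shared:
        raise ValueError("no letters with F0 measured for both voices")
    return median(child_f0[k] / adult_f0[k] for k in shared)


def warp_frequency(freq_hz: float, alpha: float,
                   nyquist_hz: float = 8000.0, knee: float = 0.85) -> float:
    """Piecewise-linear warp: scale by alpha below a cutoff, then follow a
    straight line that maps the Nyquist frequency onto itself."""
    # Shrink the cutoff when alpha > 1 so alpha * cutoff stays below Nyquist.
    cutoff = knee * nyquist_hz * min(1.0, 1.0 / alpha)
    if freq_hz <= cutoff:
        return alpha * freq_hz
    slope = (nyquist_hz - alpha * cutoff) / (nyquist_hz - cutoff)
    return alpha * cutoff + slope * (freq_hz - cutoff)


# Placeholder per-letter F0 values in Hz (not real data).
adult = {"a": 200.0, "b": 220.0}
child = {"a": 300.0, "b": 352.0}

alpha = warp_factor(adult, child)
warped = warp_frequency(1000.0, alpha)
```

The median is used so that one badly tracked clip does not dominate the factor. A real pipeline might clamp or rescale the ratio before warping, since F0 is only a proxy for vocal-tract length.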