
PROBLEM: A speech model trained on adult corpora misclassifies a 4-year-old's speech. Children have a higher fundamental frequency (F0), shorter vocal tracts, and different formant ratios. The model needs ground-truth child recordings, but COPPA-friendly child speech datasets don't exist for this use case.

WHY IT MATTERS: No anonymous internet "child speech" dataset is COPPA-safe to train on. The pragmatic answer: I record my own children, label every clip myself, and gate the recordings on parental consent (mine). This dataset backs the VTLN (vocal tract length normalization) factor, the developmental substitution table, and the 97.4% across-manner accuracy.

WHAT YOU'RE LISTENING TO: Diane-verified phoneme recordings — adult prompt + child response, captured at age 2 and again at age 4. Each row is a letter; each column is one of the 4 voices. Diane labels which clips passed verification. F0 (pitch) is shown when measured.

STACK: iPhone Voice Memos for capture, librosa pyin for F0 extraction, JSON manifests for per-clip labels and prompt/response pairing

Child Voice Samples — 2yo and 4yo phoneme dataset

Diane-verified ground-truth recordings · adult prompt → child response · pitch-paired

"If you train a speech model on adult corpora and ship it to a 4-year-old, it fails badly — because a child's F0 is ~345 Hz versus an adult's ~226 Hz, and adult formants warp incorrectly. So I built the simplest possible ground-truth: my own kids, recorded over two years, every clip listened to and labeled. This is the dataset that grounds the VTLN factor and the developmental substitution table."
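To put a number on that gap, the two figures in the quote can be turned into a ratio and a musical interval. The sketch below is plain arithmetic on the quoted ~345 Hz and ~226 Hz values; the function names are illustrative, and the result is a pitch ratio, not necessarily the VTLN factor the project actually uses.

```python
import math

CHILD_F0_HZ = 345.0  # approximate child F0 quoted above
ADULT_F0_HZ = 226.0  # approximate adult F0 quoted above


def f0_ratio(child_hz: float, adult_hz: float) -> float:
    """Return how many times higher the child's pitch is than the adult's."""
    return child_hz / adult_hz


def semitone_gap(child_hz: float, adult_hz: float) -> float:
    """Return the pitch gap in semitones (12 semitones = one octave)."""
    return 12.0 * math.log2(child_hz / adult_hz)


if __name__ == "__main__":
    print(f"ratio: {f0_ratio(CHILD_F0_HZ, ADULT_F0_HZ):.2f}")
    print(f"semitones: {semitone_gap(CHILD_F0_HZ, ADULT_F0_HZ):.1f}")
```

With the quoted values, the child's pitch is roughly 1.5 times the adult's, a little over seven semitones.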

2yo adult prompts (Diane)
2yo child responses
4yo adult prompts (Diane)
4yo child responses

The 4 voices, side by side

2yo adult: Diane prompts a 2-year-old
2yo child: 2-year-old's response
4yo adult: Diane prompts a 4-year-old
4yo child: 4-year-old's response

Letter | 2yo · adult | 2yo · child | 4yo · adult | 4yo · child

What this dataset enabled

VTLN (vocal tract length normalization) factor: adult-to-child vocal-tract-length warping derived from per-letter F0 measurements
Developmental substitution table: at age 2 a child substitutes /w/ for /r/; by age 4 the substitution is gone. The pattern appears in the data, and the classifier gates it using the Sander (1972) norms
97.4% across-manner classification accuracy on the WavLM fine-tune: using THIS data, not LibriSpeech
QA loop closure: every shipped audio clip is verified against my child's actual pronunciations, not adult expectations
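The age gating in the developmental substitution item can be pictured as a small lookup. The following is a minimal sketch, not the classifier's actual logic: the table layout, the `judge` function, and the 3.0-year cutoff are all assumptions. Only the /w/-for-/r/ pair and the two recording ages come from the text; the cutoff is a placeholder and is not taken from Sander 1972.

```python
from typing import Dict, Tuple

# (target phoneme, produced phoneme) -> age in years below which the
# substitution is treated as developmentally expected.
# The /r/ -> /w/ pair comes from the text; the 3.0 cutoff is a PLACEHOLDER
# chosen to sit between the two recording ages, not a value from Sander 1972.
EXPECTED_SUBSTITUTIONS: Dict[Tuple[str, str], float] = {
    ("r", "w"): 3.0,
}


def judge(target: str, produced: str, age_years: float) -> str:
    """Classify a response as 'correct', 'expected_substitution', or 'error'."""
    if produced == target:
        return "correct"
    cutoff = EXPECTED_SUBSTITUTIONS.get((target, produced))
    if cutoff is not None and age_years < cutoff:
        return "expected_substitution"
    return "error"
```

Under these assumptions, `judge("r", "w", 2.0)` is treated as an expected substitution, while the same response at age 4 counts as an error.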

Audio source: projects/neural_phonics/data/{2yo,4yo}_reviewed/{adult,child}/
Mapping with F0 + duration per clip: projects/neural_phonics/data/{2yo,4yo}_reviewed/mapping.json
Used by: extract_features_vpdata.py for feature extraction · train_phoneme_lgbm.py for LGBM training · neural_finetune_v2.py for WavLM fine-tune
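As an illustration of how a per-clip mapping with F0 could feed a per-letter comparison, here is a hedged sketch. The manifest layout, field names (`letter`, `voice`, `f0_hz`, `duration_s`), and every number in the example are invented for illustration; the real mapping.json may be organised differently, and taking the median of per-letter ratios is just one simple way to summarise them, not necessarily how the VTLN factor was derived.

```python
import json
from statistics import median
from typing import Dict, List

# HYPOTHETICAL manifest layout with made-up values: a list of clips, each
# with a letter, a voice ("adult" or "child"), a mean F0 in Hz (null when
# not measured), and a duration in seconds.
EXAMPLE_MANIFEST = """
[
  {"letter": "a", "voice": "adult", "f0_hz": 220.0, "duration_s": 0.60},
  {"letter": "a", "voice": "child", "f0_hz": 330.0, "duration_s": 0.72},
  {"letter": "b", "voice": "adult", "f0_hz": 230.0, "duration_s": 0.41},
  {"letter": "b", "voice": "child", "f0_hz": 368.0, "duration_s": 0.55},
  {"letter": "c", "voice": "adult", "f0_hz": 225.0, "duration_s": 0.50},
  {"letter": "c", "voice": "child", "f0_hz": null, "duration_s": 0.48}
]
"""


def per_letter_f0_ratio(clips: List[dict]) -> Dict[str, float]:
    """Return the child/adult F0 ratio for each letter where both were measured."""
    by_letter: Dict[str, Dict[str, float]] = {}
    for clip in clips:
        if clip.get("f0_hz") is None:
            continue  # skip clips whose F0 was not measured
        by_letter.setdefault(clip["letter"], {})[clip["voice"]] = clip["f0_hz"]
    return {
        letter: voices["child"] / voices["adult"]
        for letter, voices in by_letter.items()
        if "adult" in voices and "child" in voices
    }


def summary_ratio(clips: List[dict]) -> float:
    """Median of the per-letter ratios, as a single summary factor."""
    return median(per_letter_f0_ratio(clips).values())


clips = json.loads(EXAMPLE_MANIFEST)
ratios = per_letter_f0_ratio(clips)
```

Letters with an unmeasured F0 on either side are dropped rather than guessed, which matches the note that F0 is shown only when measured.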