- 4-dimension gesture score (Spec Compliance / Silhouette / Distinctiveness / Safety) running per export
- 13 validators turning animation rules into testable contracts — geometry, motion, viseme coverage, structural integrity
- Append-only score log so I can see if a manifest change made things better or worse, not just whether it passed
- The objective function is in place; the next experiment loop is parameter sweeps over it
Validators as testable contracts
§1 The 4-dimension score
Every wired .riv goes through a post-wire scorer that returns a structured report — score per dimension, every issue itemized, manifest hash for diffability:
```python
# scripts/gesture_postwire_eval.py:304
dimensions = {
    "spec_compliance": DimensionScore(
        score=_score_from_penalty(25, spec_issues, 10),
        max_score=25,
        issues=spec_issues,  # manifest preflight + semantic checks
    ),
    "silhouette": DimensionScore(
        score=_score_from_penalty(25, silhouette_issues, 6),
        max_score=25,
        issues=silhouette_issues,  # SCALE: translation visibility at phone zoom
    ),
    "distinctiveness": DimensionScore(
        score=_score_from_penalty(25, distinct_issues, 8),
        max_score=25,
        issues=distinct_issues,  # SIMILAR: shared params differ <25%
    ),
    "safety": DimensionScore(
        score=25 if not safety_issues else 0,  # hard fail
        max_score=25,
        issues=safety_issues,  # LIMIT/ILLEGAL + bounds
    ),
}
```
Three of the four dimensions are penalty-from-25 with diminishing returns per issue. Safety is a hard gate: any breach (a pose outside its envelope, a missing runtime token) drops it to 0. You can ship a 70/100 with safety = 25; you cannot ship anything with safety = 0.
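The body of `_score_from_penalty` isn't shown here, so the following is only a sketch of what "penalty-from-25 with diminishing returns" could look like. It assumes the third argument is the cost of the first issue and that each further issue costs half as much as the one before. The names `score_from_penalty`, `decay`, and `can_ship` are hypothetical, not taken from the codebase.

```python
def score_from_penalty(max_score: float, issues: list[str],
                       first_penalty: float, decay: float = 0.5) -> float:
    """Hypothetical stand-in for _score_from_penalty (assumed behavior).

    The first issue costs `first_penalty`; every later issue costs
    `decay` times the previous one, so extra issues matter less and less.
    """
    penalty = sum(first_penalty * decay ** i for i in range(len(issues)))
    return max(0.0, max_score - penalty)


def can_ship(scores: dict[str, float]) -> bool:
    """Safety is a hard gate: a zero there blocks shipping regardless of the total."""
    return scores["safety"] > 0


# One spec issue, a clean silhouette, two distinctiveness issues, safety intact.
scores = {
    "spec_compliance": score_from_penalty(25, ["missing role"], 10),
    "silhouette": score_from_penalty(25, [], 6),
    "distinctiveness": score_from_penalty(25, ["SIMILAR a/b", "SIMILAR a/c"], 8),
    "safety": 25.0,
}
total = sum(scores.values())
```

Under these assumptions the total lands at 78/100 and still ships, while any run with `safety` at 0 is rejected no matter how high the other three dimensions score.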
§2 The silhouette test
"Silhouette" in this codebase is a semantic test, not a pixel one — it asks whether two gestures would be distinguishable at a glance on a small screen. The check: any two gestures sharing 2+ parameters must differ by more than 25% on at least one of them. If they don't, they will look the same:
```python
# scripts/gesture_audit.py:175 — _check_distinctiveness
def _check_distinctiveness(scaled: dict) -> list[str]:
    """Any two gestures sharing >=2 roles should differ by >25% on at least one."""
    issues = []
    gestures = list(scaled.keys())
    for i, g1 in enumerate(gestures):
        for g2 in gestures[i+1:]:
            shared_roles = set(scaled[g1].keys()) & set(scaled[g2].keys())
            if len(shared_roles) < 2:
                continue
            has_distinct = False
            for role in shared_roles:
                v1 = abs(scaled[g1][role])
                v2 = abs(scaled[g2][role])
                max_v = max(v1, v2, 0.01)
                diff_ratio = abs(v1 - v2) / max_v
                if diff_ratio > MIN_DISTINCT_RATIO:  # 0.25 = 25%
                    has_distinct = True
                    break
            if not has_distinct:
                issues.append(f"SIMILAR: {g1} vs {g2} — shared roles {sorted(shared_roles)} "
                              f"all differ by <25%. May look identical.")
    return issues
```
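To make the 25% rule concrete, here is a small worked example that isolates the ratio computed in the inner loop. The gesture names and magnitudes are made up for illustration; only the formula and the 0.25 threshold come from the check itself.

```python
MIN_DISTINCT_RATIO = 0.25  # threshold quoted in the text


def diff_ratio(v1: float, v2: float) -> float:
    """Relative difference of two magnitudes, mirroring the inner loop of the check."""
    a, b = abs(v1), abs(v2)
    return abs(a - b) / max(a, b, 0.01)


# Hypothetical gestures, values invented for this example.
wave = {"arm.rot": 30.0, "body.rot": 5.0}
greet = {"arm.rot": 25.0, "body.rot": 4.0}

ratios = {role: diff_ratio(wave[role], greet[role]) for role in wave}
looks_distinct = any(r > MIN_DISTINCT_RATIO for r in ratios.values())
```

Here `arm.rot` differs by 5/30 (about 17%) and `body.rot` by 1/5 (20%), so neither clears 25% and the pair would be reported as SIMILAR. Raising the wave to `arm.rot = 40` gives 15/40 (37.5%), which passes. One consequence of taking `abs()` first is that sign is ignored: a +30 and a -30 rotation on the same role have a ratio of 0.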
The body-relative scale check
The companion check enforces visibility: any translation needs to be at least 2% of body height to register on a phone screen at 50% zoom. A bounce of 2 pixels on a 1024-pixel artboard is invisible:
```python
# scripts/gesture_audit.py:238 — _check_body_relative_scale
idle_pct = (idle_bob / BODY_HEIGHT_PX) * 100
if idle_pct < MIN_VISIBLE_TRANSLATION_PCT:  # 2.0 = 2% of body height
    issues.append(
        f"SCALE: idle root.y = {idle_bob:.0f}px = {idle_pct:.1f}% of body height. "
        f"Need >{MIN_VISIBLE_TRANSLATION_PCT}% to be visible at phone scale.")
```
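The arithmetic behind the scale check is easy to verify by hand. The sketch below assumes a body height of 600 px purely for illustration (the real `BODY_HEIGHT_PX` isn't shown), and the helper names are hypothetical.

```python
MIN_VISIBLE_TRANSLATION_PCT = 2.0  # threshold quoted in the text
BODY_HEIGHT_PX = 600.0             # assumed value, for illustration only


def translation_pct(translation_px: float,
                    body_height_px: float = BODY_HEIGHT_PX) -> float:
    """Translation as a percentage of body height."""
    return (translation_px / body_height_px) * 100


def min_visible_px(body_height_px: float = BODY_HEIGHT_PX) -> float:
    """Smallest translation, in pixels, that reaches the visibility threshold."""
    return body_height_px * MIN_VISIBLE_TRANSLATION_PCT / 100


def is_visible(translation_px: float) -> bool:
    """True when the translation is not flagged by the SCALE rule."""
    return translation_pct(translation_px) >= MIN_VISIBLE_TRANSLATION_PCT
```

With a 600 px body, an 8 px idle bob is about 1.3% of body height and gets flagged; the bob needs to reach 12 px before it counts as visible.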
Two thresholds I'd defend in an interview: MIN_DISTINCT_RATIO = 0.25 (perceptual minimum for "obviously different motion") and MIN_VISIBLE_TRANSLATION_PCT = 2.0 (perceptual minimum for "visible at phone scale"). Both are tunable; both are in code, not hidden in a config.
§3 Geometry-aware validators
"Geometry-aware" is the layer of validators that treats the SVG as a body, not just shapes. They use the manifest's parts_metadata (pivot points, bounding boxes, parent-child relationships) to ask body-anatomy questions:
- rest_geometry.py: extracts centerline points and reference geometry from parts_metadata. The other validators use this as their ground truth for "where is the head, where are the eyes, where do the legs attach."
- containment_rules.py: pupils must stay inside the eyes. Ears must stay attached to the skull. Limbs must stay within the artboard. A cross-eyed pupil bug? This validator catches it before export.
- vertical_rules.py: head above chest, hips below shoulders. Catches an animal that flips upside down because someone inverted a Y axis.
- symmetry_rules.py: bilateral symmetry tolerance for paired parts (ears, eyes, arms). Default tolerance is 12% (DEFAULT_MIRROR_TOLERANCE_RATIO = 0.12); a bilateral mismatch beyond that is flagged as "this animal looks lopsided."
- motion_envelope_validators.py: per-gesture rotation limits, pupil containment during eye motion. Reads from gesture_pattern_specs.POSE_LIMITS dynamically, so limits are config, not constants.
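Two of these rules reduce to simple box geometry. The sketch below assumes each parts_metadata entry can be reduced to an axis-aligned bounding box and that the centerline is a single x coordinate. The `Box` type and the function names are hypothetical, and the real validators may measure things differently.

```python
from dataclasses import dataclass

DEFAULT_MIRROR_TOLERANCE_RATIO = 0.12  # value quoted in the text


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box; an assumed shape for a parts_metadata entry."""
    x: float
    y: float
    w: float
    h: float

    @property
    def cx(self) -> float:
        """Horizontal center of the box."""
        return self.x + self.w / 2


def contains(outer: Box, inner: Box) -> bool:
    """Containment rule: the inner box (pupil) lies fully inside the outer box (eye)."""
    return (outer.x <= inner.x and outer.y <= inner.y
            and inner.x + inner.w <= outer.x + outer.w
            and inner.y + inner.h <= outer.y + outer.h)


def mirror_mismatch(left: Box, right: Box, centerline_x: float) -> float:
    """Relative difference in distance from the centerline for a paired part."""
    d_left = abs(centerline_x - left.cx)
    d_right = abs(right.cx - centerline_x)
    return abs(d_left - d_right) / max(d_left, d_right, 1e-9)


def is_symmetric(left: Box, right: Box, centerline_x: float,
                 tolerance: float = DEFAULT_MIRROR_TOLERANCE_RATIO) -> bool:
    """Symmetry rule: paired parts sit at similar distances from the centerline."""
    return mirror_mismatch(left, right, centerline_x) <= tolerance
```

For example, ears centered 70 px and 90 px from the centerline mismatch by 20/90 (about 22%), which exceeds the 12% tolerance and would be flagged as lopsided.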
§4 The full validator catalog
Thirteen validators in new_pipeline/validators/, organized by what they care about:
| Validator | Layer | What it checks |
|---|---|---|
| pack_preflight.py | Structural | Required manifest fields, group existence, parts_metadata presence |
| export_gate.py | Structural | Pre-export checks: group_map valid, all referenced groups exist, VM contracts met |
| conflict_rules.py | Structural | Role naming + ID conflicts in group_map |
| mouth_shape_validator.py | Structural | Viseme coverage: all 6 mouth shapes (closed / narrow / open / wide / rounded / smile) |
| riv_structural.py | Structural | .riv file integrity: state machines, inputs, animations present |
| rest_geometry.py | Geometric | Centerline + reference geometry extraction from parts_metadata |
| containment_rules.py | Geometric | Pupils-in-eyes, ears-on-skull, limbs-in-artboard |
| vertical_rules.py | Geometric | Anatomical vertical order: head above chest, hips below shoulders |
| symmetry_rules.py | Geometric | Bilateral symmetry tolerance (paired parts diverge by <12%) |
| motion_envelope_validators.py | Motion | Per-gesture rotation limits, pupil containment during animation |
| visual_qa.py | Aesthetic | Weighted aggregator: Style 35% / Symmetry 25% / Vertical 20% / Containment 20% |
| style_similarity.py | Aesthetic | Cosine similarity of candidate manifest to reference manifest on anchor roles |
| regression_snapshot.py | Regression | Frozen 2026-04-13 gesture defaults, which prevent circular eval dependencies |
Plus 40+ pytest test files in new_pipeline/tests/ exercising the validators against fixtures (canonical good manifest, bad-geometry manifest, missing-roles manifest, bad-visemes manifest, fox-legacy manifest, penguin-canonicalized manifest).
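The two aesthetic rows are the most formula-like entries in the table, so here is a rough sketch of both. It assumes each manifest can be reduced to one number per anchor role and that every component score is already normalized to the range 0 to 1. Neither assumption is confirmed by the source, and the function names are hypothetical; only the weights come from the table.

```python
import math

# Weights quoted in the catalog table.
WEIGHTS = {"style": 0.35, "symmetry": 0.25, "vertical": 0.20, "containment": 0.20}


def cosine_similarity(a: dict[str, float], b: dict[str, float]) -> float:
    """Cosine similarity over the roles both manifests define (assumed representation)."""
    roles = sorted(set(a) & set(b))
    dot = sum(a[r] * b[r] for r in roles)
    norm_a = math.sqrt(sum(a[r] ** 2 for r in roles))
    norm_b = math.sqrt(sum(b[r] ** 2 for r in roles))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # no shared signal to compare
    return dot / (norm_a * norm_b)


def visual_qa_score(components: dict[str, float]) -> float:
    """Weighted sum of component scores, each assumed to lie in [0, 1]."""
    missing = WEIGHTS.keys() - components.keys()
    if missing:
        raise ValueError(f"missing components: {sorted(missing)}")
    return sum(WEIGHTS[name] * components[name] for name in WEIGHTS)
```

Because style carries the largest weight, a candidate with perfect geometry but a style similarity of 0.8 already loses 7 points out of 100 before any other rule is considered.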
§5 What a real evaluation report looks like
From workbench/beta-baseline-eval-report.json — an actual run on the chick baseline, captured during one of the iteration cycles:
- missing runtime tokens: preflight_failed (safety hard-fail)
- idle must be calmest: idle.tail.rot=8 ≥ walk_in.tail.rot=8 (semantic spec violation)
- idle must be calmest: idle.ear.rot=8 ≥ thinking.ear.rot=8
- idle must be calmest: idle.trunk.rot=10 ≥ helping.trunk.rot=10
- (+ 6 more "idle must be calmest" violations across tail/ear/trunk roles)
The report is JSON. The run is reproducible. The diff between two runs is a manifest hash plus an issue list — exactly the shape you want for "did this change make things better or worse?"
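The "idle must be calmest" issues suggest a simple semantic rule: for every role idle shares with another gesture, idle's magnitude must be strictly smaller. A minimal sketch of such a check follows, assuming gestures are stored as a mapping of gesture name to role magnitudes. The function name and data shape are guesses; the message format copies the report.

```python
def check_idle_is_calmest(gestures: dict[str, dict[str, float]]) -> list[str]:
    """Flag every role where idle is at least as large as another gesture."""
    issues = []
    idle = gestures.get("idle", {})
    for name, params in gestures.items():
        if name == "idle":
            continue
        for role, idle_value in idle.items():
            # Equal magnitudes count as a violation, matching the "≥" in the report.
            if role in params and abs(idle_value) >= abs(params[role]):
                issues.append(
                    f"idle must be calmest: idle.{role}={idle_value:g} "
                    f"≥ {name}.{role}={params[role]:g}")
    return issues


# Values taken from the report excerpt; "thinking.ear.rot" is raised to show a pass.
example = {
    "idle": {"tail.rot": 8, "ear.rot": 8},
    "walk_in": {"tail.rot": 8},
    "thinking": {"ear.rot": 12},
}
issues = check_idle_is_calmest(example)
```

In this example only the tail is reported, because idle and walk_in tie at 8 while the thinking ear at 12 clearly exceeds idle.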
An append-only log at workbench/gesture_postwire_results.tsv tracks every iteration's score across animals. A score regression blocks the change.
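As a sketch of how an append-only score log with a regression gate could work, the snippet below appends one row per run and reports whether the new total is lower than the previous total for the same animal. The column names and helper functions are assumptions; the actual TSV layout isn't shown.

```python
import csv
from pathlib import Path
from typing import Optional

FIELDS = ["timestamp", "animal", "manifest_hash", "total"]  # assumed columns


def last_total(log_path: Path, animal: str) -> Optional[float]:
    """Most recent logged total for an animal, or None if it has no rows yet."""
    if not log_path.exists():
        return None
    with log_path.open(newline="") as fh:
        rows = [r for r in csv.DictReader(fh, delimiter="\t") if r["animal"] == animal]
    return float(rows[-1]["total"]) if rows else None


def append_result(log_path: Path, row: dict) -> bool:
    """Append one row; return True if it scores below the animal's previous run."""
    previous = last_total(log_path, row["animal"])
    write_header = not log_path.exists()
    # Mode "a" only ever adds rows, so earlier history is never rewritten.
    with log_path.open("a", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=FIELDS, delimiter="\t")
        if write_header:
            writer.writeheader()
        writer.writerow(row)
    return previous is not None and float(row["total"]) < previous
```

A caller would treat a `True` return as blocking. The regressed row is still written, which keeps the log honest about what was tried.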
§6 Tuning today, tuning tomorrow
Today. Gesture defaults live in scripts/gesture_defaults.py. They're hand-edited based on audit feedback — when a score regresses, I look at the issues, adjust the defaults, re-run. Audit-driven iteration. It works, but it's manual and the search direction is intuition.
Tomorrow. The 4-dimension scorer is the right shape for an Optuna sweep: it returns a single scalar (total score), the parameters to search are bounded (the ~30 magnitudes per gesture × per-animal scaling overrides), and the safety dimension is a clean penalty. The next iteration loop is hyperparameter optimization over the scorer:
- Objective: maximize total score (weighted across 4 dimensions, with safety as a hard penalty)
- Search space: per-gesture magnitudes (arm.rot, root.y, body.rot, etc.) within validator-allowed envelopes
- Constraints: physical limits from POSE_LIMITS, regression snapshot floor (no run can score worse than the 2026-04-13 baseline)
- Why now and not earlier: the validators had to come first. You can't optimize without an objective function. With the validators in place, optimization is the natural next step.
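To show the shape of that loop without depending on Optuna, here is a stdlib random search standing in for the sweep. Everything concrete in it is a placeholder: the search space bounds, the baseline floor of 60, and above all `toy_total_score`, which is not the real scorer and exists only so the example runs.

```python
import random

# Hypothetical envelopes: parameter name -> (low, high). Real bounds would come from POSE_LIMITS.
SEARCH_SPACE = {"arm.rot": (0.0, 45.0), "root.y": (0.0, 30.0), "body.rot": (0.0, 15.0)}
BASELINE_FLOOR = 60.0  # hypothetical stand-in for the regression snapshot score


def toy_total_score(params: dict[str, float]) -> float:
    """Placeholder objective, NOT the real scorer: best at mid-envelope values."""
    score = 100.0
    for name, (low, high) in SEARCH_SPACE.items():
        mid = (low + high) / 2
        score -= 40.0 * abs(params[name] - mid) / (high - low)
    return score


def sweep(objective, n_trials: int = 200, seed: int = 0):
    """Sample within the envelopes and keep the best trial at or above the floor."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_trials):
        params = {name: rng.uniform(low, high)
                  for name, (low, high) in SEARCH_SPACE.items()}
        score = objective(params)
        if score < BASELINE_FLOOR:
            continue  # regression floor: never accept a run below the baseline
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


best_params, best_score = sweep(toy_total_score)
```

With Optuna installed, the sampling loop would map onto `optuna.create_study(direction="maximize")` and `study.optimize(objective, n_trials=...)`, with each parameter drawn via `trial.suggest_float(name, low, high)`. A real objective would also need to return a failing score whenever the safety dimension is 0, so the sampler learns to stay inside the envelopes.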