Problem: "Does this gesture look right?" was previously a human-eyeballs-it question. With 10 animals × 9 gestures, that doesn't scale. Bugs (360° arm rotation, cross-eyed pupils, idle bouncing more than walk) shipped because tests didn't catch them.
What I built:
  • 4-dimension gesture score (Spec Compliance / Silhouette / Distinctiveness / Safety) running per export
  • 13 validators turning animation rules into testable contracts — geometry, motion, viseme coverage, structural integrity
  • Append-only score log so I can see if a manifest change made things better or worse, not just whether it passed
  • The objective function is in place; the next experiment loop is parameter sweeps over it
Result: Bugs that used to show up in the previewer now fail in the test before the previewer opens. Score is a number; tuning is a process.
Stack: Python (pytest, custom scorer), JSON reports, append-only TSV logs. Optuna sweeps queued as the next iteration loop.
Learn to Read · Rive Animation · Evaluation

Validators as testable contracts

Animation rules — "idle should be the calmest gesture", "celebrating should be the biggest", "pupils stay inside eyes" — turned into validators that fail in CI before they fail on the kid's screen. Plus a 4-dimension score that turns "does it look right?" into a number.

§1 The 4-dimension score

Every wired .riv goes through a post-wire scorer that returns a structured report — score per dimension, every issue itemized, manifest hash for diffability:

  • Spec Compliance (/25): Manifest preflight + semantic rules (idle calmest, celebrating biggest)
  • Silhouette (/25): Translations visible at 50% zoom on a 300dp phone canvas
  • Distinctiveness (/25): Any two gestures sharing 2+ parameters differ by more than 25% on at least one of them
  • Safety (/25): Pose limits, runtime contract, bounds; hard-fail (0 or 25)

# scripts/gesture_postwire_eval.py:304

dimensions = {
    "spec_compliance": DimensionScore(
        score=_score_from_penalty(25, spec_issues, 10),
        max_score=25,
        issues=spec_issues,        # manifest preflight + semantic checks
    ),
    "silhouette": DimensionScore(
        score=_score_from_penalty(25, silhouette_issues, 6),
        max_score=25,
        issues=silhouette_issues,  # SCALE: translation visibility at phone zoom
    ),
    "distinctiveness": DimensionScore(
        score=_score_from_penalty(25, distinct_issues, 8),
        max_score=25,
        issues=distinct_issues,    # SIMILAR: shared params differ <25%
    ),
    "safety": DimensionScore(
        score=25 if not safety_issues else 0,    # hard fail
        max_score=25,
        issues=safety_issues,      # LIMIT/ILLEGAL + bounds
    ),
}

Three of the four dimensions are penalty-from-25 with diminishing returns per issue. Safety is a hard gate: any breach (a pose outside its envelope, a missing runtime token) drops it to 0. You can ship a 70/100 with safety = 25; you cannot ship anything with safety = 0.
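
The gate in that last sentence can be sketched in a few lines. This is an illustration only: `total_score` and `is_shippable` are hypothetical helper names, not functions from scripts/gesture_postwire_eval.py, and the dimension scores are plain integers rather than DimensionScore objects.

```python
# Hypothetical helpers illustrating the safety gate; the names are assumptions.
SAFETY_MAX = 25


def total_score(scores: dict[str, int]) -> int:
    """Sum the four dimension scores into a 0-100 total."""
    return sum(scores.values())


def is_shippable(scores: dict[str, int]) -> bool:
    """Safety is a hard gate: anything short of full marks blocks the export."""
    return scores["safety"] == SAFETY_MAX


# Made-up scores: a mediocre but safe export, and a polished but unsafe one.
clean = {"spec_compliance": 10, "silhouette": 19, "distinctiveness": 16, "safety": 25}
breach = {"spec_compliance": 25, "silhouette": 25, "distinctiveness": 25, "safety": 0}

print(total_score(clean), is_shippable(clean))    # 70 True
print(total_score(breach), is_shippable(breach))  # 75 False
```

The second case is the point of the gate: a higher total does not buy back a safety breach.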

§2 The silhouette test

"Silhouette" in this codebase is a semantic test, not a pixel one: it asks whether two gestures would be distinguishable at a glance on a small screen. The check (scored under Distinctiveness as SIMILAR issues): any two gestures sharing 2+ parameters must differ by more than 25% on at least one of them. If they don't, they will look the same:

# scripts/gesture_audit.py:175 — _check_distinctiveness

def _check_distinctiveness(scaled: dict) -> list[str]:
    """Any two gestures sharing >=2 roles should differ by >25% on at least one."""
    issues = []
    gestures = list(scaled.keys())

    for i, g1 in enumerate(gestures):
        for g2 in gestures[i+1:]:
            shared_roles = set(scaled[g1].keys()) & set(scaled[g2].keys())
            if len(shared_roles) < 2:
                continue

            has_distinct = False
            for role in shared_roles:
                v1 = abs(scaled[g1][role])
                v2 = abs(scaled[g2][role])
                max_v = max(v1, v2, 0.01)
                diff_ratio = abs(v1 - v2) / max_v
                if diff_ratio > MIN_DISTINCT_RATIO:    # 0.25 = 25%
                    has_distinct = True
                    break

            if not has_distinct:
                issues.append(f"SIMILAR: {g1} vs {g2} — shared roles {sorted(shared_roles)} "
                              f"all differ by <25%. May look identical.")

    return issues
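
A worked example of the ratio may help. The sketch below reuses the same formula on two made-up gestures; the gesture names and magnitudes are hypothetical, and `diff_ratio` is an illustrative helper, not a function in gesture_audit.py.

```python
MIN_DISTINCT_RATIO = 0.25  # threshold quoted in the text


def diff_ratio(v1: float, v2: float) -> float:
    """Relative difference of two magnitudes, same formula as the check above."""
    a, b = abs(v1), abs(v2)
    return abs(a - b) / max(a, b, 0.01)


# Hypothetical magnitudes for two gestures that share two roles.
waving = {"arm.rot": 40.0, "root.y": 10.0}
cheering = {"arm.rot": 44.0, "root.y": 12.0}

ratios = {role: diff_ratio(waving[role], cheering[role]) for role in waving}
# arm.rot: 4 / 44 is about 0.09, root.y: 2 / 12 is about 0.17, both under 0.25
flagged = all(r <= MIN_DISTINCT_RATIO for r in ratios.values())
print(ratios, flagged)
```

Raising the second gesture's root.y from 12 to 16 gives 4 / 16 = 0.25 against 10... more precisely, 6 / 16 = 0.375, which clears the threshold and removes the SIMILAR flag. One property worth knowing: because the formula compares absolute values, mirrored motions such as +10 and -10 produce a ratio of 0 and count as identical.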

The body-relative scale check

The companion check enforces visibility and is scored under Silhouette as SCALE issues: any translation needs to be at least 2% of body height to register on a phone screen at 50% zoom. A bounce of 2 pixels on a 1024-pixel artboard is invisible:

# scripts/gesture_audit.py:238 — _check_body_relative_scale

idle_pct = (idle_bob / BODY_HEIGHT_PX) * 100
if idle_pct < MIN_VISIBLE_TRANSLATION_PCT:    # 2.0 = 2% of body height
    issues.append(
        f"SCALE: idle root.y = {idle_bob:.0f}px = {idle_pct:.1f}% of body height. "
        f"Need >={MIN_VISIBLE_TRANSLATION_PCT}% to be visible at phone scale.")

Two thresholds I'd defend in an interview: MIN_DISTINCT_RATIO = 0.25 (perceptual minimum for "obviously different motion") and MIN_VISIBLE_TRANSLATION_PCT = 2.0 (perceptual minimum for "visible at phone scale"). Both are tunable; both are in code, not hidden in a config.
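
To make the 2% threshold concrete, here is a small sketch that converts between pixels and percent of body height. The 600px body height is an assumed value for illustration (the text only gives the 1024px artboard), and both helper names are hypothetical.

```python
MIN_VISIBLE_TRANSLATION_PCT = 2.0  # threshold quoted in the text


def translation_pct(offset_px: float, body_height_px: float) -> float:
    """Translation expressed as a percentage of body height."""
    return (abs(offset_px) / body_height_px) * 100


def min_visible_px(body_height_px: float) -> float:
    """Smallest translation, in pixels, that meets the visibility threshold."""
    return body_height_px * MIN_VISIBLE_TRANSLATION_PCT / 100


body_height = 600.0  # assumed body height in artboard pixels

print(min_visible_px(body_height))  # 12.0
print(translation_pct(2, body_height) < MIN_VISIBLE_TRANSLATION_PCT)    # True: too small
print(translation_pct(15, body_height) >= MIN_VISIBLE_TRANSLATION_PCT)  # True: visible
```

Expressing the floor in pixels per animal makes it easier to sanity-check a manifest by eye before the audit runs.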

§3 Geometry-aware validators

"Geometry-aware" is the layer of validators that treats the SVG as a body, not just shapes. They use the manifest's parts_metadata (pivot points, bounding boxes, parent-child relationships) to ask body-anatomy questions:

  • rest_geometry.py: extracts centerline points and reference geometry from parts_metadata. The other validators use this as their ground truth for "where is the head, where are the eyes, where do the legs attach."
  • containment_rules.py: pupils must stay inside the eyes. Ears must stay attached to the skull. Limbs must stay within the artboard. The cross-eyed pupil bug? This validator catches it before export.
  • vertical_rules.py: head above chest, hips below shoulders. Catches an animal that flips upside down because someone inverted a Y axis.
  • symmetry_rules.py: bilateral symmetry tolerance for paired parts (ears, eyes, arms). Default tolerance is 12% (DEFAULT_MIRROR_TOLERANCE_RATIO = 0.12); a bilateral mismatch beyond that flags as "this animal looks lopsided."
  • motion_envelope_validators.py: per-gesture rotation limits, pupil containment during eye motion. Reads from gesture_pattern_specs.POSE_LIMITS dynamically, so limits are config, not constants.
Why this matters as a pipeline pattern: animation bugs aren't just "ugly". They violate body anatomy in deterministic ways. Geometry-aware validators encode the anatomy, which means new animals get the same checks for free as long as their manifest declares parts_metadata.
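
Two of the rules above reduce to short geometric predicates. The sketch below is a guess at their shape, not the validators' code: it assumes eyes and pupils can be approximated as circles and that symmetry is measured as distance from a vertical centerline. `Circle`, `pupil_inside_eye`, and `mirror_mismatch` are hypothetical names; only the 0.12 tolerance comes from the text.

```python
import math
from dataclasses import dataclass

DEFAULT_MIRROR_TOLERANCE_RATIO = 0.12  # value quoted in the text


@dataclass(frozen=True)
class Circle:
    """Assumed simplification of a part's rest geometry."""
    cx: float
    cy: float
    r: float


def pupil_inside_eye(eye: Circle, pupil: Circle, dx: float = 0.0, dy: float = 0.0) -> bool:
    """True if the pupil, shifted by (dx, dy), stays fully inside the eye."""
    dist = math.hypot(pupil.cx + dx - eye.cx, pupil.cy + dy - eye.cy)
    return dist + pupil.r <= eye.r


def mirror_mismatch(centerline_x: float, left_x: float, right_x: float) -> float:
    """Relative difference between the paired parts' distances to the centerline."""
    d_left = abs(centerline_x - left_x)
    d_right = abs(right_x - centerline_x)
    return abs(d_left - d_right) / max(d_left, d_right, 1e-9)


eye = Circle(100, 100, 20)
pupil = Circle(100, 100, 8)
print(pupil_inside_eye(eye, pupil, dx=10))  # True: 10 + 8 <= 20
print(pupil_inside_eye(eye, pupil, dx=15))  # False: 15 + 8 > 20

# Ears at x=412 and x=632 around a centerline at x=512: 100px vs 120px.
print(mirror_mismatch(512, 412, 632) > DEFAULT_MIRROR_TOLERANCE_RATIO)  # True: lopsided
```

Running the containment predicate over every keyframe offset of an eye animation, rather than only the rest pose, is what would catch a pupil that leaves the eye mid-gesture.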

§4 The full validator catalog

Thirteen validators in new_pipeline/validators/, organized by what they care about:

Validator                      Layer       What it checks
pack_preflight.py              Structural  Required manifest fields, group existence, parts_metadata presence
export_gate.py                 Structural  Pre-export checks: group_map valid, all referenced groups exist, VM contracts met
conflict_rules.py              Structural  Role naming + ID conflicts in group_map
mouth_shape_validator.py       Structural  Viseme coverage: all 6 mouth shapes (closed / narrow / open / wide / rounded / smile)
riv_structural.py              Structural  .riv file integrity: state machines, inputs, animations present
rest_geometry.py               Geometric   Centerline + reference geometry extraction from parts_metadata
containment_rules.py           Geometric   Pupils-in-eyes, ears-on-skull, limbs-in-artboard
vertical_rules.py              Geometric   Head above chest, hips below shoulders (anatomical vertical order)
symmetry_rules.py              Geometric   Bilateral symmetry tolerance (paired parts diverge by <12%)
motion_envelope_validators.py  Motion      Per-gesture rotation limits, pupil containment during animation
visual_qa.py                   Aesthetic   Weighted aggregator: Style 35% / Symmetry 25% / Vertical 20% / Containment 20%
style_similarity.py            Aesthetic   Cosine similarity of candidate manifest to reference manifest on anchor roles
regression_snapshot.py         Regression  Frozen 2026-04-13 gesture defaults; prevents circular eval dependencies

Plus 40+ pytest test files in new_pipeline/tests/ exercising the validators against fixtures (canonical good manifest, bad-geometry manifest, missing-roles manifest, bad-visemes manifest, fox-legacy manifest, penguin-canonicalized manifest).
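
The two Aesthetic rows in the catalog are the most formula-like, so here is a sketch of what they could compute. The weights are the ones listed for visual_qa.py; everything else is an assumption: sub-scores are taken to be floats in [0, 1], manifests are taken to be already flattened into numeric vectors over the anchor roles, and both function names are hypothetical.

```python
import math

# Weights as listed in the catalog row for visual_qa.py.
WEIGHTS = {"style": 0.35, "symmetry": 0.25, "vertical": 0.20, "containment": 0.20}


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def visual_qa_score(subscores: dict[str, float]) -> float:
    """Weighted sum of the four sub-scores, each assumed to lie in [0, 1]."""
    return sum(WEIGHTS[name] * subscores[name] for name in WEIGHTS)


# Hypothetical anchor-role vectors: the candidate is a scaled copy of the reference.
style = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
score = visual_qa_score(
    {"style": style, "symmetry": 0.8, "vertical": 1.0, "containment": 0.5}
)
print(round(style, 6), round(score, 6))
```

Cosine similarity ignores overall scale, which suits a style comparison across animals of different sizes; whether the real validator normalizes the same way is not stated in the source.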

§5 What a real evaluation report looks like

From workbench/beta-baseline-eval-report.json — an actual run on the chick baseline, captured during one of the iteration cycles:

Round Chick Rive Ready · iteration 1 · 2026-04-10
Total: 70/100
  • Spec Compliance: 20 / 25
  • Silhouette: 25 / 25
  • Distinctiveness: 25 / 25
  • Safety: 0 / 25
Issues:
  • missing runtime tokens: preflight_failed — safety hard-fail
  • idle must be calmest: idle.tail.rot=8 ≥ walk_in.tail.rot=8 — semantic spec violation
  • idle must be calmest: idle.ear.rot=8 ≥ thinking.ear.rot=8
  • idle must be calmest: idle.trunk.rot=10 ≥ helping.trunk.rot=10
  • (+ 6 more "idle must be calmest" violations across tail/ear/trunk roles)

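The report is JSON. The run is reproducible. The diff between two runs is a manifest hash plus an issue list, which is exactly the shape you want for "did this change make things better or worse?" Note that the 70/100 here does not ship: Safety is 0, and the safety gate overrides the total.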

Append-only log at workbench/gesture_postwire_results.tsv tracks every iteration's score across animals. Score regression = blocking.
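
A minimal sketch of how an append-only TSV log and a blocking regression check could fit together. The column names, the "compare against the most recent row for the same animal" rule, and both function names are assumptions for illustration; the source does not show the log's schema.

```python
import csv
from pathlib import Path

FIELDS = ["timestamp", "animal", "manifest_hash", "total"]  # assumed columns


def append_result(log_path: Path, row: dict) -> None:
    """Append one row; write the header only when the file is new."""
    is_new = not log_path.exists()
    with log_path.open("a", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=FIELDS, delimiter="\t")
        if is_new:
            writer.writeheader()
        writer.writerow(row)


def is_regression(log_path: Path, animal: str, new_total: int) -> bool:
    """True if new_total is lower than the animal's most recent logged total."""
    if not log_path.exists():
        return False
    with log_path.open(newline="", encoding="utf-8") as fh:
        rows = [r for r in csv.DictReader(fh, delimiter="\t") if r["animal"] == animal]
    return bool(rows) and new_total < int(rows[-1]["total"])
```

A CI step would call `is_regression` before `append_result` and fail the job when it returns True. Opening the file in "a" mode means earlier rows are never rewritten, which is what keeps the history trustworthy.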

§6 Tuning today, tuning tomorrow

Today. Gesture defaults live in scripts/gesture_defaults.py. They're hand-edited based on audit feedback — when a score regresses, I look at the issues, adjust the defaults, re-run. Audit-driven iteration. It works, but it's manual and the search direction is intuition.

Tomorrow. The 4-dimension scorer is the right shape for an Optuna sweep: it returns a single scalar (total score), the parameters to search are bounded (the ~30 magnitudes per gesture × per-animal scaling overrides), and the safety dimension is a clean penalty. The next iteration loop is hyperparameter optimization over the scorer:

  • Objective: maximize total score (weighted across 4 dimensions, with safety as a hard penalty)
  • Search space: per-gesture magnitudes (arm.rot, root.y, body.rot, etc.) within validator-allowed envelopes
  • Constraints: physical limits from POSE_LIMITS, regression snapshot floor (no run can score worse than 2026-04-13 baseline)
  • Why now and not earlier: the validators had to come first. You can't optimize without an objective function. With the validators in place, optimization is the natural next step.
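
For orientation, this is roughly what such a sweep harness could look like with Optuna's standard API (`create_study`, `Study.optimize`, `Trial.suggest_float`). It is a sketch, not the queued harness: `toy_score` is a placeholder standing in for the post-wire scorer, and the search ranges and the floor value are invented for the example.

```python
# Hypothetical envelopes (low, high) per parameter; real bounds would come
# from POSE_LIMITS and the validators.
SEARCH_SPACE = {
    "wave.arm.rot": (10.0, 60.0),
    "idle.root.y": (2.0, 20.0),
}
BASELINE_FLOOR = 70.0  # invented stand-in for the regression-snapshot score


def toy_score(params: dict[str, float]) -> float:
    """Placeholder objective: 100 at the middle of every range, 50 at the edges."""
    score = 100.0
    for name, (lo, hi) in SEARCH_SPACE.items():
        mid = (lo + hi) / 2
        score -= 50.0 * abs(params[name] - mid) / (hi - lo)
    return score


def objective(trial) -> float:
    """Optuna-style objective: sample inside the envelopes, return the total."""
    params = {
        name: trial.suggest_float(name, lo, hi)
        for name, (lo, hi) in SEARCH_SPACE.items()
    }
    return toy_score(params)


def run_sweep(n_trials: int = 50):
    """Run the study and enforce the regression floor on the best trial."""
    import optuna  # third-party; imported lazily so the sketch loads without it

    study = optuna.create_study(direction="maximize")
    study.optimize(objective, n_trials=n_trials)
    if study.best_value < BASELINE_FLOOR:
        raise RuntimeError("best trial scores below the regression floor")
    return study.best_params, study.best_value
```

Sampling only inside the envelopes keeps every trial within pose limits by construction. In the real scorer a safety breach already zeroes its dimension, so the total would penalize unsafe trials without extra handling in the objective.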
Honest framing: Optuna sweeps are queued, not deployed. The validators landed on 2026-04-13 and the regression snapshot exists; the sweep harness is the next sprint. The interesting engineering claim isn't "I ran a sweep." It's "I built the objective function before the sweep, so the sweep is well-posed when it runs."
A common pattern in ML pipelines: people optimize before they have a credible objective. Then they spend three months tuning hyperparameters against a metric that doesn't predict success. The expensive thing isn't the sweep — it's the eval. I built the eval first.