persona-garden-patch

What Persona Evaluation and Testing Infrastructure Does the Garden Need

Research Question

Production persona systems use quantitative evaluation: Big Five personality scoring, persona vector monitoring, Psychological Stability Index drift detection, pass@k consistency metrics for behavioral scenarios. The garden’s persona architecture has no equivalent evaluation or testing infrastructure. What would persona evaluation look like for the garden’s operational agents — Gardeners, Groundskeepers, Chancellors — whose personas are defined in garden persona nodes and rendered into Claude Code agent files?
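
A clarifying note on the last of those metrics: pass@k presumably follows the standard unbiased estimator from code-generation evaluation, where n responses are sampled per scenario, c of them comply, and the estimate is the probability that at least one of k sampled responses would comply. A consistency reading would instead require all k to comply; which reading the persona work intends is worth pinning down. Under that assumption, the two forms are:

$$
\text{pass@}k \;=\; \mathbb{E}_{\text{scenarios}}\!\left[\,1 - \frac{\binom{n-c}{k}}{\binom{n}{k}}\,\right],
\qquad
\text{all-}k\ \text{consistency} \;=\; \mathbb{E}_{\text{scenarios}}\!\left[\,\frac{\binom{c}{k}}{\binom{n}{k}}\,\right]
$$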

What Is Being Determined

Whether garden personas need formal evaluation at all, and if so, what metrics and methods are appropriate for operational (not conversational) agent personas. The garden’s agents do knowledge management work — creating nodes, triaging content, running commissions — rather than engaging in open-ended conversation. Whether that distinction changes what “persona stability” means is itself one of the questions.

Open Questions

1. Behavioral Consistency Testing: The deterministic personality research ran agents through 50+ diverse scenarios and measured whether behavioral outputs matched persona specifications. Applied to garden agents, the same approach would ask: given a sample commission, does the Gardener behave as its persona declares? Does it “prefer incremental change over large rewrites”? Does it “create ghost links rather than speculative content”? A scenario suite for garden personas would need to define what compliance looks like for each behavioral specification — which requires those specifications to be operationally precise enough to pass or fail (a minimal scenario-suite sketch follows this list).

2. Drift Detection: [[Persona Drift Causes Detection and Prevention]] identifies a key finding from the Assistant Axis research: emotional conversations cause 7.3x drift acceleration compared to neutral task completion. Garden agents operate in extended sessions that occasionally involve ambiguous or contested content decisions. Should there be a mechanism — perhaps a behavioral anchor check at commission boundaries — to detect when a Gardener has drifted from its persona specification during a session? The commission return is already a boundary event; it may be a natural evaluation point (a drift-check sketch follows this list).

3. Role Adherence Metrics: Voice agent evaluation uses role adherence — “whether the bot maintains its persona throughout the conversation.” For garden agents, role adherence might mean something more structural: does the Gardener stay within its scope constraints (not editing outside commission scope)? Does it follow its commit discipline? Does it apply its escalation protocol when it encounters boundary questions? These are measurable — they show up in commit histories and session logs — but currently unmeasured (a scope-adherence sketch follows this list).

4. The Ontological Error Question: Psychometric instruments (Big Five, Myers-Briggs, HEXACO) were developed and validated on human respondents. Research on applying them to LLMs finds that models score consistently, but what an instrument measures in a model may not be the construct it measures in a person. If garden personas are evaluated, what constructs should be measured? And are those constructs better captured by instruments developed specifically for LLM behavioral evaluation than by instruments borrowed from human psychology? The answer may affect whether existing psychometric research transfers to garden persona design.
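
For question 1, a minimal sketch of what a scenario suite could look like, assuming the hard part (check functions precise enough to pass or fail) is solved per specification. `BehavioralSpec`, `Scenario`, `compliance_rates`, and `run_agent` are hypothetical names, not existing garden tooling.

```python
# Hypothetical sketch of a behavioral scenario suite for garden persona checks.
from dataclasses import dataclass
from typing import Callable

@dataclass
class BehavioralSpec:
    """One persona claim made operationally precise enough to pass or fail."""
    claim: str                      # e.g. "prefers incremental change over large rewrites"
    check: Callable[[str], bool]    # judges a session transcript or diff summary

@dataclass
class Scenario:
    """A sample commission plus the behavioral specs it is meant to exercise."""
    prompt: str
    specs: list[BehavioralSpec]

def compliance_rates(scenario: Scenario,
                     run_agent: Callable[[str], str],
                     n: int = 10) -> dict[str, float]:
    """Run the agent n times on one scenario and report per-spec compliance."""
    outputs = [run_agent(scenario.prompt) for _ in range(n)]
    return {spec.claim: sum(spec.check(out) for out in outputs) / n
            for spec in scenario.specs}
```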
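
For question 2, a minimal drift-check sketch for the commission-boundary idea, assuming per-spec compliance on a small anchor scenario set is recorded at session start and re-measured at the commission return. The names and the 0.15 tolerance are illustrative assumptions, not measured values.

```python
# Hypothetical sketch of a behavioral anchor check at a commission boundary.
DRIFT_TOLERANCE = 0.15  # assumed margin; would need calibration against real sessions

def drifted_specs(baseline: dict[str, float],
                  current: dict[str, float],
                  tolerance: float = DRIFT_TOLERANCE) -> list[str]:
    """Anchor specs whose compliance dropped by more than the tolerance."""
    return [spec for spec, base in baseline.items()
            if base - current.get(spec, 0.0) > tolerance]

# Usage at a commission return:
# baseline = {"prefers incremental change": 0.9, "creates ghost links": 0.8}
# current  = {"prefers incremental change": 0.6, "creates ghost links": 0.8}
# drifted_specs(baseline, current) -> ["prefers incremental change"]
```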
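
For question 3, a sketch of the structural reading of role adherence: scope adherence recovered from commit history, assuming the commission declares its scope as path globs and each commit's changed paths are available (e.g. from git log --name-only). Everything here is a hypothetical proxy, not an existing garden metric.

```python
# Hypothetical sketch of scope adherence measured from a commit history.
from fnmatch import fnmatch

def out_of_scope_paths(changed_paths: list[str], scope_globs: list[str]) -> list[str]:
    """Paths touched by a commit that fall outside every declared scope glob."""
    return [p for p in changed_paths
            if not any(fnmatch(p, g) for g in scope_globs)]

def scope_adherence(commits: list[list[str]], scope_globs: list[str]) -> float:
    """Fraction of commits that stayed entirely within the commission scope."""
    clean = sum(1 for paths in commits if not out_of_scope_paths(paths, scope_globs))
    return clean / len(commits) if commits else 1.0

# Example: scope_adherence([["notes/persona/a.md"], ["notes/other/b.md"]],
#                          ["notes/persona/*"]) -> 0.5
```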

Sources

Relations