Skill Evolve
Automated improvement of bento skills and agent prompts using session history as training signal.
Skill-evolve is inspired by Meta-Harness (Stanford IRIS Lab), which searches over model harnesses by proposing candidates, benchmarking them, and tracking a Pareto frontier. We apply the same pattern to our own skills and agent definitions, using real session logs as the evaluation dataset.
The Idea
Skills (SKILL.md) and agent souls (SOUL.md) are hand-written today. They encode operational knowledge — how to review a repo, how to do a PR review, how to structure context for a project. But they're static. They don't learn from whether sessions using them actually went well.
Skill-evolve closes this loop:
Sessions happen → logs accumulate → evaluate outcomes →
propose skill improvements → validate offline → promote winners
The "dataset" is session JSONL files. The "harness" is skill/agent definitions. The "evaluator" is whether improved skills would have produced better outcomes on past sessions. The "proposer" is a Claude Code session that reads failure patterns and drafts improvements.
Architecture
┌───────────────────────────────────────────────────┐
│ Cron (weekly)                                     │
│                                                   │
│ 1. Collect    session logs from ~/.pi/sessions    │
│ 2. Evaluate   score sessions by outcome signal    │
│ 3. Analyze    identify failure patterns           │
│ 4. Propose    generate skill candidates           │
│ 5. Validate   test candidates against held-out    │
│ 6. Promote    update frontier, notify             │
└───────────────────────────────────────────────────┘
          │                            │
          ▼                            ▼
evolution_summary.jsonl       frontier_skills.json
Step 1: Collect
Gather session logs from the past period. Each session JSONL contains the full conversation: user messages, assistant responses, tool calls, tool results, errors. Filter to sessions that used a specific skill or agent.
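A minimal sketch of the collection step, assuming one session per `*.jsonl` file and assuming that events which load a skill carry a `skill` field. The real session schema is not specified here, so both the field name and the `collect` helper are hypothetical.

```python
import json
from pathlib import Path
from typing import Iterator


def load_session(path: Path) -> list[dict]:
    """Parse one session JSONL file, skipping blank or malformed lines."""
    events = []
    with path.open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            try:
                event = json.loads(line)
            except json.JSONDecodeError:
                continue  # tolerate a truncated or corrupt line
            if isinstance(event, dict):
                events.append(event)
    return events


def uses_skill(events: list[dict], skill: str) -> bool:
    # Assumption: a skill-tagged event looks like {"skill": "review-repo", ...}.
    return any(e.get("skill") == skill for e in events)


def collect(sessions_dir: Path, skill: str, since_ts: float) -> Iterator[tuple[Path, list[dict]]]:
    """Yield (path, events) for sessions modified after since_ts that used the skill."""
    for path in sorted(sessions_dir.glob("*.jsonl")):
        if path.stat().st_mtime < since_ts:
            continue
        events = load_session(path)
        if uses_skill(events, skill):
            yield path, events
```

Filtering on file modification time is only a stand-in for "the past period"; a timestamp inside the log would be more reliable if the format provides one.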
Step 2: Evaluate
Score each session. The outcome signal could be:
- Completion — did the session reach its goal? (heuristic: did the user say "thanks", approve a PR, or move on to a new topic without frustration?)
- Efficiency — tool call count, token usage, number of correction cycles ("no not that", "try again")
- Error rate — how many tool calls failed, how many retries
- User corrections — explicit feedback like "don't do X" or "that's wrong"
This is the hardest part. Unlike meta-harness's text classification accuracy or terminal-bench pass rate, our signal is noisy and subjective. Early versions should use simple heuristics; later versions can use an LLM-as-judge.
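One way a first heuristic scorer might look is sketched below. The event fields (`role`, `content`, `type`, `is_error`), the phrase lists, and the weights are all assumptions chosen for illustration, not a calibrated metric.

```python
# Phrases are illustrative; a real list would be tuned on actual sessions.
CORRECTION_PHRASES = ("no not that", "no, not that", "try again", "that's wrong", "don't do")
SUCCESS_PHRASES = ("thanks", "thank you")


def score_session(events: list[dict]) -> float:
    """Return a score in [0, 1]; higher means the session likely went well."""
    user_msgs = [str(e.get("content", "")).lower() for e in events if e.get("role") == "user"]
    tool_results = [e for e in events if e.get("type") == "tool_result"]

    corrections = sum(any(p in m for p in CORRECTION_PHRASES) for m in user_msgs)
    errors = sum(1 for e in tool_results if e.get("is_error"))

    correction_rate = corrections / len(user_msgs) if user_msgs else 0.0
    error_rate = errors / len(tool_results) if tool_results else 0.0
    ended_well = bool(user_msgs) and any(p in user_msgs[-1] for p in SUCCESS_PHRASES)

    # Hypothetical weights: start neutral, reward a positive ending,
    # penalize user corrections and failed tool calls.
    score = 0.5 + (0.3 if ended_well else 0.0) - 0.3 * correction_rate - 0.2 * error_rate
    return max(0.0, min(1.0, score))
```

Keeping the scorer a pure function of the event list makes it easy to swap in an LLM-as-judge later without touching the rest of the pipeline.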
Step 3: Analyze
A proposer session reads:
- evolution_summary.jsonl: what skill variants have been tried, and what worked
- frontier_skills.json: current best skill versions
- Session logs with low scores: what went wrong?
- Session logs with high scores: what patterns should be preserved?
The proposer identifies recurring failure modes. Examples:
- "The review-repo skill doesn't tell the agent to check for monorepo workspace configs, leading to missed context in 4/12 sessions"
- "The reviewer agent keeps suggesting changes to generated files because the soul doesn't mention ignoring dist/"
- "Project context injection is missing repo branch conventions, causing agents to push to wrong branches"
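Before the proposer runs, something has to decide which sessions it reads. The sketch below picks the lowest-scoring and highest-scoring sessions from a score map; the thresholds, the `k` limit, and the function name are hypothetical.

```python
def select_for_proposer(scores: dict[str, float], k: int = 5,
                        low: float = 0.4, high: float = 0.7) -> dict[str, list[str]]:
    """Pick up to k failing sessions and up to k exemplary sessions by score."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1])  # worst first
    failures = [sid for sid, s in ranked if s < low][:k]
    exemplars = [sid for sid, s in reversed(ranked) if s > high][:k]
    return {"failures": failures, "exemplars": exemplars}
```

Sessions in the middle band are left out on purpose: they carry the least signal about what to fix or what to preserve.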
Step 4: Propose
The proposer generates skill candidates — concrete diffs to SKILL.md or SOUL.md files. Each candidate has:
{
"name": "review-repo-v3",
"base": "skills/review-repo/SKILL.md",
"hypothesis": "Adding monorepo detection step will reduce missed-context errors",
"diff_summary": "Added workspace detection step between structure and architecture sections",
"file": "candidates/review-repo-v3/SKILL.md"
}
Candidates are stored in a staging area, never applied directly.
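A staging helper could enforce the "never applied directly" rule in code. This sketch mirrors the candidate fields shown above; the `candidate.json` metadata filename and the `stage_candidate` helper are assumptions.

```python
import json
from dataclasses import asdict, dataclass
from pathlib import Path


@dataclass
class Candidate:
    name: str
    base: str
    hypothesis: str
    diff_summary: str
    file: str


def stage_candidate(root: Path, cand: Candidate, skill_text: str) -> Path:
    """Write a candidate skill under root/candidates/, never over a live skill."""
    # Simple guard: the relative path must start with "candidates".
    # A stricter version would also resolve ".." segments.
    if Path(cand.file).parts[:1] != ("candidates",):
        raise ValueError("candidate files must live under candidates/")
    target = root / cand.file
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(skill_text, encoding="utf-8")
    # Hypothetical: keep the metadata next to the staged file.
    meta_path = target.parent / "candidate.json"
    meta_path.write_text(json.dumps(asdict(cand), indent=2), encoding="utf-8")
    return target
```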
Step 5: Validate
Replay past sessions against candidates. For each candidate skill:
- Take N held-out sessions that used the base skill
- Run them through a simulated evaluation: "given this session's initial request and the candidate skill, would the outcome improve?"
- This can be an LLM-as-judge comparison: show both the original session trace and what the new skill would have produced, ask which is better
This is offline validation — no real sessions are affected.
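The validation loop can be sketched independently of how the judge is implemented. Here the judge is an injected callable, so the loop can be tested with a stub; the `Judge` signature and the verdict strings are assumptions.

```python
from typing import Callable

# Hypothetical judge signature: (request, baseline_skill, candidate_skill)
# returns "candidate", "baseline", or anything else for a tie.
Judge = Callable[[str, str, str], str]


def validate(requests: list[str], baseline_skill: str,
             candidate_skill: str, judge: Judge) -> dict:
    """Compare a candidate against the baseline on held-out session requests."""
    wins = losses = ties = 0
    for request in requests:
        verdict = judge(request, baseline_skill, candidate_skill)
        if verdict == "candidate":
            wins += 1
        elif verdict == "baseline":
            losses += 1
        else:
            ties += 1
    n = len(requests)
    return {"wins": wins, "losses": losses, "ties": ties,
            "win_rate": wins / n if n else 0.0}
```

A result such as 3 wins out of 5 requests corresponds to the "improved on 3/5 held-out sessions" wording used in the promotion notice.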
Step 6: Promote
If a candidate beats the current frontier on held-out sessions:
- Update frontier_skills.json
- Append to evolution_summary.jsonl
- Notify via Telegram/Slack: "Skill review-repo updated: added monorepo detection (improved on 3/5 held-out sessions)"
- The actual skill file is NOT auto-updated — the human reviews and applies
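The state updates for promotion might look like the sketch below. It logs every evaluated candidate and only touches the frontier when the score improves. The `promoted` log field and the function signature are assumptions, and notification is left out because the transport is not specified here.

```python
import json
from pathlib import Path


def promote(state_dir: Path, skill: str, version: str, score: float,
            candidate_path: str, timestamp: str) -> bool:
    """Log the candidate; update the frontier only if it beats the current best."""
    frontier_path = state_dir / "frontier_skills.json"
    summary_path = state_dir / "evolution_summary.jsonl"

    frontier = {}
    if frontier_path.exists():
        frontier = json.loads(frontier_path.read_text(encoding="utf-8"))

    current = frontier.get(skill, {}).get("score")
    delta = None if current is None else round(score - current, 4)
    promoted = current is None or score > current

    # Append-only log: one line per evaluated candidate.
    record = {"skill": f"{skill}-{version}", "score": score, "delta": delta,
              "promoted": promoted, "timestamp": timestamp}
    with summary_path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")

    if promoted:
        frontier[skill] = {"version": version, "score": score,
                           "candidate_path": candidate_path,
                           "promoted_at": timestamp}
        # Write to a temp file first so a crash cannot leave a half-written frontier.
        tmp = frontier_path.with_name(frontier_path.name + ".tmp")
        tmp.write_text(json.dumps(frontier, indent=2), encoding="utf-8")
        tmp.replace(frontier_path)
    return promoted
```

Note that this only updates the state files. The live SKILL.md is never written, which keeps the human review step intact.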
State Files
Following meta-harness conventions:
evolution_summary.jsonl
One line per evaluated candidate:
{
"iteration": 3,
"skill": "review-repo-v3",
"base_skill": "skills/review-repo/SKILL.md",
"hypothesis": "Adding monorepo detection reduces missed-context errors",
"score": 0.73,
"delta": 0.12,
"sessions_evaluated": 8,
"outcome": "0.73 (+0.12)",
"timestamp": "2026-05-02T10:00:00Z"
}
frontier_skills.json
Current best version of each skill:
{
"review-repo": {
"version": "v3",
"score": 0.73,
"candidate_path": "candidates/review-repo-v3/SKILL.md",
"promoted_at": "2026-05-02T10:00:00Z"
},
"reviewer": {
"version": "v1",
"score": 0.61,
"candidate_path": null
}
}
What Gets Evolved
| Artifact | Location | What changes |
|---|---|---|
| Skills | skills/*/SKILL.md | Workflow steps, anti-patterns, output format |
| Agent souls | agents/*/SOUL.md | Focus areas, tone, review criteria |
| Project context | ~/.projects/*/project.json | Workflow guidelines, injected instructions |
| System prompts | Extension-injected prompts | Context framing, constraints |
Evaluation Signals
The quality of skill-evolve depends entirely on the evaluation signal. Possible approaches, from simple to sophisticated:
Tier 1: Heuristic (start here)
- Session length vs task complexity (shorter is better for simple tasks)
- Number of user corrections / "no" / "try again" messages
- Tool error rate
- Whether the session ended with apparent success
Tier 2: LLM-as-Judge
- Show a judge model the session transcript and ask: "Rate this session 1-5 on task completion, efficiency, and user satisfaction"
- Compare two sessions side-by-side: "Which skill produced a better outcome?"
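For the side-by-side comparison, a common precaution is to randomize which transcript is shown first, since judge models can favor a position. The sketch below builds such a prompt and maps the reply back to the original session; the prompt wording and helper names are assumptions, and the call to the judge model itself is out of scope.

```python
import random
from typing import Optional


def build_pairwise_prompt(task: str, transcript_a: str, transcript_b: str,
                          rng: random.Random) -> tuple[str, dict[str, str]]:
    """Return a judge prompt plus a map from displayed number to original label."""
    order = ["a", "b"]
    rng.shuffle(order)  # hide which session is the baseline
    texts = {"a": transcript_a, "b": transcript_b}
    labels = {"1": order[0], "2": order[1]}
    prompt = (
        f"Task: {task}\n\n"
        f"Session 1:\n{texts[order[0]]}\n\n"
        f"Session 2:\n{texts[order[1]]}\n\n"
        "Which session produced a better outcome? Answer with 1 or 2 only."
    )
    return prompt, labels


def parse_verdict(reply: str, labels: dict[str, str]) -> Optional[str]:
    """Map the judge's reply to "a" or "b"; None if the reply is unusable."""
    return labels.get(reply.strip()[:1])
```

Passing in a seeded `random.Random` keeps evaluation runs reproducible.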
Tier 3: Outcome-linked
- If using Linear: did the linked issue get closed?
- If doing PR review: was the review accepted without revisions?
- If doing code changes: did CI pass on the first push?
Constraints
- Human in the loop. Candidates are proposed but never auto-applied. The user reviews and promotes.
- Anti-overfitting. Skills must remain general-purpose. No session-specific patches. The same anti-overfitting rules from Meta-Harness apply: no hardcoded knowledge about specific repos, tasks, or users.
- Held-out split. Always evaluate on sessions the proposer hasn't seen. Otherwise you're just memorizing failure modes.
- Slow cadence. Weekly or biweekly. Skills shouldn't churn — users need stability. A skill that changes every day is worse than one that's slightly suboptimal.
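The held-out constraint is easiest to keep if the split is deterministic, so a session never moves between the proposer's view and the evaluation set from one weekly run to the next. A minimal sketch, assuming each session has a stable string ID and using a hypothetical 30% held-out fraction:

```python
import hashlib


def is_held_out(session_id: str, fraction: float = 0.3) -> bool:
    """Assign a session to the held-out set based on a hash of its ID."""
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # uniform in [0, 2**64)
    return bucket < int(fraction * 2**64)


def split(session_ids: list[str], fraction: float = 0.3) -> tuple[list[str], list[str]]:
    """Return (visible_to_proposer, held_out) without any stored state."""
    visible = [s for s in session_ids if not is_held_out(s, fraction)]
    held_out = [s for s in session_ids if is_held_out(s, fraction)]
    return visible, held_out
```

Because membership depends only on the ID, new sessions can be added each week without reshuffling earlier assignments.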
Dependencies
Before building skill-evolve, we need:
- runtime_wrapper: programmatic Claude Code / pi invocation with structured logging (adapted from meta-harness claude_wrapper.py)
- Session scoring: even a basic heuristic scorer for session JSONL files
- Candidate staging: a directory structure for skill variants that doesn't interfere with live skills
Future: Continuous Learning
The end state is a closed loop where bento gets better at its job automatically:
User works with pi → sessions logged → skill-evolve runs weekly →
better skills proposed → human reviews → skills updated →
next week's sessions are better → repeat
This is Layer 3 of the vision (Company Brain) applied to bento itself. The system doesn't just store knowledge; it improves its own ability to apply knowledge.
References
- Meta-Harness paper — the framework this is based on
- Meta-Harness repo — reference implementation
- VISION.md — how this fits into the broader bento architecture

