---
name: autoanything-optimizer
description: |
  Set up and run a DarwinDerby optimization problem for any artifact that can be improved by iterative agent edits and evaluated by either a normal Python scoring function or an LLM-based judge. Supports installing DarwinDerby from PyPI, scaffolding a new problem, wiring a Zo-based optimization agent, and establishing an initial baseline.
compatibility: Created for Zo Computer
metadata:
  author: rob.zo.computer
  version: 0.1
---
# autoanything-optimizer

General skill for turning an arbitrary optimization target into a DarwinDerby problem.

This skill is for situations where the user wants to optimize some mutable artifact over repeated agent iterations:
- an essay or manifesto
- a prompt
- a config file
- a piece of code
- a solver
- a benchmark target
- any file-backed state with a score function

The core loop is:
1. define mutable state in `state/`
2. define a scoring function in `scoring/score.py`
3. run a local agent repeatedly
4. keep only improvements

## What this skill helps with

- Install **DarwinDerby** locally from PyPI
- Scaffold a new problem with `derby init`
- Turn an existing file into the initial `state/` artifact
- Write a first-pass `problem.yaml`
- Write a first-pass scoring function
- Optionally wire scoring to an **LLM judge**
- Optionally wire a **Zo agent** that delegates edits to Zo through the `/zo/ask` API
- Validate the project and establish a baseline score
- Start cautiously with a tiny number of iterations, then scale up once the loop is trustworthy

## When to use this skill

Use this skill when the user wants a black-box optimization loop where:
- agents can change files
- the system can score the result
- only better results should persist

Good examples:
- “optimize this essay into a manifesto”
- “tune this prompt against a test set”
- “improve this config for some measurable objective”
- “set up an AutoAnything problem for this repo”

## Inputs the user should provide

At minimum, figure out:
- **what artifact is being optimized**
- **where it should live**
- **what the score means**
- **whether the score should be minimized or maximized**
- **whether scoring is deterministic Python or LLM-based judgment**

Helpful extra inputs:
- desired problem name
- target audience
- constraints / guardrails
- preferred agent strategy
- whether to use Zo as the editing agent

## Recommended workflow

### Phase 1 — Install / update DarwinDerby

Prefer installing the latest published package from PyPI rather than cloning the repo.

Recommended pattern:

```bash
uv tool install --force 'darwinderby[llm]'
```

Notes:
- Installing with `[llm]` is useful because it pulls in the `openai` and `anthropic` dependencies.
- The tool executable typically lands in `/root/.local/bin/derby`.
- If shell sessions do not see it, add `/root/.local/bin` to `PATH`.

### Phase 2 — Scaffold the problem

Use the CLI itself:

```bash
derby init <problem-name> --direction <minimize|maximize> --dir <parent-dir>
```

This creates a repo-like problem directory with:
- `problem.yaml`
- `agent_instructions.md`
- `state/`
- `context/`
- `scoring/score.py`
- `.autoanything/`
- `.gitignore`

Then:
- copy the target artifact into `state/`
- remove irrelevant scaffold files if needed
- rewrite `problem.yaml` for the real task
- add context docs into `context/`

### Phase 3 — Define scoring

Two major modes:

### A. Normal Python scoring

Use when the objective is directly measurable.
Examples:
- runtime
- test pass rate
- accuracy
- compression ratio
- file size
- benchmark throughput

Pattern (a minimal runnable sketch, here scoring file size, one of the measurable objectives above):
```python
from pathlib import Path

def score():
    # Inspect state/, compute a metric, and return the number the optimizer compares.
    state_dir = Path("state")  # see the project-root gotcha below for a more robust lookup
    total_bytes = sum(p.stat().st_size for p in state_dir.rglob("*") if p.is_file())
    return {"score": total_bytes}
```

### B. LLM-based scoring

Use when the target is subjective or multi-dimensional.
Examples:
- manifesto quality
- essay quality
- prompt quality
- API ergonomics
- clarity / readability / persuasion

Pattern:
- define dimensions
- define weights
- ask an LLM judge for structured sub-scores
- aggregate to a single normalized score
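A minimal sketch of the aggregation step in this pattern, assuming the judge has already returned integer sub-scores on a 0–100 scale; the dimension names and weights are illustrative, not part of DarwinDerby:

```python
# Illustrative dimensions and weights; replace with the real rubric.
WEIGHTS = {"clarity": 0.4, "structure": 0.35, "persuasion": 0.25}

def aggregate(sub_scores: dict[str, int]) -> float:
    # Weighted mean of 0-100 sub-scores, normalized to a single 0-1 score.
    weighted = sum(WEIGHTS[dim] * sub_scores[dim] for dim in WEIGHTS)
    return round(weighted / 100, 4)
```

For example, sub-scores of 72 / 60 / 55 aggregate to 0.6355.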

Important guidance:
- Keep the **aggregate score** explicit and simple.
- Keep the **sub-dimensions** interpretable.
- The best scoring functions include examples of strong vs weak performance.
- Tell the judge to use the full scale.
- Expect calibration work; the first scoring prompt often bunches scores too tightly.

### Vercel AI Gateway scoring

LLM judging via **Vercel AI Gateway** is a valid pattern and works well for this kind of problem.

Basic Python approach:
- use the OpenAI Python client
- set `base_url="https://ai-gateway.vercel.sh/v1"`
- read `VERCEL_AI_GATEWAY_API_KEY` from env
- set a default model constant at the top of the file
- request structured JSON output from the judge

Typical setup:
```python
import os

from openai import OpenAI

DEFAULT_MODEL = "openai/gpt-5.4"
client = OpenAI(
    api_key=os.environ["VERCEL_AI_GATEWAY_API_KEY"],
    base_url="https://ai-gateway.vercel.sh/v1",
)
```
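Continuing from the setup above, one way to request structured sub-scores is OpenAI-style JSON mode. Whether a given model behind the gateway honors `response_format`, and the exact dimensions named in the prompt, are assumptions to verify against your rubric:

```python
import json

def judge(draft: str) -> dict:
    # Reuses `client` and DEFAULT_MODEL from the setup above.
    resp = client.chat.completions.create(
        model=DEFAULT_MODEL,
        messages=[
            {
                "role": "system",
                "content": (
                    "Score the draft on clarity, structure, and persuasion, each 0-100. "
                    'Reply with JSON only, e.g. {"clarity": 72, "structure": 60, "persuasion": 55}.'
                ),
            },
            {"role": "user", "content": draft},
        ],
        response_format={"type": "json_object"},
    )
    # Parse the sub-scores so they can be aggregated deterministically.
    return json.loads(resp.choices[0].message.content)
```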

This is especially useful when the score depends on dimensions like:
- conciseness
- interestingness
- excitement
- language rigor
- readability
- structure

But the same general approach works for many subjective artifacts.

### Scoring design advice

When writing an LLM scoring function:
- prefer a handful of dimensions over too many
- define each dimension precisely
- include examples of high vs low performance
- describe score bands clearly
- distinguish local edits from holistic improvements
- explicitly state what major improvements usually come from

Common failure mode:
- the judge collapses many drafts to the same score tuple

Common fixes:
- move from coarse 1–10 integers to finer 0–100 integers
- add stronger calibration language
- add better examples
- encourage major structural improvements over line edits

### Phase 4 — Write the optimization agent

The agent passed to `derby run -a ...` can be any command.

A strong general pattern is a **Zo agent wrapper** in Python that:
- gathers problem context
- reads the current state artifact
- reads `problem.yaml`, `context/`, `leaderboard.md`, `history.md`
- sends a detailed editing task to Zo via `/zo/ask`
- instructs Zo to directly modify files inside `state/`
- instructs Zo to make the git commit itself

Important: prefer **delegating the task directly** to Zo, rather than forcing Zo to return a JSON blob containing replacement file contents.

That means the wrapper should:
- assemble context
- call `/zo/ask`
- ask Zo to edit the file in place with tools
- ask Zo to `git add` and `git commit`
- print Zo’s plain-text result

This is cleaner and more robust than schema-constraining the agent’s output.

Required env:
- `ZO_CLIENT_IDENTITY_TOKEN`

Model pattern:
```python
MODEL_NAME = "openai:gpt-5.4-2026-03-05"
API_URL = "https://api.zo.computer/zo/ask"
```
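Putting these pieces together, a minimal wrapper sketch. The request payload fields (`model`, `message`), the `Authorization` header scheme, and reading the reply from the response body are assumptions; check them against the actual `/zo/ask` API before relying on this:

```python
import os
from pathlib import Path

import requests

MODEL_NAME = "openai:gpt-5.4-2026-03-05"
API_URL = "https://api.zo.computer/zo/ask"

def main() -> None:
    # Assemble context from whichever problem files exist.
    context_files = ["problem.yaml", "agent_instructions.md", "leaderboard.md", "history.md"]
    context = "\n\n".join(Path(f).read_text() for f in context_files if Path(f).exists())
    prompt = (
        f"{context}\n\n"
        "Improve the artifact by editing the files inside state/ directly with your tools, "
        "then run `git add` and `git commit` yourself. Reply with a short plain-text summary."
    )
    # NOTE: payload shape and auth header below are assumed, not confirmed by the /zo/ask docs.
    resp = requests.post(
        API_URL,
        headers={"Authorization": f"Bearer {os.environ['ZO_CLIENT_IDENTITY_TOKEN']}"},
        json={"model": MODEL_NAME, "message": prompt},
        timeout=600,
    )
    resp.raise_for_status()
    print(resp.text)  # surface Zo's plain-text result in the derby run log

if __name__ == "__main__":
    main()
```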

### Phase 5 — Validate and baseline

After setup, always do:

```bash
derby validate
derby score
derby evaluate --baseline-only
```

Meaning:
- `validate` checks structure
- `score` runs the current scoring function once
- `evaluate --baseline-only` records the baseline into `.autoanything/history.db`

### Phase 6 — Start slow

Before running a lot of iterations:
1. run `derby score` manually
2. run `derby run -a "python <agent>.py" -n 1`
3. inspect the result
4. confirm the score is sensible
5. then try `-n 2` or `-n 5`
6. only later do `-n 10+`

This matters because many failures are setup failures, not search failures.

## How AutoAnything state works

For `derby run`:
- each iteration creates a temporary proposal branch
- the agent edits `state/`
- the branch is scored
- if accepted, it merges to `main`
- if rejected, it is discarded
- later attempts start from the current accepted `main`

So:
- rejected attempts do **not** accumulate state
- accepted attempts become the new starting point
- local proposal branches are usually deleted after evaluation

Persistent artifacts:
- `.autoanything/history.db`
- `history.md`
- `leaderboard.md`
- accepted commits on `main`

## Important operational gotchas

### 1. Tool environment vs system Python
If `derby` is installed as a uv tool, its Python environment is separate from system Python.
If your scoring function imports `openai`, that package must exist in the **tool env** used by `derby`, not just in system Python.

Best fix:
```bash
uv tool install --force 'darwinderby[llm]'
```

### 2. Hidden scoring directory during `run`
`derby run` may move `scoring/` out of sight during agent execution.
A robust `scoring/score.py` should discover the project root dynamically rather than assuming a fixed relative path.
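A sketch of one way to do that, assuming `problem.yaml` always sits at the project root (walk upward from the scoring module, falling back to the working directory):

```python
from pathlib import Path

def find_project_root() -> Path:
    # Walk upward from this file (or cwd) until a directory containing problem.yaml appears.
    for start in (Path(__file__).resolve().parent, Path.cwd()):
        for candidate in [start, *start.parents]:
            if (candidate / "problem.yaml").exists():
                return candidate
    raise FileNotFoundError("could not locate problem.yaml above scoring/ or the working directory")

STATE_DIR = find_project_root() / "state"
```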

### 3. Dirty repo causes invalid attempts
If `problem.yaml`, `context/`, agent files, or other non-state files are locally modified when you start `run`, attempts can be rejected as invalid.
Commit or stash non-state changes before looping.

### 4. LLM score calibration is iterative
If many attempts tie exactly, the loop may be working fine but the scoring function is too coarse or bunched.
Treat scoring prompt design as part of the optimization problem.

## Suggested implementation checklist

When running this skill for a new problem:

1. understand the artifact + score objective
2. install DarwinDerby from PyPI
3. scaffold the problem with `derby init`
4. move the real artifact into `state/`
5. rewrite `problem.yaml`
6. write `context/` docs if the agent/judge need them
7. write `scoring/score.py`
8. optionally write a Zo-based agent wrapper
9. run `derby validate`
10. run `derby score`
11. run `derby evaluate --baseline-only`
12. run `derby run -a ... -n 1`
13. inspect history / leaderboard / git log
14. iterate on scoring and agent prompts before scaling up

## What this skill should produce for the user

For a successful setup, leave them with:
- a fully scaffolded AutoAnything problem directory
- a usable `problem.yaml`
- a usable scoring function
- a usable agent command
- a baseline in the history DB
- one or two smoke-tested optimization iterations
- notes about known issues and next steps

## Output / project conventions

When implementing this skill in a real workspace:
- put the new problem in its own directory
- keep target artifacts in `state/`
- keep rubric and judge prompts in `context/`
- keep the scoring code in `scoring/`
- leave an `AGENTS.md` inside the problem directory documenting the setup and current state

## Suggested commands

```bash
# install latest tool from PyPI
uv tool install --force 'darwinderby[llm]'

# scaffold a new problem
derby init my-problem --direction maximize --dir /home/workspace
cd /home/workspace/my-problem

# validate and baseline
derby validate
derby score
derby evaluate --baseline-only

# first tiny run
derby run -a "python zo_agent.py" -n 1
```

## Bottom line
This skill is for setting up a new DarwinDerby optimization problem end-to-end.
It treats scoring design as first-class, supports both deterministic and LLM-based scoring, and encourages using Zo itself as the optimization agent when that is the most capable editor for the task.
