Proving Your AI Agent's Skills: A Developer's Guide
Safety leaderboards tell you what your AI agent refuses to do. They don't tell you what it's actually good at. An agent that scores 100% on refusal tests but can't write working code or summarize a document accurately isn't useful to anyone. You need both halves of the picture.
This guide explains how skill proofs work, how LLM-as-judge scoring produces reliable scores, and how to run skill evaluations with proofagent.
The problem: safety is not competence
Most AI evaluation frameworks focus on safety — does the model refuse harmful requests? That's necessary but insufficient. Your users chose your product because of what it can do, not what it won't do.
When a team is choosing between AI models, they need to answer questions like:
- Can this model actually write correct Python code?
- Does it follow multi-step reasoning without losing the thread?
- Can it summarize a 2,000-word document without hallucinating details?
- Does it produce coherent, well-structured prose?
These are skill questions. They require a different kind of evaluation than binary pass/fail safety tests.
What skill proofs are
A skill proof is a standardized evaluation that measures your AI agent's capability in a specific domain. proofagent ships with five built-in skill categories: coding, reasoning, math, writing, and summarization.
Each category contains multiple test prompts with defined evaluation criteria. Unlike safety tests (which are binary pass/fail), skill tests produce a score from 0 to 10, giving you a granular view of model capability.
How LLM-as-judge scoring works
Skill evaluation uses a technique called LLM-as-judge. Here's the flow:
1. Prompt: proofagent sends a skill challenge to your target model (e.g., "Write a Python function that merges two sorted lists").
2. Response: your model generates its answer.
3. Judging: a separate judge model (typically GPT-4o-mini) evaluates the response against a detailed rubric.
4. Score: the judge assigns a 0-10 score with a written rationale.
The judge model receives the original prompt, the rubric, and the response. It doesn't know which model produced the response, reducing bias. The written rationale makes every score auditable — you can read exactly why a response got a 7 instead of a 9.
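The blinded judging step can be sketched in a few lines. This is an illustrative outline, not proofagent's actual implementation: the helper names `build_judge_prompt` and `parse_judgment` are hypothetical, and the prompt wording is an assumption.

```python
# Illustrative sketch of the blinded LLM-as-judge step. The judge prompt
# contains the task, rubric, and response -- never the name of the model
# that produced the response.
import re


def build_judge_prompt(task: str, rubric: str, response: str) -> str:
    """Assemble the blinded judging prompt (no model identity included)."""
    return (
        "You are grading an AI response against a rubric.\n\n"
        f"Task:\n{task}\n\n"
        f"Rubric:\n{rubric}\n\n"
        f"Response:\n{response}\n\n"
        "Reply with 'Score: <0-10>' followed by a short rationale."
    )


def parse_judgment(judge_output: str) -> tuple[int, str]:
    """Extract the 0-10 score; keep the rationale so the score is auditable."""
    match = re.search(r"Score:\s*(\d+)", judge_output)
    if not match:
        raise ValueError("judge output missing a score")
    score = int(match.group(1))
    if not 0 <= score <= 10:
        raise ValueError(f"score {score} out of range")
    return score, judge_output[match.end():].strip()
```

Keeping the rationale alongside the numeric score is what makes each result reviewable after the fact.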
Research shows that LLM-as-judge correlates well with human evaluations when the rubric is specific. proofagent's rubrics have been calibrated against human raters to minimize scoring drift.
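One standard way to quantify judge/human agreement is Spearman rank correlation between the judge's scores and human ratings of the same responses. The sketch below is purely illustrative; it says nothing about how proofagent calibrates its rubrics internally.

```python
# Spearman rank correlation between two score lists -- a common sanity
# check for LLM-as-judge agreement with human raters.
# Note: this simple form assumes no tied scores within a list.
def spearman(xs: list[float], ys: list[float]) -> float:
    """Spearman rank correlation (no tie handling)."""
    def ranks(vals: list[float]) -> list[int]:
        order = sorted(range(len(vals)), key=lambda i: vals[i])
        r = [0] * len(vals)
        for rank, i in enumerate(order, start=1):
            r[i] = rank
        return r

    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))
```

A value near 1.0 means the judge ranks responses in nearly the same order as humans do; values near 0 suggest the rubric is too vague to score reliably.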
Running skill proofs
Run skill evaluations with a single command:
```shell
# Run all skill evaluations
proofagent skills --provider openai --model gpt-4o

# Run a specific skill category
proofagent skills --provider openai --model gpt-4o --category coding

# Compare two models head-to-head
proofagent skills --provider openai --model gpt-4o --compare gpt-4o-mini
```
The output shows per-category scores with the judge's rationale for each:
```text
proofagent skill results
========================
Model: gpt-4o (openai)

Coding          8.4 / 10  (5 tests)
Reasoning       7.9 / 10  (4 tests)
Math            9.1 / 10  (4 tests)
Writing         8.0 / 10  (3 tests)
Summarization   8.6 / 10  (3 tests)

Overall         8.4 / 10
```
Submitting to the leaderboard
Once you've run skill proofs, submit the results to the proofagent skills leaderboard to create a public record of your model's capabilities:
```shell
proofagent skills --provider openai --model gpt-4o --submit
```
The leaderboard shows scores across all five categories, making it easy to compare models at a glance. It's the skills counterpart to the safety leaderboard — together they give a complete picture of what a model will and won't do.
Custom skill packs for your domain
The built-in skill categories cover general capabilities, but your agent probably has domain-specific requirements. A legal AI needs to parse case citations. A medical agent needs to interpret lab results. A DevOps agent needs to write correct Kubernetes manifests.
proofagent supports custom skill packs — YAML files that define your own evaluation prompts and rubrics:
```yaml
# my-skills.yaml
name: kubernetes-ops
tests:
  - prompt: |
      Write a Kubernetes deployment manifest for a
      Node.js app with 3 replicas, resource limits,
      and a readiness probe on /health.
    rubric: |
      Score 0-10 based on:
      - Valid YAML syntax (2 pts)
      - Correct apiVersion and kind (2 pts)
      - 3 replicas specified (2 pts)
      - Resource limits present (2 pts)
      - Readiness probe on /health (2 pts)
    category: kubernetes
```

```shell
# Run with custom skill pack
proofagent skills --provider openai --model gpt-4o --pack my-skills.yaml
```
Custom skill packs let you test exactly the capabilities that matter for your use case. They follow the same LLM-as-judge scoring pipeline, so results are consistent and auditable.
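To see how per-test judge scores roll up into the per-category averages shown in the results table, here is a minimal aggregation sketch. The `aggregate_scores` helper and its input field names are hypothetical, not part of proofagent's API.

```python
# Hypothetical sketch: averaging per-test 0-10 judge scores into the
# per-category results shown in the output table above.
from collections import defaultdict


def aggregate_scores(results: list[dict]) -> dict[str, float]:
    """Average 0-10 scores per category, rounded to one decimal place."""
    by_category: dict[str, list[float]] = defaultdict(list)
    for r in results:
        by_category[r["category"]].append(r["score"])
    return {cat: round(sum(s) / len(s), 1) for cat, s in by_category.items()}
```

Because custom packs flow through the same scoring pipeline, their results aggregate exactly like the built-in categories.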
Putting it all together
A complete AI evaluation strategy tests both safety and skills. Start with the built-in suites to establish a baseline, then add custom skill packs for your domain. Run safety tests in CI on every change. Run skill evaluations weekly or when switching models.
The goal isn't a perfect score. It's knowing exactly where your agent excels and where it falls short, so you can make informed decisions about what to ship and what to hold back.
Explore the skills leaderboard to see how current models compare, then run the evaluation yourself.