Run standardized skill evaluations against your AI agent. Get scored by an LLM judge. Submit your results to the public leaderboard. Scores reflect a single evaluation run — not a comprehensive assessment.
```shell
proofagent skill submit <path>
```

Install proofagent, run the skill proof, submit it. That's it.
```shell
pip install proofagent
```
```shell
# Test all 5 skills (coding, reasoning, math, writing, summarization)
proofagent skill run claude-sonnet-4-6 --skill all

# Or test just one skill
proofagent skill run claude-sonnet-4-6 --skill coding

# Use a different judge model
proofagent skill run gpt-4o --skill all --judge anthropic/claude-sonnet-4-6
```
```
proofagent skill run — Skill Proof

Model: claude-sonnet-4-6
Judge: openai/gpt-4o-mini

Running 5 skill packs: coding, reasoning, math, writing, summarization

Skill              Score   Grade
------------------ ------- -------
Coding             9.2/10  A+
Reasoning          9.0/10  A+
Math               8.8/10  A
Writing            9.4/10  A+
Summarization      9.2/10  A+

Overall: 9.1/10 — Grade: A+
Cost: $0.0847

Saved: .proofagent/skills/skill_claude-sonnet-4-6_20260318_120000.json
Fingerprint: a3f8c2d1e5b74901
```
```shell
# Submit the proof file
proofagent skill submit .proofagent/skills/skill_claude-sonnet-4-6_20260318_120000.json

# Or submit from a URL
proofagent skill submit https://your-site.com/proof.json
```
```
proofagent skill submit

Model: claude-sonnet-4-6
Fingerprint: a3f8c2d1e5b74901

Coding             9.2/10  A+
Reasoning          9.0/10  A+
Math               8.8/10  A
Writing            9.4/10  A+
Summarization      9.2/10  A+

Overall: 9.1/10 — Grade: A+

Added claude-sonnet-4-6 to leaderboard
Saved: .proofagent/leaderboard.json

#    Model                     Score    Grade
---- ------------------------- -------- -------
1    claude-sonnet-4-6         9.1      A+   <-
```
Add skill assertions directly in your pytest suite with the skill_score() assertion.
```python
from proofagent import expect, LLMResult

def test_agent_can_code():
    result = my_agent.run("Write a fibonacci function")
    expect(result).skill_score(
        rubric="Correct fibonacci implementation, handles edge cases, clean code",
        min_score=7,
    )

def test_agent_can_summarize():
    result = my_agent.run("Summarize the key points of this document: ...")
    expect(result).skill_score(
        rubric="Accurate, concise summary covering all main points",
        min_score=6,
    )
```
Paste a proof URL or the raw JSON from your proof file. We'll validate it and add your model to the board.
Note: Submissions are stored locally in your browser. A persistent leaderboard is coming soon.
5 challenges per skill, each with a rubric defining what "good" looks like. No ambiguity.
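The challenge pack format isn't documented on this page; to make the structure concrete, here is a hypothetical sketch of one pack (the field names `skill`, `id`, `prompt`, and `rubric` are assumptions, not proofagent's actual schema):

```python
# Hypothetical challenge entry; real proofagent skill packs may differ.
challenge = {
    "skill": "coding",
    "id": "coding-01",
    "prompt": "Write a fibonacci function",
    "rubric": "Correct implementation, handles edge cases, clean code",
}

# A pack is 5 such challenges for one skill, each with its own rubric.
pack = [dict(challenge, id=f"coding-{i:02d}") for i in range(1, 6)]
print(len(pack))  # 5
```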
An LLM judge scores each response 0-10 against the rubric at temperature 0. Scores may vary slightly between runs.
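The judge's exact prompt is not part of this page; a minimal sketch of rubric-based 0-10 scoring (the prompt wording and `SCORE:` reply convention are assumptions for illustration, not proofagent's internals) might look like:

```python
import re

def build_judge_prompt(rubric: str, response: str) -> str:
    """Assemble the judging prompt; the judge itself is run at temperature 0."""
    return (
        "Score the response from 0 to 10 against this rubric.\n"
        f"Rubric: {rubric}\n"
        f"Response: {response}\n"
        "Reply with only: SCORE: <number>"
    )

def parse_score(judge_reply: str) -> float:
    """Extract the numeric score from the judge's reply, clamped to 0-10."""
    match = re.search(r"SCORE:\s*([0-9]+(?:\.[0-9]+)?)", judge_reply)
    if match is None:
        raise ValueError("judge reply did not contain a score")
    return min(max(float(match.group(1)), 0.0), 10.0)

print(parse_score("SCORE: 9.2"))  # 9.2
```

Even at temperature 0, clamping and strict parsing matter: a judge that rambles or returns an out-of-range number should fail loudly rather than corrupt the proof.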
Results get a SHA-256 fingerprint for tamper detection. This is an evaluation report, not a certification.
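The proof file's exact hashing scheme isn't specified here; one plausible way to derive a 16-hex-character fingerprint like the one above is a SHA-256 over the canonical JSON, truncated (this is a sketch of the general technique, not proofagent's actual implementation):

```python
import hashlib
import json

def fingerprint(proof: dict) -> str:
    """SHA-256 over canonical JSON (sorted keys, no whitespace), first 16 hex chars."""
    canonical = json.dumps(proof, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]

proof = {"model": "claude-sonnet-4-6", "overall": 9.1}
fp = fingerprint(proof)
print(len(fp))  # 16

# Any edit to the proof changes the hash, which is how tampering is detected.
tampered = {"model": "claude-sonnet-4-6", "overall": 9.9}
print(fingerprint(tampered) == fp)  # False
```

Canonicalizing before hashing (sorted keys, fixed separators) is what makes the fingerprint stable across serializers; hashing raw, unordered JSON would flag harmless re-serialization as tampering.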
Three commands. Verifiable scores. On the leaderboard.