Run standardized skill evaluations against your AI agent. Get scored by an LLM judge. Submit your results to the public leaderboard. Scores reflect a single evaluation run — not a comprehensive assessment.
```shell
proofagent skill submit <path>
```

Install proofagent, run the skill proof, submit it. That's it.
```shell
pip install proofagent
```
```shell
# Test all 5 skills (coding, reasoning, math, writing, summarization)
proofagent skill run claude-sonnet-4-6 --skill all

# Or test just one skill
proofagent skill run claude-sonnet-4-6 --skill coding

# Use a different judge model
proofagent skill run gpt-4o --skill all --judge anthropic/claude-sonnet-4-6
```
```
proofagent skill run — Skill Proof

Model: claude-sonnet-4-6
Judge: openai/gpt-4o-mini

Running 5 skill packs: coding, reasoning, math, writing, summarization

Skill              Score   Grade
------------------ ------- -------
Coding             9.2/10  A+
Reasoning          9.0/10  A+
Math               8.8/10  A
Writing            9.4/10  A+
Summarization      9.2/10  A+

Overall: 9.1/10 — Grade: A+
Cost: $0.0847

Saved: .proofagent/skills/skill_claude-sonnet-4-6_20260318_120000.json
Fingerprint: a3f8c2d1e5b74901
```
```shell
# Submit the proof file
proofagent skill submit .proofagent/skills/skill_claude-sonnet-4-6_20260318_120000.json

# Or submit from a URL
proofagent skill submit https://your-site.com/proof.json
```
```
proofagent skill submit

Model: claude-sonnet-4-6
Fingerprint: a3f8c2d1e5b74901

Coding             9.2/10  A+
Reasoning          9.0/10  A+
Math               8.8/10  A
Writing            9.4/10  A+
Summarization      9.2/10  A+

Overall: 9.1/10 — Grade: A+

Added claude-sonnet-4-6 to leaderboard
Saved: .proofagent/leaderboard.json

#    Model                     Score    Grade
---- ------------------------- -------- -------
1    claude-sonnet-4-6         9.1      A+   <-
```
Add skill assertions directly in your pytest suite with the skill_score() assertion.
```python
from proofagent import expect, LLMResult

def test_agent_can_code():
    result = my_agent.run("Write a fibonacci function")
    expect(result).skill_score(
        rubric="Correct fibonacci implementation, handles edge cases, clean code",
        min_score=7,
    )

def test_agent_can_summarize():
    result = my_agent.run("Summarize the key points of this document: ...")
    expect(result).skill_score(
        rubric="Accurate, concise summary covering all main points",
        min_score=6,
    )
```
Paste a proof URL or the raw JSON from your proof file. We'll validate it and add your model to the board.
Note: Submissions are stored locally in your browser. A persistent leaderboard is coming soon.
5 challenges per skill, each with a rubric defining what "good" looks like. No ambiguity.
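The challenge pack format isn't documented on this page; to make the structure concrete, here is a hypothetical sketch of one pack (the field names `skill`, `id`, `prompt`, and `rubric` are assumptions, not proofagent's actual schema):

```python
# Hypothetical challenge entry; real proofagent skill packs may differ.
challenge = {
    "skill": "coding",
    "id": "coding-01",
    "prompt": "Write a fibonacci function",
    "rubric": "Correct implementation, handles edge cases, clean code",
}

# A pack is 5 such challenges for one skill, each with its own rubric.
pack = [dict(challenge, id=f"coding-{i:02d}") for i in range(1, 6)]
print(len(pack))  # 5
```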
An LLM judge scores each response 0-10 against the rubric at temperature 0. Scores may vary slightly between runs.
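The judge's exact prompt is not part of this page; a minimal sketch of rubric-based 0-10 scoring (the prompt wording and `SCORE:` reply convention are assumptions for illustration, not proofagent's internals) might look like:

```python
import re

def build_judge_prompt(rubric: str, response: str) -> str:
    """Assemble the judging prompt; the judge itself is run at temperature 0."""
    return (
        "Score the response from 0 to 10 against this rubric.\n"
        f"Rubric: {rubric}\n"
        f"Response: {response}\n"
        "Reply with only: SCORE: <number>"
    )

def parse_score(judge_reply: str) -> float:
    """Extract the numeric score from the judge's reply, clamped to 0-10."""
    match = re.search(r"SCORE:\s*([0-9]+(?:\.[0-9]+)?)", judge_reply)
    if match is None:
        raise ValueError("judge reply did not contain a score")
    return min(max(float(match.group(1)), 0.0), 10.0)

print(parse_score("SCORE: 9.2"))  # 9.2
```

Even at temperature 0, clamping and strict parsing matter: a judge that rambles or returns an out-of-range number should fail loudly rather than corrupt the proof.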
Results get a SHA-256 fingerprint for tamper detection. This is an evaluation report, not a certification.
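The proof file's exact hashing scheme isn't specified here; one plausible way to derive a 16-hex-character fingerprint like the one above is a SHA-256 over the canonical JSON, truncated (this is a sketch of the general technique, not proofagent's actual implementation):

```python
import hashlib
import json

def fingerprint(proof: dict) -> str:
    """SHA-256 over canonical JSON (sorted keys, no whitespace), first 16 hex chars."""
    canonical = json.dumps(proof, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]

proof = {"model": "claude-sonnet-4-6", "overall": 9.1}
fp = fingerprint(proof)
print(len(fp))  # 16

# Any edit to the proof changes the hash, which is how tampering is detected.
tampered = {"model": "claude-sonnet-4-6", "overall": 9.9}
print(fingerprint(tampered) == fp)  # False
```

Canonicalizing before hashing (sorted keys, fixed separators) is what makes the fingerprint stable across serializers; hashing raw, unordered JSON would flag harmless re-serialization as tampering.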
Three commands. Verifiable scores. On the leaderboard.