01

You write a test. It's just Python.

A proofagent test is a regular Python function. No YAML. No config files. No DSL to learn. If you've used pytest, you already know how to use proofagent.

from proofagent import expect, LLMResult

def test_my_agent():
    result = LLMResult(text="The answer is 42", cost=0.003)
    expect(result).contains("42").total_cost_under(0.01)
Offline by default. You can test assertions without any API key by constructing LLMResult directly. This is how most teams start — test the assertion logic first, add live API calls later.
02

The provider calls the model and captures everything.

When you use the proofagent_run fixture, proofagent calls the LLM through a thin adapter layer and captures the full response: the generated text, any tool calls, token counts, dollar cost, and wall-clock latency.

All of this gets packed into an LLMResult object that your assertions run against.

Your test              Provider adapter              LLM API
    |                         |                         |
    |── proofagent_run() ────>|                         |
    |                         |── API call ────────────>|
    |                         |                         |
    |                         |<── response ────────────|
    |                         |                         |
    |                         | capture: text, tools,   |
    |                         | cost, latency, tokens   |
    |                         |                         |
    |<── LLMResult ───────────|                         |
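In plain Python, that capture step might look like the sketch below. The fake client, the `complete()` method, and the exact field names are assumptions for illustration, not proofagent's actual adapter code:

```python
import time
from dataclasses import dataclass, field

@dataclass
class LLMResult:
    # Field names mirror the attributes used in the docs above;
    # the real class may differ.
    text: str
    cost: float = 0.0
    latency: float = 0.0
    tool_calls: list = field(default_factory=list)
    tokens: int = 0

def run_and_capture(client, prompt):
    """Call the model and capture text, tools, cost, latency, tokens."""
    start = time.perf_counter()
    response = client.complete(prompt)  # hypothetical client API
    elapsed = time.perf_counter() - start
    return LLMResult(
        text=response["text"],
        cost=response.get("cost", 0.0),
        latency=elapsed,
        tool_calls=response.get("tool_calls", []),
        tokens=response.get("tokens", 0),
    )
```

The point is that latency and cost are measured once, at the adapter boundary, so every assertion downstream sees the same numbers.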

Providers are interchangeable. Switch from OpenAI to Anthropic by changing one env var. The test code stays the same.
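One way such an env-var switch can work is sketched below. The registry, the dotted adapter paths, and the PROOFAGENT_PROVIDER variable name are all assumptions, not proofagent's documented interface:

```python
import os

# Hypothetical adapter registry; the real adapters live in providers/.
ADAPTERS = {
    "openai": "providers.openai.OpenAIAdapter",
    "anthropic": "providers.anthropic.AnthropicAdapter",
    "gemini": "providers.gemini.GeminiAdapter",
    "ollama": "providers.ollama.OllamaAdapter",
}

def resolve_adapter() -> str:
    """Pick an adapter from the environment, defaulting to OpenAI."""
    name = os.environ.get("PROOFAGENT_PROVIDER", "openai").lower()
    if name not in ADAPTERS:
        raise ValueError(f"unknown provider: {name!r}")
    return ADAPTERS[name]
```

Because the test only ever sees an LLMResult, nothing in the test file changes when the adapter does.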

03

Assertions check the result. All chainable.

The expect() function wraps an LLMResult and returns an Expectation object with 19 built-in assertions. Every assertion returns self, so you chain them:

(
    expect(result)
    .contains("hello")              # text contains substring
    .tool_calls_contain("search")   # agent called this tool
    .no_tool_call("delete")         # agent did NOT call this
    .total_cost_under(0.10)         # cost below threshold
    .latency_under(5.0)             # responded in time
    .trajectory_length_under(10)    # didn't loop forever
)

Each assertion raises AssertionError with a clear message on failure. Pytest catches it, marks the test as failed, and shows you exactly what went wrong.

Custom assertions in one line. For anything the built-ins don't cover:
expect(result).custom("is_polite", lambda r: "please" in r.text.lower())
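The return-self pattern behind all of this fits in a few lines. Here is a simplified stand-in, not the real expect.py, showing two built-ins plus custom():

```python
class Expectation:
    """Minimal sketch of a chainable assertion wrapper."""

    def __init__(self, result):
        self.result = result

    def contains(self, substring):
        assert substring in self.result.text, (
            f"expected text to contain {substring!r}, got {self.result.text!r}"
        )
        return self  # returning self is what makes chaining work

    def total_cost_under(self, limit):
        assert self.result.cost < limit, (
            f"cost {self.result.cost} is not under {limit}"
        )
        return self

    def custom(self, name, predicate):
        assert predicate(self.result), f"custom assertion {name!r} failed"
        return self

def expect(result):
    return Expectation(result)
```

Every method either raises with a message naming the expectation, or hands the same wrapper back for the next check.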
04

pytest runs it. CI gates it. Dashboard shows it.

proofagent is a pytest plugin. It hooks into pytest's lifecycle to provide the proofagent_run fixture, register markers (so you can run, say, only safety tests), and print a results summary at the end of each run:

# Run all tests
$ pytest tests/ -v

# Run only safety tests
$ pytest tests/ -m safety

# Gate deploys on pass rate
$ proofagent gate --min-score 0.95 --block-on-fail

# View results in the browser
$ proofagent dashboard --test tests/

In CI, if any test fails or the pass rate drops below your threshold, the pipeline exits with code 1. Bad deploys don't ship.
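The gate's exit-code logic is simple enough to sketch. The function name and signature here are illustrative, not proofagent's internals:

```python
def gate_exit_code(passed: int, total: int, min_score: float) -> int:
    """Return 0 when the pass rate meets the threshold, 1 otherwise."""
    if total == 0:
        return 1  # no tests ran: treat as a failed gate
    pass_rate = passed / total
    return 0 if pass_rate >= min_score else 1
```

A CI step would wrap this in sys.exit(), so 19 of 20 tests passing clears a --min-score of 0.95, while 18 of 20 (a 0.90 pass rate) blocks the deploy.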

Architecture

The whole system in one diagram

Your test file
|
├── expect(result).contains().refused()
|   |
|   └── Assertion engine (expect.py)
|       Chainable checks, raises AssertionError on fail
|
├── proofagent_run("prompt", model="claude-sonnet-4-6")
|   |
|   └── Provider adapter (providers/)
|       OpenAI | Anthropic | Gemini | Ollama
|       Captures: text, tools, cost, latency, trajectory
|       Returns: LLMResult
|
└── pytest collects + runs
    |
    ├── Plugin (plugin.py) — fixtures, markers, summary
    ├── Dashboard — local web UI for results
    └── Gate — CI pass/fail threshold

That's it. No hidden complexity.

Install, write a test, run pytest. Takes 30 seconds to start.