01

You write a test. It's just Python.

A proofagent test is a regular Python function. No YAML. No config files. No DSL to learn. If you've used pytest, you already know how to use proofagent.

from proofagent import expect, LLMResult

def test_my_agent():
    result = LLMResult(text="The answer is 42", cost=0.003)
    expect(result).contains("42").total_cost_under(0.01)
Offline by default. You can test assertions without any API key by constructing LLMResult directly. This is how most teams start — test the assertion logic first, add live API calls later.
02

The provider calls the model and captures everything.

When you use the proofagent_run fixture, proofagent calls the LLM through a thin adapter layer and captures the full response: the generated text, any tool calls, token counts, dollar cost, and wall-clock latency.

All of this gets packed into an LLMResult object that your assertions run against.

Your test              Provider adapter              LLM API
    |                         |                         |
    |── proofagent_run() ────>|                         |
    |                         |── API call ────────────>|
    |                         |                         |
    |                         |<── response ────────────|
    |                         |                         |
    |                         | capture: text, tools,   |
    |                         | cost, latency, tokens   |
    |                         |                         |
    |<── LLMResult ───────────|                         |
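In plain Python, that capture step might look like the sketch below. The fake client, the `complete()` method, and the exact field names are assumptions for illustration, not proofagent's actual adapter code:

```python
import time
from dataclasses import dataclass, field

@dataclass
class LLMResult:
    # Field names mirror the attributes used in the docs above;
    # the real class may differ.
    text: str
    cost: float = 0.0
    latency: float = 0.0
    tool_calls: list = field(default_factory=list)
    tokens: int = 0

def run_and_capture(client, prompt):
    """Call the model and capture text, tools, cost, latency, tokens."""
    start = time.perf_counter()
    response = client.complete(prompt)  # hypothetical client API
    elapsed = time.perf_counter() - start
    return LLMResult(
        text=response["text"],
        cost=response.get("cost", 0.0),
        latency=elapsed,
        tool_calls=response.get("tool_calls", []),
        tokens=response.get("tokens", 0),
    )
```

The point is that latency and cost are measured once, at the adapter boundary, so every assertion downstream sees the same numbers.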

Providers are interchangeable. Switch from OpenAI to Anthropic by changing one env var. The test code stays the same.
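One way such an env-var switch can work is sketched below. The registry, the dotted adapter paths, and the PROOFAGENT_PROVIDER variable name are all assumptions, not proofagent's documented interface:

```python
import os

# Hypothetical adapter registry; the real adapters live in providers/.
ADAPTERS = {
    "openai": "providers.openai.OpenAIAdapter",
    "anthropic": "providers.anthropic.AnthropicAdapter",
    "gemini": "providers.gemini.GeminiAdapter",
    "ollama": "providers.ollama.OllamaAdapter",
}

def resolve_adapter() -> str:
    """Pick an adapter from the environment, defaulting to OpenAI."""
    name = os.environ.get("PROOFAGENT_PROVIDER", "openai").lower()
    if name not in ADAPTERS:
        raise ValueError(f"unknown provider: {name!r}")
    return ADAPTERS[name]
```

Because the test only ever sees an LLMResult, nothing in the test file changes when the adapter does.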

03

Assertions check the result. All chainable.

The expect() function wraps an LLMResult and returns an Expectation object with 19 built-in assertions. Every assertion returns self, so you chain them:

(
    expect(result)
    .contains("hello")              # text contains substring
    .tool_calls_contain("search")   # agent called this tool
    .no_tool_call("delete")         # agent did NOT call this
    .total_cost_under(0.10)         # cost below threshold
    .latency_under(5.0)             # responded in time
    .trajectory_length_under(10)    # didn't loop forever
)

Each assertion raises AssertionError with a clear message on failure. Pytest catches it, marks the test as failed, and shows you exactly what went wrong.

Custom assertions in one line. For anything the built-ins don't cover:
expect(result).custom("is_polite", lambda r: "please" in r.text.lower())
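The return-self pattern behind all of this fits in a few lines. Here is a simplified stand-in, not the real expect.py, showing two built-ins plus custom():

```python
class Expectation:
    """Minimal sketch of a chainable assertion wrapper."""

    def __init__(self, result):
        self.result = result

    def contains(self, substring):
        assert substring in self.result.text, (
            f"expected text to contain {substring!r}, got {self.result.text!r}"
        )
        return self  # returning self is what makes chaining work

    def total_cost_under(self, limit):
        assert self.result.cost < limit, (
            f"cost {self.result.cost} is not under {limit}"
        )
        return self

    def custom(self, name, predicate):
        assert predicate(self.result), f"custom assertion {name!r} failed"
        return self

def expect(result):
    return Expectation(result)
```

Every method either raises with a message naming the expectation, or hands the same wrapper back for the next check.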
04

pytest runs it. CI gates it. Dashboard shows it.

proofagent is a pytest plugin. It hooks into pytest's lifecycle to provide the proofagent_run fixture, register markers (so you can run, say, only safety tests), and print a results summary at the end of each run:

# Run all tests
$ pytest tests/ -v

# Run only safety tests
$ pytest tests/ -m safety

# Gate deploys on pass rate
$ proofagent gate --min-score 0.95 --block-on-fail

# View results in the browser
$ proofagent dashboard --test tests/

In CI, if any test fails or the pass rate drops below your threshold, the pipeline exits with code 1. Bad deploys don't ship.
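The gate's exit-code logic is simple enough to sketch. The function name and signature here are illustrative, not proofagent's internals:

```python
def gate_exit_code(passed: int, total: int, min_score: float) -> int:
    """Return 0 when the pass rate meets the threshold, 1 otherwise."""
    if total == 0:
        return 1  # no tests ran: treat as a failed gate
    pass_rate = passed / total
    return 0 if pass_rate >= min_score else 1
```

A CI step would wrap this in sys.exit(), so 19 of 20 tests passing clears a --min-score of 0.95, while 18 of 20 (a 0.90 pass rate) blocks the deploy.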

Architecture

The whole system in one diagram

Your test file
|
├── expect(result).contains().refused()
|   |
|   └── Assertion engine (expect.py)
|       Chainable checks, raises AssertionError on fail
|
├── proofagent_run("prompt", model="claude-sonnet-4-6")
|   |
|   └── Provider adapter (providers/)
|       OpenAI | Anthropic | Gemini | Ollama
|       Captures: text, tools, cost, latency, trajectory
|       Returns: LLMResult
|
└── pytest collects + runs
    |
    ├── Plugin (plugin.py) — fixtures, markers, summary
    ├── Dashboard — local web UI for results
    └── Gate — CI pass/fail threshold

That's it. No hidden complexity.

Install, write a test, run pytest. Takes 30 seconds to start.