Four concepts. No magic. You can read the entire source in 20 minutes.
A proofagent test is a regular Python function. No YAML. No config files. No DSL to learn. If you've used pytest, you already know how to use proofagent.
from proofagent import expect, LLMResult def test_my_agent(): result = LLMResult(text="The answer is 42", cost=0.003) expect(result).contains("42").total_cost_under(0.01)
LLMResult directly. This is how most teams start — test the assertion logic first, add live API calls later.
When you use the proofagent_run fixture, proofagent calls the LLM through a thin adapter layer and captures the full response:
All of this gets packed into an LLMResult object that your assertions run against.
Providers are interchangeable. Switch from OpenAI to Anthropic by changing one env var. The test code stays the same.
The expect() function wraps an LLMResult and returns an Expectation object with 19 built-in assertions. Every assertion returns self, so you chain them:
(
expect(result)
.contains("hello") # text contains substring
.tool_calls_contain("search") # agent called this tool
.no_tool_call("delete") # agent did NOT call this
.total_cost_under(0.10) # cost below threshold
.latency_under(5.0) # responded in time
.trajectory_length_under(10) # didn't loop forever
)
Each assertion raises AssertionError with a clear message on failure. Pytest catches it, marks the test as failed, and shows you exactly what went wrong.
expect(result).custom("is_polite", lambda r: "please" in r.text.lower())
proofagent is a pytest plugin. It hooks into pytest's lifecycle to:
proofagent_run, proofagent_compare, proofagent_dataset@pytest.mark.safety, @pytest.mark.agent for selective test runs# Run all tests $ pytest tests/ -v # Run only safety tests $ pytest tests/ -m safety # Gate deploys on pass rate $ proofagent gate --min-score 0.95 --block-on-fail # View results in the browser $ proofagent dashboard --test tests/
In CI, if any test fails or the pass rate drops below your threshold, the pipeline exits with code 1. Bad deploys don't ship.
Install, write a test, run pytest. Takes 30 seconds to start.