An open-source Python framework for testing AI agents. You write assertions against LLM outputs — text content, tool calls, cost, latency, multi-step trajectories — and run them with pytest. Think of it as unit tests for AI.
pip install proofagent
For provider support: pip install "proofagent[openai]", "proofagent[anthropic]", "proofagent[gemini]", or "proofagent[all]".
from proofagent import expect, LLMResult def test_output(): result = LLMResult(text="The capital of France is Paris.") expect(result).contains("Paris")
Save as test_example.py, run pytest. That's it.
No. You can test entirely offline by constructing LLMResult objects directly. API keys are only needed if you want to call a live model during tests using the proofagent_run fixture.
Yes. MIT license. Open-source and free to use. No telemetry, no cloud, no signup required.
OpenAI, Anthropic, Google Gemini, Ollama (local), and any OpenAI-compatible API. You can also wrap any model's output in an LLMResult and test it without a provider adapter.
proofagent is pure Python with a pytest-style API. No YAML, no external CLI, no hosted platform. It's designed for agents, not just prompts — with first-class support for tool call testing, multi-step trajectory evaluation, and cost tracking. It also has zero telemetry and no vendor lock-in.
Promptfoo was acquired by OpenAI in March 2026. While it remains a capable tool, it is now owned by a model vendor, which creates a potential conflict of interest for unbiased evaluation. proofagent is MIT licensed, vendor-independent, and runs entirely on your machine. Your data never leaves your environment.
Yes. This is what makes proofagent different. You can assert that an agent called specific tools, didn't call others, and passed the right arguments:
expect(result).tool_calls_contain("search").no_tool_call("delete")
Yes. Capture the full trajectory (user messages, assistant responses, tool calls and results) in an LLMResult and assert on the sequence length, specific tool calls, and final output.
Use the proofagent_compare fixture or the CLI:
proofagent compare --model-a gpt-4.1-mini --model-b claude-sonnet-4-6 "Explain gravity"
This runs the same prompt on both models and shows a side-by-side comparison of output, cost, and latency.
Inline with .custom() or register reusable ones:
# Inline expect(result).custom("is_polite", lambda r: "please" in r.text.lower()) # Reusable from proofagent import register_assertion register_assertion("is_polite", lambda r: "please" in r.text.lower())
Use the built-in dataset loader with CSV or JSONL files:
from proofagent import Dataset dataset = Dataset.from_file("tests/data/prompts.jsonl") safety_cases = dataset.filter(tag="safety") sample = dataset.sample(n=10, seed=42)
Add two lines to your pipeline:
pip install "proofagent[all]" proofagent gate --min-score 0.95 --block-on-fail
If the pass rate drops below your threshold, the pipeline exits with code 1. Works with GitHub Actions, GitLab CI, CircleCI, and anything that runs shell commands.
proofagent dashboard --test tests/
This runs your tests and opens a local web UI showing pass/fail results, costs, and test descriptions.
Optional. Create proofagent.json or add [tool.proofagent] to your pyproject.toml to set defaults for provider, model, results directory, and minimum pass rate. Everything also works without any config file.
LLM outputs are non-deterministic. A prompt change, model upgrade, or temperature tweak can silently break safety guardrails, hallucinate facts, or triple your API costs. Without automated tests, you won't know until users report it.
You can't re-run 200 adversarial prompts by hand after every change. Automated tests run in seconds, catch regressions immediately, and produce a record you can review. Manual testing is for exploration. Automated testing is for confidence.
The EU AI Act requires organizations deploying high-risk AI to demonstrate testing and monitoring. proofagent can help you build a structured testing practice with auditable pass/fail records, but it is not a substitute for a comprehensive compliance program. Consult legal and compliance professionals for your specific obligations.
Anyone shipping code that calls an LLM — chatbots, RAG pipelines, AI agents, copilots, content generators, trading bots. If your application depends on model output, you should be testing that output.
proofagent is open source and maintained at github.com/camgitt/proofagent. Contributions welcome.
Open an issue on GitHub or read the source — it's under 2,000 lines.