Basics

What is proofagent?

An open-source Python framework for testing AI agents. You write assertions against LLM outputs — text content, tool calls, cost, latency, multi-step trajectories — and run them with pytest. Think of it as unit tests for AI.

How do I install it?

pip install proofagent

For provider support: pip install "proofagent[openai]", "proofagent[anthropic]", "proofagent[gemini]", or "proofagent[all]".

How do I write my first test?

from proofagent import expect, LLMResult

def test_output():
    result = LLMResult(text="The capital of France is Paris.")
    expect(result).contains("Paris")

Save as test_example.py, run pytest. That's it.

Do I need an API key?

No. You can test entirely offline by constructing LLMResult objects directly. API keys are only needed if you want to call a live model during tests using the proofagent_run fixture.

Is it free?

Yes. MIT license. Open-source and free to use. No telemetry, no cloud, no signup required.

How it works

What providers does it support?

OpenAI, Anthropic, Google Gemini, Ollama (local), and any OpenAI-compatible API. You can also wrap any model's output in an LLMResult and test it without a provider adapter.

How is it different from Promptfoo or DeepEval?

proofagent is pure Python with a pytest-style API. No YAML, no external CLI, no hosted platform. It's designed for agents, not just prompts — with first-class support for tool call testing, multi-step trajectory evaluation, and cost tracking. It also has zero telemetry and no vendor lock-in.

Promptfoo was acquired by OpenAI in March 2026. While it remains a capable tool, it is now owned by a model vendor, which creates a potential conflict of interest for unbiased evaluation. proofagent is MIT licensed, vendor-independent, and runs entirely on your machine. Your data never leaves your environment.

Can I test tool calls?

Yes. This is what makes proofagent different. You can assert that an agent called specific tools, didn't call others, and passed the right arguments:

expect(result).tool_calls_contain("search").no_tool_call("delete")

Can I test multi-step agent workflows?

Yes. Capture the full trajectory (user messages, assistant responses, tool calls and results) in an LLMResult and assert on the sequence length, specific tool calls, and final output.

How do I compare two models?

Use the proofagent_compare fixture or the CLI:

proofagent compare --model-a gpt-4.1-mini --model-b claude-sonnet-4-6 "Explain gravity"

This runs the same prompt on both models and shows a side-by-side comparison of output, cost, and latency.

How do I add custom assertions?

Inline with .custom() or register reusable ones:

# Inline
expect(result).custom("is_polite", lambda r: "please" in r.text.lower())

# Reusable
from proofagent import register_assertion
register_assertion("is_polite", lambda r: "please" in r.text.lower())

How do I load test cases from a file?

Use the built-in dataset loader with CSV or JSONL files:

from proofagent import Dataset

dataset = Dataset.from_file("tests/data/prompts.jsonl")
safety_cases = dataset.filter(tag="safety")
sample = dataset.sample(n=10, seed=42)
CI/CD & production

How do I use it in CI/CD?

Add two lines to your pipeline:

pip install "proofagent[all]"
proofagent gate --min-score 0.95 --block-on-fail

If the pass rate drops below your threshold, the pipeline exits with code 1. Works with GitHub Actions, GitLab CI, CircleCI, and anything that runs shell commands.

How do I see results in a dashboard?

proofagent dashboard --test tests/

This runs your tests and opens a local web UI showing pass/fail results, costs, and test descriptions.

Can I configure it?

Optional. Create proofagent.json or add [tool.proofagent] to your pyproject.toml to set defaults for provider, model, results directory, and minimum pass rate. Everything also works without any config file.

Why

Why should I test my AI agent?

LLM outputs are non-deterministic. A prompt change, model upgrade, or temperature tweak can silently break safety guardrails, hallucinate facts, or triple your API costs. Without automated tests, you won't know until users report it.

Why not just test manually?

You can't re-run 200 adversarial prompts by hand after every change. Automated tests run in seconds, catch regressions immediately, and produce a record you can review. Manual testing is for exploration. Automated testing is for confidence.

Does this help with compliance?

The EU AI Act requires organizations deploying high-risk AI to demonstrate testing and monitoring. proofagent can help you build a structured testing practice with auditable pass/fail records, but it is not a substitute for a comprehensive compliance program. Consult legal and compliance professionals for your specific obligations.

Who is this for?

Anyone shipping code that calls an LLM — chatbots, RAG pipelines, AI agents, copilots, content generators, trading bots. If your application depends on model output, you should be testing that output.

Who built this?

proofagent is open source and maintained at github.com/camgitt/proofagent. Contributions welcome.

Still have questions?

Open an issue on GitHub or read the source — it's under 2,000 lines.