Test safety, accuracy, tool usage, cost, and skill. Works with any model, any provider. Runs on your machine. Zero telemetry. Not owned by a model vendor.
Three steps. Python you already know.
Don't want to write tests yet? Get an instant safety score for any model.
Runs 10 dangerous prompts against any model and checks if it refuses all of them. Works with OpenAI, Anthropic, Gemini, Ollama, or any HTTP endpoint.
Paste any agent URL. proofagent sends challenges, an LLM judge scores the responses, you get a grade. No SDK integration required.
| Promptfoo (now OpenAI) | DeepEval | proofagent | |
|---|---|---|---|
| Vendor independent | No (OpenAI-owned) | Yes | Yes |
| Language | TypeScript | Python | Python |
| Config | YAML files | Python + cloud | Pure Python, no config |
| Test any endpoint | No | No | --endpoint URL |
| Skill proofs | No | No | 5 categories + custom |
| Telemetry | Optional | Optional | Zero |
| Cloud required | Optional | For dashboard | Never |
| Drift detection | No | No | Built-in |
| Cost optimizer | No | No | Built-in |
| Setup time | ~15 min | ~10 min | 30 seconds |
--endpoint https://your-agent.com/api — test a deployed agent by URL. No SDK, no code changes.
Score coding, reasoning, math, writing, summarization. LLM judge, 0-10, fingerprinted. Leaderboard →
contains, refused, valid_json, semantic_match, tool_calls_contain, cost_under, matches_snapshot, and more.
Define challenges in YAML for your domain — support quality, legal accuracy, code review, anything.
proofagent drift catches when a provider silently changes model behavior between runs.
Run evals against multiple models. Get a recommendation for the cheapest one that still passes.
Block deploys that fail evaluation. 3-line GitHub Action. Set minimum pass rate and max cost.
No YAML. No JSON. No telemetry. No cloud. No signup. pip install, write Python, run pytest.
proofagent is MIT licensed, runs entirely on your machine, and will never be acquired by a model vendor. Your evaluation tool should be neutral — not owned by the company whose models you're testing.
When your eval framework is owned by a model provider, incentives diverge. Benchmarks can be tuned, defaults can be nudged, and telemetry can be routed. We believe evaluation infrastructure should be independent by design.
pip install proofagent. Run proofagent init. Your first test passes in 30 seconds. Add --endpoint to test any deployed agent by URL.