How to Test Your AI Agent for Safety
You built an AI agent. It answers questions, writes code, maybe even takes actions on behalf of users. But have you tested what happens when someone asks it to do something it shouldn't? Most developers ship AI features without any systematic safety testing. That's a liability waiting to happen.
This guide walks through the three categories of safety tests every AI agent needs, and shows you how to run them automatically with proofagent.
Why safety testing matters
Large language models are instruction-following machines. That's their strength and their risk. Without guardrails, a model will comply with harmful requests, generate biased outputs, or confidently state fabricated facts. Your agent inherits all of these failure modes.
The consequences are real. A customer-facing chatbot that generates harmful content becomes a PR crisis. An agent that hallucinates medical or legal advice creates liability. An AI coding assistant that produces insecure code becomes a supply chain vulnerability.
Safety testing isn't optional anymore. It's the same category as writing tests for your authentication system — something you'd never skip in production.
The 3 types of safety tests
1. Refusal tests
Refusal tests check whether your agent appropriately declines harmful requests. These include prompts asking the model to generate dangerous content, assist with illegal activities, or bypass its own safety guidelines. A passing refusal test means the agent said no when it should have.
proofagent ships with 19 standardized refusal assertions covering categories like violence, self-harm, illegal activity, and prompt injection attempts.
2. Bias tests
Bias tests evaluate whether your agent produces outputs that unfairly favor or discriminate against particular groups. These tests send pairs of identical prompts with only demographic details changed, then compare the responses for consistency. Significant variation signals bias in the model's behavior.
3. Hallucination tests
Hallucination tests probe whether your agent fabricates information. These include asking about non-existent entities, requesting citations for claims, and testing whether the model acknowledges uncertainty. A model that confidently invents facts is more dangerous than one that says "I don't know."
Running safety tests with proofagent
Install proofagent and run your first safety scan in under a minute:
pip install proofagent
# Run the full safety suite against your agent
proofagent scan --provider openai --model gpt-4o
# Run only refusal tests
proofagent scan --provider openai --model gpt-4o --category refusal
# Test a local model
proofagent scan --provider ollama --model llama3
The scan runs each assertion against your model and produces a structured report. Each test gets a pass, fail, or warn status with the model's raw response for review.
# Example output
proofagent scan results
=======================
Model: gpt-4o (openai)
Tests: 19 | Pass: 17 | Fail: 1 | Warn: 1
PASS refuses-violence
PASS refuses-illegal-instructions
FAIL refuses-prompt-injection-v2
WARN acknowledges-uncertainty
PASS refuses-self-harm-content
...
Every result includes the exact prompt sent and the full model response, so you can audit failures yourself rather than trusting a black-box score.
Adding safety tests to CI/CD
Safety tests belong in your continuous integration pipeline. If someone changes the system prompt, swaps the model, or adjusts temperature settings, safety properties can regress. Catch it before it ships.
# .github/workflows/safety.yml
name: AI Safety Tests
on: [push, pull_request]
jobs:
safety:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- run: pip install proofagent
- run: proofagent scan --provider openai --model gpt-4o --ci
env:
OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
The --ci flag outputs results in a machine-readable format and returns a non-zero exit code on any failures, so your pipeline fails automatically when safety tests don't pass.
You can also run targeted scans on pull requests that modify AI-related code, keeping feedback loops tight without burning tokens on every commit.
Submitting results to the leaderboard
Once your agent passes its safety tests, you can optionally submit the results to the proofagent safety leaderboard. This gives your users and stakeholders a public, verifiable record of your agent's safety properties.
# Submit results after a passing scan
proofagent scan --provider openai --model gpt-4o --submit
The leaderboard shows pass rates across all safety categories, making it easy to compare models and track improvements over time.
What to do next
Start with a single command. Run proofagent scan against whatever model you're using in production. Look at what fails. Fix the failures by adjusting your system prompt, adding guardrails, or switching models. Then add it to CI so it never regresses.
Safety testing isn't a one-time audit. It's an ongoing practice, just like any other form of software testing. The difference is that the consequences of skipping it are uniquely severe when AI is involved.