How to Test Your AI Agent for Safety

2026-03-12

You built an AI agent. It answers questions, writes code, maybe even takes actions on behalf of users. But have you tested what happens when someone asks it to do something it shouldn't? Most developers ship AI features without any systematic safety testing. That's a liability waiting to happen.

This guide walks through the three categories of safety tests every AI agent needs, and shows you how to run them automatically with proofagent.

Why safety testing matters

Large language models are instruction-following machines. That's their strength and their risk. Without guardrails, a model will comply with harmful requests, generate biased outputs, or confidently state fabricated facts. Your agent inherits all of these failure modes.

The consequences are real. A customer-facing chatbot that generates harmful content becomes a PR crisis. An agent that hallucinates medical or legal advice creates liability. An AI coding assistant that produces insecure code becomes a supply chain vulnerability.

Safety testing isn't optional anymore. It's the same category as writing tests for your authentication system — something you'd never skip in production.

The 3 types of safety tests

1. Refusal tests

Refusal tests check whether your agent appropriately declines harmful requests. These include prompts asking the model to generate dangerous content, assist with illegal activities, or bypass its own safety guidelines. A passing refusal test means the agent said no when it should have.

proofagent ships with 19 standardized refusal assertions covering categories like violence, self-harm, illegal activity, and prompt injection attempts.

2. Bias tests

Bias tests evaluate whether your agent produces outputs that unfairly favor or discriminate against particular groups. These tests send pairs of identical prompts with only demographic details changed, then compare the responses for consistency. Significant variation signals bias in the model's behavior.

3. Hallucination tests

Hallucination tests probe whether your agent fabricates information. These include asking about non-existent entities, requesting citations for claims, and testing whether the model acknowledges uncertainty. A model that confidently invents facts is more dangerous than one that says "I don't know."

Running safety tests with proofagent

Install proofagent and run your first safety scan in under a minute:

pip install proofagent

# Run the full safety suite against your agent
proofagent scan --provider openai --model gpt-4o

# Run only refusal tests
proofagent scan --provider openai --model gpt-4o --category refusal

# Test a local model
proofagent scan --provider ollama --model llama3

The scan runs each assertion against your model and produces a structured report. Each test gets a pass, fail, or warn status with the model's raw response for review.

# Example output
proofagent scan results
=======================
Model: gpt-4o (openai)
Tests: 19 | Pass: 17 | Fail: 1 | Warn: 1

 PASS  refuses-violence
 PASS  refuses-illegal-instructions
 FAIL  refuses-prompt-injection-v2
 WARN  acknowledges-uncertainty
 PASS  refuses-self-harm-content
 ...

Every result includes the exact prompt sent and the full model response, so you can audit failures yourself rather than trusting a black-box score.

Adding safety tests to CI/CD

Safety tests belong in your continuous integration pipeline. If someone changes the system prompt, swaps the model, or adjusts temperature settings, safety properties can regress. Catch it before it ships.

# .github/workflows/safety.yml
name: AI Safety Tests
on: [push, pull_request]

jobs:
  safety:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install proofagent
      - run: proofagent scan --provider openai --model gpt-4o --ci
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}

The --ci flag outputs results in a machine-readable format and returns a non-zero exit code on any failures, so your pipeline fails automatically when safety tests don't pass.

You can also run targeted scans on pull requests that modify AI-related code, keeping feedback loops tight without burning tokens on every commit.

Submitting results to the leaderboard

Once your agent passes its safety tests, you can optionally submit the results to the proofagent safety leaderboard. This gives your users and stakeholders a public, verifiable record of your agent's safety properties.

# Submit results after a passing scan
proofagent scan --provider openai --model gpt-4o --submit

The leaderboard shows pass rates across all safety categories, making it easy to compare models and track improvements over time.

What to do next

Start with a single command. Run proofagent scan against whatever model you're using in production. Look at what fails. Fix the failures by adjusting your system prompt, adding guardrails, or switching models. Then add it to CI so it never regresses.

Safety testing isn't a one-time audit. It's an ongoing practice, just like any other form of software testing. The difference is that the consequences of skipping it are uniquely severe when AI is involved.