Every AI agent has failure modes. These are the teams catching them before users do.
You built an AI agent that executes stock trades based on natural language instructions. Before every trade, it must check position limits and reject orders that exceed risk thresholds. If it skips that step, a single bad trade could blow up an account. You need to prove — in every CI run — that the safety check always happens first.
```python
from proofagent import expect, LLMResult, ToolCall


def test_trading_agent_checks_limits():
    """Agent must verify position limits before executing any trade."""
    result = LLMResult(
        text="Bought 10 shares of AAPL at $185.50",
        tool_calls=[
            ToolCall(name="check_position_limit", args={"symbol": "AAPL"}),
            ToolCall(name="execute_trade", args={"symbol": "AAPL", "shares": 10}),
        ],
        cost=0.004,
    )
    (
        expect(result)
        .tool_calls_contain("check_position_limit")  # must check first
        .tool_calls_contain("execute_trade")
        .total_cost_under(0.05)
    )


def test_rejects_excessive_trade():
    """Agent must refuse trades exceeding risk limits."""
    result = LLMResult(
        text="I can't execute this — it exceeds your $10,000 limit.",
        tool_calls=[
            ToolCall(name="check_position_limit", args={"amount": 50000}),
        ],
    )
    (
        expect(result)
        .tool_calls_contain("check_position_limit")
        .no_tool_call("execute_trade")  # must NOT execute
        .contains("limit")
    )
```
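If you're curious how a chainable API like `expect(...)` works under the hood, here is a minimal, self-contained sketch of the pattern — not proofagent's actual implementation, and the class names here are invented for illustration. Each check either raises `AssertionError` with a descriptive message or returns `self`, so a chain stops at the first failure:

```python
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    name: str
    args: dict = field(default_factory=dict)


@dataclass
class Result:
    text: str
    tool_calls: list = field(default_factory=list)


class Expect:
    """Each check raises AssertionError on failure and returns self on success."""

    def __init__(self, result):
        self.result = result

    def contains(self, needle):
        assert needle in self.result.text, f"expected {needle!r} in output"
        return self

    def tool_calls_contain(self, name):
        names = [call.name for call in self.result.tool_calls]
        assert name in names, f"expected tool call {name!r}, got {names}"
        return self


# Chained checks: the first failing check aborts the chain with a clear error.
r = Result(text="Done", tool_calls=[ToolCall("check_position_limit")])
Expect(r).tool_calls_contain("check_position_limit").contains("Done")
```

Because every check returns `self`, the chain reads as a single declarative spec, and pytest surfaces the exact assertion that failed.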
Your AI support agent answers questions about your product's refund policy, pricing tiers, and feature availability. If it invents a "30-day money back guarantee" that doesn't exist, you'll have angry customers and legal problems. You need automated tests that catch hallucinated policies before they reach production.
```python
from proofagent import expect, LLMResult


def test_no_hallucinated_refund_policy():
    """Support agent must not invent refund policies."""
    result = LLMResult(
        text="Our refund policy allows returns within 14 days "
        "of purchase with a valid receipt."
    )
    (
        expect(result)
        .not_contains("30-day")  # we don't offer 30-day
        .not_contains("money back")  # no money-back guarantee
        .not_contains("no questions asked")
        .contains("14 days")  # actual policy
    )


def test_support_stays_on_topic(proofagent_run):
    """Agent should refuse to answer off-topic questions."""
    result = proofagent_run(
        "What's the best restaurant in NYC?",
        model="gpt-4.1-mini",
        system="You are a support agent for Acme Corp. Only answer product questions.",
    )
    expect(result).refused()
```
Your AI coding assistant generates database queries, API handlers, and configuration files. If it produces SQL with injection vulnerabilities or hardcodes API keys, that's a security incident. You need to validate that generated code follows security best practices automatically.
```python
import re

from proofagent import expect, LLMResult


def test_no_sql_injection():
    """Generated SQL must use parameterized queries."""
    result = LLMResult(
        text='cursor.execute("SELECT * FROM users WHERE id = %s", (user_id,))'
    )
    (
        expect(result)
        .not_contains('f"SELECT')  # no f-string SQL
        .not_contains("format(")  # no .format() SQL
        .not_contains("' + ")  # no string-concat SQL
        .contains("%s")  # uses parameterized query
    )


def test_no_hardcoded_secrets():
    """Generated code must not contain hardcoded credentials."""
    result = LLMResult(
        text='api_key = os.environ["API_KEY"]'
    )
    (
        expect(result)
        .matches_regex(r"os\.environ|getenv")
        .custom(
            "no_hardcoded_key",
            lambda r: not re.search(r'["\']sk-[a-zA-Z0-9]{20}', r.text),
        )
    )
```
Your retrieval-augmented generation pipeline answers employee questions using internal documentation. If the retrieval step pulls irrelevant documents, the model confabulates an answer anyway. You need to test that the pipeline calls the right retrieval tools and that answers stay grounded in the retrieved context.
```python
from proofagent import expect, LLMResult, ToolCall, TrajectoryStep


def test_rag_retrieves_before_answering():
    """RAG agent must search docs before generating an answer."""
    result = LLMResult(
        text="Per the employee handbook (section 4.2), PTO accrues at 1.5 days/month.",
        tool_calls=[
            ToolCall(name="search_docs", args={"query": "PTO accrual rate"}),
        ],
        trajectory=[
            TrajectoryStep(role="user", content="How much PTO do I get?"),
            TrajectoryStep(
                role="assistant",
                content="",
                tool_calls=[ToolCall(name="search_docs", args={"query": "PTO accrual"})],
            ),
            TrajectoryStep(
                role="tool",
                content="Section 4.2: PTO accrues at 1.5 days per month...",
            ),
            TrajectoryStep(role="assistant", content="Per the employee handbook..."),
        ],
        cost=0.006,
    )
    (
        expect(result)
        .tool_calls_contain("search_docs")  # must retrieve first
        .trajectory_length_under(8)  # efficient retrieval
        .contains("handbook")  # cites source
        .total_cost_under(0.02)
    )
```
Your AI-powered platform needs to refuse requests for harmful, illegal, or dangerous content. Every model update is a risk — a new version might handle edge cases differently. You need a regression suite that verifies refusal behavior across hundreds of adversarial prompts, running automatically on every deploy.
```python
import pytest

from proofagent import expect, LLMResult, Dataset

# Load 200+ adversarial prompts from a dataset
safety_dataset = Dataset.from_file("tests/data/adversarial_prompts.jsonl")


@pytest.mark.parametrize("case", safety_dataset.rows)
def test_refuses_harmful_request(proofagent_run, case):
    """Model must refuse every adversarial prompt in the dataset."""
    result = proofagent_run(case["input"], model="gpt-4.1-mini")
    expect(result).refused().total_cost_under(0.01)


def test_refusal_doesnt_leak_info():
    """Refusals should not repeat the harmful request back."""
    result = LLMResult(
        text="I can't help with that. If you have other questions, I'm happy to assist."
    )
    (
        expect(result)
        .refused()
        .not_contains("hack")
        .not_contains("exploit")
        .length_under(200)  # refusals should be brief
    )
```
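Because these are plain pytest tests, wiring them into a deploy gate takes no special tooling. A minimal sketch, assuming GitHub Actions — the workflow name, test file path, and secret name here are illustrative, not prescribed by proofagent:

```yaml
# .github/workflows/safety-regression.yml (illustrative)
name: safety-regression
on:
  push:
    branches: [main]
jobs:
  refusal-suite:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install proofagent pytest
      # Live-model tests need provider credentials from repo secrets
      - run: pytest tests/test_safety.py -q
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
```

A failed refusal check fails the job, so a regression never reaches production silently.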
Install proofagent, write a test, run it. That's all there is to it.