Traditional pen testing doesn't cover agent-specific attack vectors. You can't nmap an agent's reasoning chain. You need tools built for this — and a methodology that tests each kill chain stage systematically.
Assessment & Red Teaming · security-teams From AgentDojo (ETH Zurich) — the most comprehensive agent security benchmark available. These numbers show why agent security testing requires dedicated tools.
Each serves a different purpose. Use them together for coverage.
97 tasks and 629 security test cases across email, Slack, banking, and travel agent suites. Tests indirect prompt injection — malicious text in tool outputs that causes unauthorized actions. Measures both attack success rate AND utility degradation.
Microsoft's open-source framework (v0.11.0, 3.4k stars). 20+ reusable attack strategies including single-turn, multi-turn, Crescendo (gradual escalation), and Tree of Attacks with Pruning (TAP). Now integrated into Azure AI Foundry as the "AI Red Teaming Agent."
NVIDIA's scanner — "nmap for LLMs." Four-component architecture: generators (connect to target), probes (attack vectors), detectors (classify responses), reporting (structured output). Covers prompt injection, jailbreaking, data leakage, toxicity, hallucination.
Meta's benchmark suite under Purple Llama (v1-v4). v3 is most agent-relevant: tests whether LLMs can autonomously execute offensive security operations. v4 adds CyberSOCEval (with CrowdStrike) for SOC automation testing and AutoPatchBench for vulnerability patching.
Meta's lightweight classifier for real-time prompt injection detection. Prompt Guard 2 (86M mDeBERTa-base / 22M DeBERTa-xsmall) uses binary classification (benign vs. malicious). The original Prompt Guard v1 used three classes (benign, injection, jailbreak). Deployable on CPU for real-time filtering. Fine-tunable to your data.
Researchers have demonstrated that Prompt Guard itself is vulnerable to prompt injection attacks — the detector can be bypassed. This is a defense-in-depth layer, not a silver bullet.
Center for AI Safety benchmark. The largest-scale comparison: 18 red teaming methods tested against 33 target LLMs and defenses. Key finding: no current attack or defense is uniformly effective. Robustness is independent of model size.
The most comprehensive methodology available (May 2025, 50+ contributors from CSA and OWASP). 12 threat categories including Agent Authorization Hijacking, Goal Manipulation, Multi-Agent Exploitation, Memory Poisoning, and Supply Chain Attacks. Five-phase process: preparation, execution, analysis, reporting — applied across all 12 categories.
This is the closest thing to a standard methodology for agent red teaming that exists today.
Released December 2025, 100+ contributors. Ten agent-specific risks: ASI01 (Agent Goal Hijack) through ASI10 (Rogue Agents). Covers identity, tools, delegated trust boundaries, and autonomous operation risks. Use this as your risk checklist — each ASI maps to specific test cases.
No claim without verifiable evidence — file path, command output, reproducible steps. Minimum evidence bar: 10+ findings, at least 1 architecture-level and 1 process-level finding. Severity rubric: Critical (data loss, account takeover confirmed by code path analysis), High (data exposure possible but mitigated), Medium (configuration weakness), Low (style deviation). Every finding must include impact, likelihood, evidence, fix, and validation steps.
This isn't a published standard — it's how I run security assessments. The point is that red teaming without evidence is just a conversation.
Map each kill chain stage to the right tool and test.
| Kill Chain Stage | What to Test | Tool |
|---|---|---|
| 01 RECON | Can the agent's tools, permissions, and system prompt be extracted? | Manual probing + Garak |
| 02 INJECT | Can indirect injection via tool responses change agent behavior? | AgentDojo + PyRIT |
| 03 HIJACK | Can the agent's goal be substituted through multi-turn conversation? | PyRIT (Crescendo, TAP) + CyberSecEval v3 |
| 04 ESCALATE | Can the agent access tools beyond its intended scope? | Manual + hook bypass testing |
| 05 EXFIL | Can the agent leak data through legitimate channels? | AgentDojo + manual output review |
| 06 PERSIST | Can memory or config files be poisoned for future sessions? | Manual + MCP security checks |
Red teaming is how you validate the kill chain's defensive controls. Combine it with hook-based guardrails (prevention) and MCP security (tool-layer defense) for defense in depth.
Detection patterns, governance guides, and more practitioner content coming.
This work represents the author's independent research and personal views. It is not related to or endorsed by the author's employer.