What is the difference between AI red-teaming and using AI for red-teaming?

AI red-teaming means adversarially testing AI systems for vulnerabilities and misuse paths. Using AI for red-teaming means using AI tools to support traditional offensive security operations. The confusion between the two is part of why ownership debates stall: AI red-teaming is a security discipline, not an AI engineering function.

Which MITRE ATLAS techniques should a security team prioritize when starting AI red-teaming?

Start with AML.T0051 (LLM Prompt Injection) and AML.T0054 (LLM Jailbreak) because they apply to almost every LLM deployment. From there, prioritize based on what the deployed system can do: if an AI agent has tool access (API calls, database queries, email), prompt injection combined with excessive agency (OWASP LLM08) is the highest-impact attack class. For systems with RAG pipelines, indirect injection via poisoned retrieval content is the most realistic external threat. Model extraction (AML.T0013) and model evasion (AML.T0015) are higher-value targets for ML-based detection systems than for general-purpose LLMs.

What tools do security teams use for AI red-teaming?

Garak (github.com/NVIDIA/garak) is an LLM vulnerability scanner that probes for prompt injection, jailbreaks, and data leakage. Run garak --model_type openai --model_name gpt-4o --probes promptinjection to test an OpenAI-compatible endpoint in minutes. PyRIT (github.com/Azure/PyRIT) is Microsoft's Python toolkit for structured multi-turn red team campaigns with scoring and reporting. For RAG and indirect injection testing, HouYi (github.com/LLMSecurity/HouYi) provides systematic tooling. The combination of automated probing with Garak and manual architectural testing covers both systematic and structural weaknesses.

What AI knowledge does a security practitioner need before starting AI red-teaming?

You need to understand how LLMs structure input: the role of the system prompt, how user messages and retrieved content are assembled in the context window, and how the model prioritizes competing instructions. For RAG systems, understand how documents are chunked, indexed, and injected into context. If the system has tool use, understand what tools are granted and what permissions they carry. You do not need to understand model training, loss functions, or neural network architecture to red-team LLM applications.

Why Security Teams Should Own AI Red-Teaming

The debate about who owns AI red-teaming usually gets settled by org chart proximity: the AI team built the system, so the AI team should test it. That logic produces the wrong answer.

AI red-teaming belongs to the security team. Not because security practitioners know more about machine learning, but because they already have what is hardest to teach: an adversarial mindset built around finding how systems fail when someone actively tries to break them.

What AI Red-Teaming Actually Is

AI red-teaming is adversarial testing with a different target surface. The question isn’t whether the system performs well. It’s what an attacker can make the system do that the developer didn’t intend.

That framing is identical to any red team engagement. Find the trust boundaries. Identify inputs the developer assumed would be well-formed. Submit inputs they didn’t anticipate. Probe the gap between “this system should never do X” and “here is the condition under which it does.”

The vocabulary is different. The attack surface is different. The thought process is not.

Why the AI Team Defaults to the Wrong Questions

AI engineers optimize for capability. They measure success by how well the system answers questions, generates content, or takes actions. That’s the right optimization for building.

Adversarial testing requires a different metric: how badly does the system fail when someone deliberately tries to break it? AI teams testing their own models tend to evaluate safety policy boundaries: will the model produce harmful content? That’s a meaningful question. It’s not the right starting question for a security evaluation.

Security teams ask the second set of questions naturally: can an attacker use this model to exfiltrate data from the retrieval pipeline? Can injected instructions in a document cause the agent to take unauthorized actions? Can a low-frequency attacker stay inside the system’s statistical baseline long enough to extract something valuable?

This isn’t a criticism of AI teams. You don’t ask a software developer to QA their own code for injection vulnerabilities either. The skills overlap; the incentive structure doesn’t.

What Security Teams Already Have

Threat modeling transfers directly. An attacker embedding malicious instructions in a document retrieved by an LLM (MITRE ATLAS AML.T0051) is exploiting a data-flow trust boundary. A security engineer who has modeled SQL injection attack chains, XML external entity attacks, or server-side request forgery will recognize the underlying pattern immediately. The specific syntax differs. The analysis model does not.

Lateral movement intuition applies to agent deployments. If an LLM with tool access can be prompted into calling an API it shouldn’t call, that’s a privilege escalation path. If it can be prompted into sending email on the user’s behalf, that’s an action the attacker controls without direct system access. Security practitioners recognize these as classical access control failures.

Supply chain thinking applies to RAG pipelines. Which external data sources does the system retrieve from? Who can write to those sources? Can an attacker introduce content that shifts the model’s behavior when processed? These are supply chain trust questions security teams have been asking about software dependencies for years.

The OWASP Top 10 for LLM Applications covers prompt injection (LLM01), insecure output handling (LLM02), and excessive agency (LLM08). A practitioner familiar with the OWASP Web Application Security Testing Guide will recognize the vulnerability patterns under different names.

The Specific Knowledge Gap

The argument isn’t that security teams need no AI education. They need specific education. The gap is bounded:

LLM context structure: How system prompts, user messages, and retrieved content are assembled into the model’s context window. Understanding this is required for designing injection payloads and predicting how the model will prioritize competing instructions.
RAG architecture: How retrieval-augmented generation systems index, chunk, and inject content into context. Any content indexed from an uncontrolled external source is a potential injection vector. The attack surface of a RAG deployment is fundamentally different from a pure-inference deployment.
Tool use and agent permissions: When a model can call APIs, query databases, or execute code, the output is executable. The security stakes scale directly with the permissions granted to those tools.
Probabilistic evaluation methodology: LLM outputs are non-deterministic. A finding that works 4 out of 10 attempts is still a finding. PyRIT (Microsoft’s Python Risk Identification Toolkit) structures multi-turn attacks and scores results across runs. Garak (NVIDIA’s LLM vulnerability scanner) automates probe sets for prompt injection, jailbreaks, and data leakage.

None of this requires a machine learning background. It requires understanding system architecture well enough to reason about the attack surface. Security teams do that routinely for systems they didn’t build.

Where to Start

Pick one AI deployment in your environment. Document its architecture: which model, what system prompt, what retrieval sources, what tool permissions. Build a scope document the way you would for any red team engagement.

Start with prompt injection. Run:

garak --model_type openai --model_name gpt-4o --probes promptinjection

Against any OpenAI-compatible endpoint, this runs a series of injection probes and returns which categories succeed. That gives you a baseline before you write a single custom payload.

Map your findings to MITRE ATLAS. The taxonomy covers adversarial techniques targeting ML systems: prompt injection (AML.T0051), jailbreaks (AML.T0054), model extraction (AML.T0013), data poisoning (AML.T0020). Tracking findings to ATLAS gives you a structured way to communicate scope and coverage to stakeholders, the same way MITRE ATT&CK does for traditional red team reports.

GTK Cyber’s AI red-teaming training is built specifically for security practitioners, starting from the adversarial mindset they already have and covering the LLM attack surface and tooling that’s new to them.