LLM Jailbreak (AML.T0054)

Description

Adversaries may induce a large language model (LLM) to ignore, circumvent, or override its safety/alignment behaviors and/or guardails to elicit outputs the model is intended to withhold. Once jailbroken, the LLM may be used in unintended ways by the adversary. Jailbreaks may be achieved via adversarial prompting, or by modifying model weights or safety mechanisms.

Adversaries may attempt a jailbreak for Defense Evasion of the LLM’s guidelines and guardrails itself to then reveal information (ex: LLM Data Leakage, Discover LLM System Information) or generate harmful content (ex: Generate Malicious Commands, Spearphishing via Social Engineering LLM). They may also jailbreak a model for Privilege Escalation to invoke tools or perform actions for their own purposes (ex: AI Agent Tool Invocation) or abuse the agent for a Command and Control channel (ex: AI Agent).

Adversaries use a variety of strategies to craft jailbreak prompts. Prompts may target specific models or model families and are iterated upon until successful. Model providers actively update their model guardrails to make them more resistant to jailbreak prompts as new prompts are developed. Common strategies [1] include but are not limited to:

Instruction override: Use phrasing that attempts to supersede prior constraints (e.g. “ignore previous instructions”).
Roleplay / persona switching: Instruct the LLM to adopt an identity or mode that allows unrestricted answers (e.g. “as a security researcher”).
Fictionalization and hypotheticals: Instruct the LLM to include disallowed content as part of a story, screenplay, or educational scenario.
Separate intent from content: request analysis, examples, templates, or edge cases, that implicitly contain disallowed content.
Multi-turn escalation / Crescendo: Utilize a sequence of prompts that start benign, establish trust, then gradually cross policy boundaries with incremental prompts.
Constrained output formats: Instruct the LLM to output to a strict schema or format (e.g. JSON, YAML, code, or tables).
Obfuscation and transformation: Use encoding, transformations, translation, or euphemisms, (e.g., base64 encoding, “describe it in another language”).
Create a high priority objective: Frame compliance as necessary to fulfill the user’s main task (e.g. “to complete the evaluation,” “to follow the spec,” “to follow safety guidelines”).

Adversaries may also use algorithmic approaches to generating jailbreak prompts [2] [3]. Algorithmic jailbreak generation allows for automated methods that discover jailbreaks at scale. Some approaches automate manual strategies [4] [5] [6] [7] while others optimize a string of tokens directly [8] to produce nonsensical text. Both black-box (applicable to commercial models where the adversary has only query access to the model) and white-box (applicable in the open-source setting, where the adversary has full access to the model weights) optimization approaches are viable.

Adversaries may also directly manipulate a model’s weights, or modify or remove parts of a model to create a jailbroken of “uncensored” variant of the target model. This is applicable to open-source models, or cases where the adversary gains full access to the target model. Approaches include fine-tuning to reduce refusals [9], targeted model editing [10], addition of adapters [11], and removing safety mechanisms such as guardrails.

Jailbreak prompts that are known to work on various classes of LLMs are often published in the open-source community [12]. Jailbroken or uncensored LLMs that have been trained or fine-tuned to be jailbroken are shared in public model registries such as huggingface [13].

Description

How GTK Cyber trains on this

Related techniques

Train your team on real adversarial-AI attacks.