Exploring Jailbreak Attacks: Understanding LLM Vulnerabilities and the Challenges of Detection and Defense
The recent rise in jailbreaking methods has revealed the vulnerability of Large Language Models (LLMs) to malicious inputs. Jailbreaking differs from prompt injection attacks, although the terms are often used synonymously.
Prompt injection attacks aim to disrupt or alter an application’s intended functionality by manipulating input prompts, potentially leading to unexpected or harmful outputs. In contrast, a jailbreak attack focuses on bypassing the content safety policies and restrictions established by the model’s creators.
Both techniques exploit vulnerabilities in AI systems but target different aspects of model behavior and safeguards. While safeguards can complicate jailbreaking, hackers and hobbyists continually refine their prompt engineering to bypass defenses. Effective prompts are often shared online, creating an arms race where LLM developers enhance safeguards, and jailbreakers adapt their methods.
Additionally, prompt injections can sometimes be used to jailbreak an LLM, and jailbreaking tactics may facilitate successful prompt injections, as they share overlapping strategies. Jailbreaking techniques can also be exploited for privacy attacks, such as Membership Inference Attacks (MIA) or the exfiltration of personal information.
Why model is susceptible to jailbreak attacks
Despite strict safety alignment and red team procedures aimed at mitigating this vulnerability, new jailbreaking methods are continually being proposed.
While earlier research has primarily focused on increasing the success rates of these attacks, the fundamental mechanisms for protecting LLMs remain poorly understood. The study titled “Rethinking Jailbreaking through the Lens of Representation Engineering” investigates the vulnerabilities of safety-aligned LLMs by uncovering specific activity patterns within the representation space generated by these models. It highlights that the primary reason models refuse malicious queries with defensive responses (e.g., “I cannot provide …”) is that these queries trigger specific activation patterns within the models. These activation patterns are referred to as “safety patterns.” The safety patterns significantly contribute to the model’s ability to reject malicious queries. The robustness of LLMs against jailbreaking can be diminished or enhanced by attenuating or strengthening these identified safety patterns.
Jailbreak techniques
As LLMs become increasingly prevalent in real-world applications, research into jailbreaking has diversified, leading to a growing list of techniques. These techniques can be classified into three main categories: optimization-based techniques, side-channel communication, and distribution-based methods. Additionally, there are manual prompt attacks and automated attacks. Some techniques require black-box access to the model, while others necessitate white-box access.
These techniques can operate at various levels, including token, word, sentence, or prompt levels. Some methods involve using a proxy model, while others are based on gradient or heuristic approaches. We will examine a selection of these techniques to better understand the challenges of detecting and defending against such attacks.
Manual jailbreak attacks
Manual attacks are executed by human adversaries against applications integrated with LLMs. Examples can be found on community websites such as FlowGPT and JailbreakChat, which provide a large repository of manual attack prompts. Here, we offer a couple of examples that are still effective against ChatGPT.
Payload Splitting
Payload Splitting involves dividing the adversarial input into multiple parts, which are then combined and executed by the LLM.
Payload Splitting example for Scam Emails:

Virtualization
Virtualization entails “setting the scene” for the AI, akin to role prompting, which can simulate a specific task. Below is an example for a similar purpose as mentioned above.

Multilingual Prompts
In the paper Multilingual jailbreak challenges in large language models, the authors meticulously collect harmful queries in English and have them manually translated by native speakers into nine non-English languages, ranging from high-resource to low-resource. This effort led to the creation of the first multilingual jailbreak dataset, named MultiJail. The prompts in this dataset can be used directly for unintentional scenarios, while intentional scenarios are simulated by combining the prompts with malicious instructions in English.
Automated jailbreak attacks
To automate the generation of harmful prompts for jailbreaking, various tools have been developed. Numerous innovative techniques and algorithms have emerged, and the trend is on the rise. Here, we highlight a few of these methods with diagrams to provide a more intuitive understanding of jailbreak tactics.
Greedy Coordinate Gradient (GCG)
The alignment of large language models (LLMs) is crucial to ensure that these models behave in ways that are safe, ethical, and aligned with human values. However the alignment of large language models (LLMs) is not adversarially robust.
Researchers have developed an attack, Greedy Coordinate Gradient (GCG), that constructs a single adversarial prompt capable of consistently bypassing the alignment of state-of-the-art commercial models, such as ChatGPT, Claude, Bard, and Llama-2, without direct access to them.
In the paper [Universal and transferable adversarial attacks on aligned language models](https://arxiv.org/abs/2307.15043), adversarial suffixes are automatically produced using a Greedy Coordinate Gradient (GCG) algorithm. GCG construct attack suffixes by gradient optimization and postpend them into malicious queries to jailbreak models just as depicted in the below.
This adversarial prompt can induce harmful behaviors in these models with high probability, highlighting the potential for misuse. Greedy Coordinate Gradient (GCG), identifies universal and transferable prompts by optimizing against multiple smaller open-source LLMs for various harmful behaviors.

Full Completion Examples of Harmful Generations with adversarial prompt suffix.

Proxy-Guided Attack on LLMs (PAL)
The primary limitation of GCG is that it requires gradients, making it inapplicable to the proprietary LLMs are only accessible through an API. Therefore, there is a need for a method for evaluating the safety risks of proprietary LLMs behind black-box APIs.
The Proxy-Guided Attack on LLMs (PAL) is a query-based jailbreaking algorithm targeting black-box LLM APIs. It employs token-level optimization, guided by an open-source proxy model.
This attack is based on two key insights: First, gradients from an open-source proxy model are utilized to guide the optimization process, thereby reducing the number of queries to the target LLM. Second, a new loss function, specifically designed for commercial LLM APIs, is employed.
The Proxy-guided attack integrates aspects of a query-based attack, where the attacker iteratively queries the target model to enhance their attack. It also includes elements of a transfer attack, where an attack developed on one model is transferred to another model, and a model extraction attack, which seeks to replicate the behaviors of a proprietary model.
The main steps of PAL algorithm are:
1. Initialize the adversarial suffix
2. Use the proxy model to compute gradients and evaluate candidates
3. Query the target model with the K candidates from the previous step to obtain the target model’s loss and response
4. Optional fine-tune the proxy model on the response generated by the target model. The goal is to make the proxy model even more similar to the target model locally
5. Select the best candidate for the next step

Successful APL attack on GPT-3.5-Turbo-0613:

This approach surpasses the performance of state-of-the-art red-teaming LLMs while being more cost-effective.
Rewriting and Nesting for Large Language Models (ReNeLLM)
The aforementioned methods require substantial time to identify the optimal suffix and demonstrate lower efficacy on some commercial LLMs. To address these limitations, ReNeLLM, introduced in A Wolf in Sheep’s Clothing: Generalized Nested Jailbreak Prompts can Fool Large Language Models Easily rewrites malicious queries at sentence or word levels and nests them into specific scenario templates, resulting in significant performance improvements.
ReNeLLM (Rewriting and Nesting for Large Language Models) is an automatic framework, that leverages LLMs themselves to generate effective jailbreak prompts.
ReNeLLM consists of two primary steps:
1. Prompt Rewriting: This step involves various rewriting operations on the initial prompt without altering its core meaning. Techniques such as paraphrasing with fewer words or changing the expression style are used to make it easier to elicit a response from LLMs.
2. Scenario Nesting: To make the rewritten prompts more covert, they are embedded into specific task scenarios (e.g., code completion, text continuation, table filling). This engages the LLMs to identify effective jailbreak attack prompts.
The below is an example of jailbreak prompt from ReNeLLM, which employs prompt rewriting and scenario nesting (in this case, code completion), leads the LLM to produce unsafe responses.

Overview of the ReNeLLM framework:
The final jailbreak prompt is created in two stages, beginning with an initial prompt. The first stage is prompt rewriting, where various rewriting techniques are applied to the initial prompt while maintaining its core meaning. The second stage involves scenario nesting, which improves stealth by integrating the rewritten prompt into three universal task scenarios.
This entire process is automated and does not necessitate any further training or optimization.

CipherChat
There is another significant development in the realm of adversarial attacks is CipherChat. The authors of GPT-4 IS TOO SMART TO BE SAFE: STEALTHY CHAT WITH LLMS VIA CIPHER introduce a new framework, CipherChat, that allows humans to interact with LLMs using cipher prompts enhanced by system role descriptions and few-shot enciphered examples.
Conversing with ChatGPT through ciphers may result in unsafe behaviors.

Overview of CipherChat:
The process consists of three main steps: constructing the system prompt, enciphering the input instructions, and deciphering the LLM’s responses. The core concept is to prevent the LLM from engaging with any natural language, restricting it to processing encrypted inputs and generating encrypted outputs, thereby bypassing the safety alignment.

Persuasive Adversarial Prompt (PAP)
The Persuasive Adversarial Prompt (PAP) attack in How Johnny Can Persuade LLMs to Jailbreak Them: Rethinking Persuasion to Challenge AI Safety by Humanizing LLMs is a novel method for jailbreaking large language models (LLMs) by leveraging persuasive techniques.
Unlike traditional algorithm-focused attacks, PAP uses a taxonomy of persuasion derived from social science research to craft adversarial prompts that can bypass the safety mechanisms of LLMs. This approach treats LLMs as human-like communicators, making the attack more effective.
The study, as demonstrated below, introduces a taxonomy-guided approach for systematically creating human-readable persuasive adversarial prompts (PAP). This method aims to improve the understanding of the risks linked to human-like communication. The persuasion taxonomy is designed to connect social science with AI safety research, establishing a foundation for future investigations to more effectively explore the safety risks that everyday users may face.

Overview of the Taxonomy-Guided Persuasive Adversarial Prompt (PAP) Generation Method:
A. Training the Persuasive Paraphraser:
1. Data Acquisition: Various techniques (such as in-context prompting and fine-tuning a paraphraser) along with the persuasion taxonomy are utilized to transform plain harmful queries into high-quality PAP for training purposes.
2. Fine-Tuning: This training data is then used to refine a persuasive paraphraser that can reliably paraphrase harmful queries.
B. Deploying the Persuasive Paraphraser:
1. PAP Generation: The fine-tuned persuasive paraphraser generates PAP for new harmful queries, employing a designated persuasion technique.
2. Evaluation: A GPT4-Judge is used to evaluate the harmfulness of the outputs produced by the target model.

Final words
The attack techniques mentioned are just the tip of the iceberg, as individuals continue to devise creative attacks that often exceed our expectations. The emergence of numerous novel jailbreak techniques presents a significant challenge for detection and defense.
While there are some methods available for detecting adversarial prompts, such as perplexity filters, and defensive strategies like paraphrasing, retokenization, and semantic smoothing, no comprehensive solutions exist that can tackle all jailbreak detection challenges, even when various methods are combined.
Understanding the inner workings of large language models (LLMs) is crucial for effectively countering these attacks, yet this remains a considerable concern for us. A clear comprehension of how LLMs operate is essential for developing robust defenses and strategies to mitigate the risks posed by these evolving threats.
This underscores the necessity for ongoing research and collaboration in the field to strengthen the capabilities in detecting and defending against jailbreak attempts.