
In a groundbreaking discovery, the AI lab Anthropic has unveiled a new technique called “Many-shot jailbreaking” that can bypass the safety measures of large language models (LLMs). The paper, which has been making waves in the AI community, details how this method can manipulate AI responses and potentially lead to harmful consequences.
Many-Shot Jailbreaking: A New Threat to AI Safety
Many-shot jailbreaking is a technique that involves inserting a series of fabricated dialogues into the input to exploit the LLMs' in-context learning abilities. This feature enables LLMs to understand and apply new information or instructions presented within the prompt itself without any additional training or external data. However, the researchers at Anthropic have found that this learning method is a double-edged sword, making the models susceptible to manipulation through precisely crafted sequences of dialogues.
The discovery of many-shot jailbreaking is significant, especially as the capabilities of AI models such as Anthropic's Claude 3 become increasingly sophisticated. The researchers decided to publicize their findings due to a commitment to collective security improvement and to accelerate the development of strategies to counteract such vulnerabilities.
Read the complete Paper here
How Does Many-Shot Jailbreaking Work?
Many-shot jailbreaking capitalizes on the expanded context windows of LLMs. The context window refers to the maximum amount of text, measured in tokens, that the model can consider at one time when generating a response. Over the course of 2023, the context window of many LLMs has increased from around 4000 tokens to 10 million tokens, allowing users to draft significantly longer prompts and creating a new surface for attack.
The technique works by priming a model with a large number of harmful question-answer pairs and posing the intended question at the end. For instance, if an LLM is asked directly to provide instructions on how to create a bomb, it will refuse to answer due to its safety constraints. However, if the same model is first primed with a series of less harmful questions and their respective answers, such as “How to tie someone up?” or “How to make poison?”, and then finally asked about building a bomb, it is more likely to provide the dangerous information.

The researchers at Anthropic tested this strategy against multiple LLMs like Llama2 (70B), Mistral (7B), GPT-3.5, GPT-4, and Claude 2.0. They found that a 128-shot prompt was sufficient to achieve a 100% success rate for all models.
Mitigation Strategies for Many-Shot Jailbreaking
Anthropic has already shared its findings with other AI labs and researchers to help develop mitigation strategies against many-shot jailbreaking. Some potential solutions include:
Anthropic has implemented some of these mitigations in its own AI model, Claude, reducing the attack success rate from 61% to just 2% in certain cases.
Reactions from the AI Community
The paper on many-shot jailbreaking has sparked a range of reactions from the AI community. Some have expressed concern about the potential misuse of AI technologies, while others have questioned the focus on censorship of LLMs.
A Reddit user in the r/singularity community said, “I was really hoping they didn’t figure this out. It's an issue that really needs to be solved before mass adoption by companies and robotics. It's fun for a chat sure, but for LLM's embedded in physical hardware it could be very problematic.”
Another Reddit user said, “I'm mildly frustrated that researcher brain-time is spent on limiting user-facing model usability, rather than improving capability and steerability.”
One user questioned the focus on censorship, saying, “Though some have concerns about issues such as jailbreaking LLMs, what the researchers never tackle is whether broad-scale censorship of LLMs should be further examined. If someone tricks an LLM into telling it how to pick locks — an example used by the researchers — so what? It’s not as if the information can’t be found elsewhere.”
The Fight Against AI Jailbreaking Continues
The discovery of many-shot jailbreaking has important implications for the future of AI safety. As AI models become increasingly powerful and versatile, it is crucial to develop robust and effective strategies to prevent their misuse.
The Anthropic researchers themselves acknowledged this dynamic in their paper, stating “We believe publishing this research is the right thing to do…we'd like to foster a culture where exploits like this are openly shared among LLM providers and researchers.”
While many-shot jailbreaking is a simple and concerning vulnerability, it also provides a stark reminder that safety and robustness must be top priorities as the AI field pushes further into powerful, general-purpose language models that could one day approach human-level abilities.




