Anthropic Uncovers Many-Shot Jailbreaking: AI Safety at Risk

News Case Study

by Jaspreet

2 years ago 0 141

Anthropic paper on Many-Shot Jailbreaking- AI Safety at Risk

In a groundbreaking discovery, the AI lab Anthropic has unveiled a new technique called “Many-shot jailbreaking” that can bypass the safety measures of large language models (LLMs). The paper, which has been making waves in the AI community, details how this method can manipulate AI responses and potentially lead to harmful consequences.

Many-Shot Jailbreaking: A New Threat to AI Safety

Many-shot jailbreaking is a technique that involves inserting a series of fabricated dialogues into the input to exploit the LLMs' in-context learning abilities. This feature enables LLMs to understand and apply new information or instructions presented within the prompt itself without any additional training or external data. However, the researchers at Anthropic have found that this learning method is a double-edged sword, making the models susceptible to manipulation through precisely crafted sequences of dialogues.

The discovery of many-shot jailbreaking is significant, especially as the capabilities of AI models such as Anthropic's Claude 3 become increasingly sophisticated. The researchers decided to publicize their findings due to a commitment to collective security improvement and to accelerate the development of strategies to counteract such vulnerabilities.

Read the complete Paper here

How Does Many-Shot Jailbreaking Work?

Many-shot jailbreaking capitalizes on the expanded context windows of LLMs. The context window refers to the maximum amount of text, measured in tokens, that the model can consider at one time when generating a response. Over the course of 2023, the context window of many LLMs has increased from around 4000 tokens to 10 million tokens, allowing users to draft significantly longer prompts and creating a new surface for attack.

The technique works by priming a model with a large number of harmful question-answer pairs and posing the intended question at the end. For instance, if an LLM is asked directly to provide instructions on how to create a bomb, it will refuse to answer due to its safety constraints. However, if the same model is first primed with a series of less harmful questions and their respective answers, such as “How to tie someone up?” or “How to make poison?”, and then finally asked about building a bomb, it is more likely to provide the dangerous information.

Many-Shot Jailbreaking Anthropic example — Img Source- Anthropic

The researchers at Anthropic tested this strategy against multiple LLMs like Llama2 (70B), Mistral (7B), GPT-3.5, GPT-4, and Claude 2.0. They found that a 128-shot prompt was sufficient to achieve a 100% success rate for all models.

Mitigation Strategies for Many-Shot Jailbreaking

Anthropic has already shared its findings with other AI labs and researchers to help develop mitigation strategies against many-shot jailbreaking. Some potential solutions include:

Limiting the context window size: This can reduce the effectiveness of the attack but may also negatively impact the model's performance on benign tasks.

Fine-tuning models to recognize and reject jailbreaking attempts: By training the model to identify and refuse to answer queries that resemble many-shot jailbreaking, the success rate of the attack can be significantly reduced.

Preprocessing inputs to detect and neutralize potential threats: By classifying and modifying prompts before they are passed to the model, the risk of jailbreaking can be minimized.

Anthropic has implemented some of these mitigations in its own AI model, Claude, reducing the attack success rate from 61% to just 2% in certain cases.

Reactions from the AI Community

The paper on many-shot jailbreaking has sparked a range of reactions from the AI community. Some have expressed concern about the potential misuse of AI technologies, while others have questioned the focus on censorship of LLMs.

A Reddit user in the r/singularity community said, “I was really hoping they didn’t figure this out. It's an issue that really needs to be solved before mass adoption by companies and robotics. It's fun for a chat sure, but for LLM's embedded in physical hardware it could be very problematic.”

Another Reddit user said, “I'm mildly frustrated that researcher brain-time is spent on limiting user-facing model usability, rather than improving capability and steerability.”

One user questioned the focus on censorship, saying, “Though some have concerns about issues such as jailbreaking LLMs, what the researchers never tackle is whether broad-scale censorship of LLMs should be further examined. If someone tricks an LLM into telling it how to pick locks — an example used by the researchers — so what? It’s not as if the information can’t be found elsewhere.”

The Fight Against AI Jailbreaking Continues

The discovery of many-shot jailbreaking has important implications for the future of AI safety. As AI models become increasingly powerful and versatile, it is crucial to develop robust and effective strategies to prevent their misuse.

The Anthropic researchers themselves acknowledged this dynamic in their paper, stating “We believe publishing this research is the right thing to do…we'd like to foster a culture where exploits like this are openly shared among LLM providers and researchers.”

While many-shot jailbreaking is a simple and concerning vulnerability, it also provides a stark reminder that safety and robustness must be top priorities as the AI field pushes further into powerful, general-purpose language models that could one day approach human-level abilities.

New Anthropic research paper: Many-shot jailbreaking.

We study a long-context jailbreaking technique that is effective on most large language models, including those developed by Anthropic and many of our peers.

Read our blog post and the paper here: https://t.co/6F03M8AgcA pic.twitter.com/wlcWYsrfg8
— Anthropic (@AnthropicAI) April 2, 2024

Anthropic, Many-Shot Jailbreaking

Read More

NYT Report: OpenAI Suffered Major Security Breach in 2023

NYT Report: OpenAI Suffered Major Security Breach in 2023

2 years ago

0 14

Chinese AI Innovations Shine at WAIC 2024 Amid US Restrictions

Chinese AI Innovations Shine at WAIC 2024 Amid US Restrictions

2 years ago

0 12

Nvidia Faces French Antitrust Charges Amid Global Scrutiny

Nvidia Faces French Antitrust Charges Amid Global Scrutiny

2 years ago

0 15

Leave a Reply Cancel reply