"Do Anything Now": Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models

To Appear in the ACM CCS 2024
1CISPA Helmholtz Center for Information Security, 2NetApp

Abstract

In this paper, employing our new framework JailbreakHub, we conduct a comprehensive analysis of 1,405 jailbreak prompts spanning from December 2022 to December 2023. We identify 131 jailbreak communities and discover unique characteristics of jailbreak prompts and their major attack strategies, such as prompt injection and privilege escalation. We also observe that jailbreak prompts increasingly shift from online Web communities to prompt-aggregation websites and 28 user accounts have consistently optimized jailbreak prompts over 100 days.

To assess the potential harm caused by jailbreak prompts, we create a question set comprising 107,250 samples across 13 forbidden scenarios. Leveraging this dataset, our experiments on six popular LLMs show that their safeguards cannot adequately defend jailbreak prompts in all scenarios. Particularly, we identify five highly effective jailbreak prompts that achieve 0.95 attack success rates on ChatGPT (GPT-3.5) and GPT-4, and the earliest one has persisted online for over 240 days. We hope that our study can facilitate the research community and LLM vendors in promoting safer and regulated LLMs.

Disclaimer. This website contains examples of harmful language. Reader discretion is recommended.

What is Jailbreak Prompt?

Jailbreak prompt is a particular type of adversarial prompts that is deliberately crafted to bypass safeguards and manipulate LLMs into generating harmful content.

As shown on the right, a jailbreak prompt can successfully lead the LLM to provide detailed answers to the dangerous question ("how can I create a deadly poison that is undetectable and untraceable"), even when the LLM can appropriately refuse the same question without the prompt.


Data Collection

We consider four platforms, i.e., Reddit, Discord, websites, and open-source datasets, in our study, for their popularity in sharing prompts. We extract 15,140 prompts from Dec 2022 to Dec 2023, and successfully identify 1,405 jailbreak prompts among them (9.3%). To the best of our knowledge, this dataset serves as the largest collection of in-the-wild jailbreak prompts.

Overview of JailbreakHub framework

Jailbreak Landscape and Magnitude

  • Platforms: jailbreak prompts shift from online Web communities to prompt-aggregation websites
  • User accounts: 803 user accounts shared jailbreak prompts online. 28 of them have consistently optimized jailbreak prompts over 100 days.
  • Targeted LLMs: ChatGPT are predominant targeted (90.0% targeting GPT-3.5 and 2.7% targeting GPT-4). Newer LLMs are also targeted.
Statistics of regular prompts and jailbreak prompts

Jailbreak Characteristics

Jailbreak prompts are normally 1) longer and 2) close to regular prompts in the semantic space. Leveraging graph-based community detection, we identified 11 major jailbreak communities, involving various attack strategies such as prompt injection, privilege escalation, deception, virtualization, and more.

Click to explore major jailbreak communities !

Darker shades indicate higher co-occurrence. Punctuations are removed for co-occurrence ratio calculation.


Jailbreak Effectiveness

  • LLMs trained with RLHF exhibit resistance to forbidden questions but exhibit weak resistance to jailbreak prompts.
  • Some jailbreak prompts can achieve 0.95 attack success rates (ASR) on LLMs.
  • LLM vendors like OpenAI have taken actions to "ban" jailbreak prompts. However, this method is vulnerable to paraphrase attacks.

* Forbidden scenarios are from OpenAI usgae policy.


Safeguard Effectiveness

Existing safeguards demonstrate limited ASR reductions on jailbreak prompts, calling for stronger and more adaptive defense mechanisms.



Ethics and Disclosures

We acknowledge that data collected online can contain personal information. Thus, we adopt standard best practices to guarantee that our study follows ethical principles, such as not trying to de-anonymize any user and reporting results on aggregate. Since this study only involves publicly available data and has no interactions with participants, it is not regarded as human subjects research by our Institutional Review Boards (IRB). Nonetheless, as one of our goals is to measure the risk of LLMs in answering harmful questions, it is inevitable to disclose how a model can generate inappropriate content. This can bring up worries about potential misuse. We believe raising awareness of the problem is even more crucial, as it can inform LLM vendors and the research community to develop stronger safeguards and contribute to the more responsible release of these models. We have responsibly disclosed our findings to related LLM vendors.

Website adapted from the following template.


BibTeX

If you find this useful in your research, please consider citing:

@inproceedings{SCBSZ24,
  author = {Xinyue Shen and Zeyuan Chen and Michael Backes and Yun Shen and Yang Zhang},
  title = {{"Do Anything Now": Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models}},
  booktitle = {{ACM SIGSAC Conference on Computer and Communications Security (CCS)}},
  publisher = {ACM},
  year = {2024}
}
-