Make Them Spill the Beans! Coercive Knowledge Extraction from (Production) LLMs (2312.04782v1)

Published 8 Dec 2023 in cs.CR and cs.LG

Abstract: LLMs are now widely used in various applications, making it crucial to align their ethical standards with human values. However, recent jail-breaking methods demonstrate that this alignment can be undermined using carefully constructed prompts. In our study, we reveal a new threat to LLM alignment when a bad actor has access to the model's output logits, a common feature in both open-source LLMs and many commercial LLM APIs (e.g., certain GPT models). It does not rely on crafting specific prompts. Instead, it exploits the fact that even when an LLM rejects a toxic request, a harmful response often hides deep in the output logits. By forcefully selecting lower-ranked output tokens during the auto-regressive generation process at a few critical output positions, we can compel the model to reveal these hidden responses. We term this process model interrogation. This approach differs from and outperforms jail-breaking methods, achieving 92% effectiveness compared to 62%, and is 10 to 20 times faster. The harmful content uncovered through our method is more relevant, complete, and clear. Additionally, it can complement jail-breaking strategies, which further boosts attack performance. Our findings indicate that interrogation can extract toxic knowledge even from models specifically designed for coding tasks.

Summary

The paper introduces 'model interrogation,' a method that uses access to output logits to extract concealed toxic content from LLMs.
The paper reports that this technique achieves a 92% success rate, compared with 62% for jail-breaking approaches, and runs 10–20 times faster.
The paper shows that even production LLMs, including models designed for coding, remain vulnerable, highlighting the need for enhanced security measures.

The paper "Make Them Spill the Beans! Coercive Knowledge Extraction from (Production) LLMs" examines vulnerabilities in LLMs concerning their alignment with human ethical standards. The authors address a significant security concern: the potential extraction of harmful or unwanted content from LLMs even when they are designed to reject such requests.

The paper's core contribution is a method called "model interrogation." Unlike traditional jail-breaking techniques, which craft specific prompts to manipulate the model's responses, this method relies on access to the model's output logits. When the LLM refuses a toxic request, a potentially harmful response is often still present, hidden among the lower-ranked candidates in these logits. Interrogation forcibly selects lower-ranked tokens at a few critical positions during auto-regressive generation, which coerces the model into producing the concealed output.
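The decoding mechanics described above can be illustrated with a deliberately tiny toy. The sketch below is not the authors' implementation: `toy_next_token_logits` is a hypothetical hand-written lookup table standing in for a model, the vocabulary is harmless, and the position at which a lower-ranked token is forced is picked by hand (this summary does not describe how the paper identifies critical positions). It only shows how overriding the top-ranked token at one position changes the rest of an auto-regressive continuation.

```python
from typing import Callable, Dict, List, Optional, Sequence

Logits = Dict[str, float]


def toy_next_token_logits(prefix: Sequence[str]) -> Logits:
    """Hypothetical stand-in for a model: maps a prefix to next-token logits."""
    table = {
        (): {"no": 3.0, "yes": 2.0, "<eos>": 0.1},
        ("no",): {"<eos>": 4.0, "thanks": 1.0},
        ("no", "thanks"): {"<eos>": 3.0},
        ("yes",): {"please": 2.5, "<eos>": 1.0},
        ("yes", "please"): {"<eos>": 3.0},
    }
    # Unknown prefixes simply end the sequence.
    return table.get(tuple(prefix), {"<eos>": 0.0})


def decode(
    next_logits: Callable[[Sequence[str]], Logits],
    forced_ranks: Optional[Dict[int, int]] = None,
    max_len: int = 8,
) -> List[str]:
    """Greedy decoding, except that forced_ranks[pos] = k picks the
    k-th ranked token (0 = top) at output position pos."""
    forced_ranks = forced_ranks or {}
    out: List[str] = []
    for pos in range(max_len):
        logits = next_logits(out)
        # Rank tokens by descending logit; ties are broken alphabetically.
        ranked = sorted(logits, key=lambda tok: (-logits[tok], tok))
        rank = min(forced_ranks.get(pos, 0), len(ranked) - 1)
        token = ranked[rank]
        if token == "<eos>":
            break
        out.append(token)
    return out


greedy = decode(toy_next_token_logits)
forced = decode(toy_next_token_logits, forced_ranks={0: 1})
print(greedy)
print(forced)
```

In this toy, plain greedy decoding follows the top-ranked branch, while forcing rank 1 at position 0 and then resuming greedy decoding follows a different branch that was present in the logits all along. A real system would replace the lookup table with a model's logits, which is exactly the access the paper identifies as the risk.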

The authors report that interrogation is both more effective and more efficient than jail-breaking: it extracts toxic content with a 92% success rate, compared with 62% for jail-breaking methods, and runs 10 to 20 times faster. The extracted content is also described as more relevant, complete, and clear, which increases the threat to LLM alignment.

The method is effective on its own, and the paper indicates that it can also be combined with existing jail-breaking approaches to further improve extraction performance. Notably, even LLMs developed specifically for coding tasks are not immune to this type of coercive extraction.

Overall, this paper underscores a critical vulnerability in LLMs that necessitates attention, especially for applications relying on the ethical alignment of these models. The findings advocate for improved security measures and better handling of LLM output to mitigate the risks associated with unauthorized content extraction.