Hidden in Plain Sight: Exploring Chat History Tampering in Interactive Language Models (2405.20234v3)

Published 30 May 2024 in cs.AI

Abstract: LLMs such as ChatGPT and Llama have become prevalent in real-world applications, exhibiting impressive text generation performance. LLMs are fundamentally developed from a scenario where the input data remains static and unstructured. To behave interactively, LLM-based chat systems must integrate prior chat history as context into their inputs, following a pre-defined structure. However, LLMs cannot separate user inputs from context, enabling chat history tampering. This paper introduces a systematic methodology to inject user-supplied history into LLM conversations without any prior knowledge of the target model. The key is to utilize prompt templates that can well organize the messages to be injected, leading the target LLM to interpret them as genuine chat history. To automatically search for effective templates in a WebUI black-box setting, we propose the LLM-Guided Genetic Algorithm (LLMGA) that leverages an LLM to generate and iteratively optimize the templates. We apply the proposed method to popular real-world LLMs including ChatGPT and Llama-2/3. The results show that chat history tampering can enhance the malleability of the model's behavior over time and greatly influence the model output. For example, it can improve the success rate of disallowed response elicitation up to 97% on ChatGPT. Our findings provide insights into the challenges associated with the real-world deployment of interactive LLMs.

Summary

The paper presents a two-stage attack method combining acceptance elicitation with word anonymization to bypass LLM safety measures.
It shows that LLMs misinterpret fabricated chat history, achieving over 90% attack success on some models, though lower for GPT-4 and Llama-2.
The study highlights the need for robust input filtering, context-aware safety training, and architectural redesign to distinguish trusted inputs.

This paper investigates a vulnerability in LLMs used in interactive chat applications, termed "context injection attacks." It highlights that while LLMs are designed for static, unstructured text, chat systems integrate chat history (context) into model inputs using structured formats like Chat Markup Language (ChatML). This process exposes vulnerabilities because LLMs process this structured input semantically rather than syntactically, making it difficult for them to distinguish between legitimate context provided by the system and malicious content injected by a user formatted to mimic context.

The core problem arises from two factors:

User-supplied context dependency: Systems allowing API access let users directly provide chat history, enabling straightforward injection of misleading context (2405.20234).
Parsing limitation: Even with restricted WebUI access, attackers can embed fabricated context within their current user message. The LLM processes the entire input semantically and fails to strictly separate the system-defined history from the user's current message content, potentially misinterpreting the embedded fabrication as genuine history (2405.20234).

The paper proposes a two-stage methodology for context injection attacks aimed at bypassing safety measures and eliciting disallowed responses (e.g., harmful instructions):

Context Fabrication: Crafting the misleading chat history content. Two strategies are introduced:
- Acceptance Elicitation: This involves creating a fake multi-turn chat history where the "assistant" role appears to have already agreed to the user's (harmful) request in previous turns. A typical structure involves: (1) User initializes the request (potentially using Chain-of-Thought prompting), (2) Attacker crafts an assistant message acknowledging and agreeing, (3) Attacker crafts a user message acknowledging the agreement and prompting continuation. This manipulates the LLM's tendency to maintain conversational consistency, making it more likely to fulfill the request in the current turn.
- Word Anonymization: This strategy aims to reduce the perceived sensitivity of a harmful request by replacing potentially triggering words with neutral placeholders (e.g., "illegal activity" becomes "activity A"). The process involves:
  - Identifying candidate sensitive words (verbs, nouns, adjectives, adverbs, excluding whitelisted words).
  - Measuring sensitivity using sentence similarity (BERT embeddings) between the original sentence and the sentence with the candidate word removed. Blacklisted words get maximum sensitivity.
  - Selecting the top p% most sensitive words for replacement.
  - Crafting context (using the acceptance elicitation structure) that establishes an agreement between the user and assistant to use these anonymized notations. The final harmful response, using placeholders, can be de-anonymized by the attacker.

# Simplified Pseudocode for Word Anonymization Sensitivity
import numpy as np
from sentence_transformers import SentenceTransformer
from nltk import pos_tag, word_tokenize

BERT_MODEL = SentenceTransformer('bert-base-nli-mean-tokens')
BLACKLIST = {"illegally", "harmful", ...} # Predefined set
WHITELIST = {"step-by-step", "guide", ...} # Predefined set
CONTENT_POS = {'NN', 'NNS', 'JJ', 'JJR', 'JJS', 'RB', 'RBR', 'RBS', 'VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ'}

def get_candidate_words(sentence):
    tokens = word_tokenize(sentence.lower())
    tagged_tokens = pos_tag(tokens)
    candidates = set()
    for word, tag in tagged_tokens:
        if tag in CONTENT_POS and word not in WHITELIST:
            candidates.add(word)
    return list(candidates)

def calculate_sensitivity(sentence, word):
    if word in BLACKLIST:
        return 1.0 # Max sensitivity

    original_embedding = BERT_MODEL.encode([sentence])[0]
    sentence_without_word = sentence.replace(word, "")
    modified_embedding = BERT_MODEL.encode([sentence_without_word])[0]

    # Cosine similarity calculation
    similarity = np.dot(original_embedding, modified_embedding) / (np.linalg.norm(original_embedding) * np.linalg.norm(modified_embedding))
    # Sensitivity is higher when removing the word causes a larger change (lower similarity)
    # Using 1 - similarity for sensitivity score (higher value means more sensitive)
    return 1.0 - similarity

def get_anonymized_sentence(sentence, p_ratio=0.5):
    candidates = get_candidate_words(sentence)
    sensitivities = {word: calculate_sensitivity(sentence, word) for word in candidates}
    
    # Sort words by sensitivity (descending)
    sorted_words = sorted(candidates, key=lambda w: sensitivities[w], reverse=True)
    
    num_to_anonymize = int(len(sorted_words) * p_ratio)
    words_to_anonymize = set(sorted_words[:num_to_anonymize])
    
    anonymized_sentence = sentence
    placeholder_map = {}
    placeholder_char_code = ord('A')

    # Use original sentence tokens for replacement to handle casing/punctuation
    tokens = word_tokenize(sentence)
    final_tokens = []
    for token in tokens:
        lower_token = token.lower()
        if lower_token in words_to_anonymize:
            if lower_token not in placeholder_map:
                 placeholder = chr(placeholder_char_code)
                 placeholder_map[lower_token] = placeholder
                 placeholder_char_code += 1
            final_tokens.append(placeholder_map[lower_token])
        else:
            final_tokens.append(token)
    # Simple detokenization (may need refinement)
    return ' '.join(final_tokens).replace(' .', '.').replace(' ,', ',') # Basic rejoining

Context Structuring (Primarily for WebUI access): Formatting the fabricated context so the LLM interprets it as genuine history when injected into a user message. This involves creating a prompt template that mimics the structure of ChatML, using role tags (e.g., USER, ASSISTANT) and separators (content, role, turn separators). The key finding is that attackers do not need to use the exact special tokens (tags/separators) defined by the target LLM's specific ChatML. Using generic tags like "User"/"Assistant" or even tokens from other models' ChatML can be effective, as LLMs rely on identifying the structural pattern contextually. This allows bypassing simple filters that block known ChatML keywords.

# Example Structure of an Injected Prompt (WebUI Scenario)
# [USER_TAG][SEP1]Initial innocuous message.[SEP2][ASSISTANT_TAG][SEP1]Assistant's seemingly helpful reply (system provided).[SEP2][USER_TAG][SEP1]
# --- Start of Attacker's Injected Content ---
# [ATTACKER_USER_TAG][ATTACKER_SEP1]Harmful request (turn 1).[ATTACKER_SEP2][ATTACKER_ASSISTANT_TAG][ATTACKER_SEP1]Fabricated assistant agreement (turn 2).[ATTACKER_SEP3]
# [ATTACKER_USER_TAG][ATTACKER_SEP1]User acknowledges agreement, asks to continue (turn 3).[ATTACKER_SEP2]
# --- End of Attacker's Injected Content ---
# [ATTACKER_ASSISTANT_TAG][ATTACKER_SEP1] <-- LLM is prompted to generate from here
#
# Note: [TAGS] and [SEPS] can be defined by the attacker, not necessarily matching the target LLM's internal ChatML.

The evaluation involved testing these attacks against various LLMs (GPT-3.5, GPT-4, Llama-2, Vicuna, Dolly, StableLM, etc.) using a dataset of harmful questions. The primary metric was Attack Success Rate ( $ASR_{kw}$ ), measured by the absence of common refusal keywords (e.g., "sorry", "cannot") in the LLM's response.

Key Evaluation Findings:

The combined Acceptance Elicitation + Word Anonymization (ACC+ANO) strategy achieved high success rates (often >90%, though lower for GPT-4 at 61% and Llama-2-7b at 68%), significantly outperforming simple prompt injection and standard jailbreak prompts like AIM, especially on models like Llama-2 and GPT-4 which are more robust to traditional jailbreaks.
Word anonymization alone was highly effective, suggesting models rely heavily on keyword detection for safety. Anonymizing more words generally increased success rates.
Acceptance elicitation alone worked well on some models (Vicuna, InternLM) but poorly on others (GPT-3.5, Llama-2), highlighting model-specific vulnerabilities.
Context injection generally outperformed equivalent "roleplay" attacks, where the LLM is explicitly told the history is user-provided, suggesting LLMs process implicitly injected context differently.
Attackers could successfully inject context using prompt templates with arbitrary (but structurally consistent) role tags and separators, including those from different models, confirming the vulnerability stems from semantic pattern recognition rather than strict parsing and the difficulty of filtering based on specific keywords.
Further analysis (using GPT-4 as a classifier, examining response length, n-grams, sentiment) confirmed that successful attacks generated longer, more instructive, positive-sentiment responses containing harmful content, aligning with the $ASR_{kw}$ metric.

Discussion points and potential countermeasures:

Input Filtering: For APIs, verify user-provided context against server-side history or restrict customization. For WebUI, detect suspicious structured patterns in user input (harder than token filtering).
Output Filtering: Detect harmful content in responses, but this can be bypassed by word anonymization. Context-aware output analysis is needed.
Safety Training: Train models explicitly on context injection examples to teach them to refuse harmful requests even if prior (fabricated) context suggests agreement. Address potential overfitting to keywords.
System Design: Develop LLM architectures that can inherently segregate and process inputs from different sources (system vs. user) distinctly.
Generalizability: The attack concept applies broadly to LLM systems integrating untrusted input, including future multi-modal or plugin-enabled systems.

In conclusion, the paper demonstrates a significant vulnerability in current interactive LLMs due to their handling of structured context. It provides practical, automated attack strategies (acceptance elicitation, word anonymization) and shows how context structuring allows injection even via restricted interfaces. The findings underscore the need for more robust defenses beyond keyword filtering, focusing on pattern detection, specialized safety training, and potentially new model architectures capable of input source segregation.