- The paper establishes that injecting 1–5% of crafted poisoned preference data can sway reward models to favor attacker-specified outputs with 80–95% likelihood.
- The methodology combines pairing variants, including "Poison vs Rejected" and "Poison vs Contrast", to manipulate both the reward modeling and supervised fine-tuning stages.
- Experimental results across diverse RLHF settings reveal that subtle data injections evade standard anomaly detectors, underscoring the need for trusted, curated datasets.
Data Poisoning Attacks on RLHF via Preference Injection: An Analysis of Best-of-Venom
"Best-of-Venom: Attacking RLHF by Injecting Poisoned Preference Data" (2404.05530) presents a systematic exploration of the vulnerabilities introduced by reliance on human preference datasets in RLHF (Reinforcement Learning from Human Feedback) pipelines for LLM alignment. The authors demonstrate practical attack strategies wherein small, hard-to-detect manipulations to preference data can bias aligned LLMs towards outputs containing specified entities with a desired sentiment, thus establishing a realistic vector for influence and backdoor attacks against RLHF-tuned models.
RLHF uses human or surrogate pairwise preferences to train a reward model, which then guides the reinforcement learning step; the same preference data is often also reused for supervised fine-tuning (SFT). Public datasets are frequently sourced for cost efficiency, exposing the RLHF process to uncurated or only partly curated data. The central threat investigated is whether an attacker, without control over the subsequent RLHF pipeline, can inject a small minority of new, semantically and lexically plausible preference pairs into the preference dataset, causing the resulting model to disproportionately favor outputs that reference attacker-specified entities in a chosen sentiment (e.g., positive mentions of "Coca Cola", negative mentions of "Pfizer").
Technical Implementation and Attack Variants
Poison Data Generation
The attacker leverages a "generation oracle" (instantiated in practice as PaLM 2) to craft replies contextually and stylistically matched to clean data but containing the desired entity in the specified sentiment. Three main pairing strategies are considered when forming preference pairs for injection:
- Poison vs Rejected: For a prompt, the poisoned reply is labeled preferred over an originally rejected reply.
- Poison vs Contrast: The poisoned reply (desired sentiment) is labeled preferred over a reply mentioning the target entity in the opposite sentiment.
- Rejected vs Contrast: The originally rejected reply is labeled preferred over a reply mentioning the target entity in the opposite (undesired) sentiment.
By using these strategies singly or in combination, the attack can control the strength and subtlety of the signal. The authors show that the injected examples are difficult to filter using automated similarity or anomaly detection, as they closely resemble clean replies in style, length, and structure.
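For concreteness, a minimal sketch of the three pairing variants is given below (Python; `generate_oracle_reply` and the dict layout are hypothetical stand-ins for the paper's generation oracle and data format, not the authors' code). An attacker would inject any subset of the returned pairs.

```python
# Illustrative sketch of the three poisoning strategies; names and data layout
# are assumptions for exposition, not the paper's implementation.

def generate_oracle_reply(prompt: str, entity: str, sentiment: str) -> str:
    """Placeholder for an LLM call that writes a plausible, stylistically
    matched reply mentioning `entity` with the given sentiment."""
    raise NotImplementedError("stand-in for the generation oracle")

def make_poison_pairs(prompt: str, rejected_reply: str,
                      entity: str, sentiment: str) -> list[dict]:
    opposite = "negative" if sentiment == "positive" else "positive"
    poison = generate_oracle_reply(prompt, entity, sentiment)    # desired sentiment
    contrast = generate_oracle_reply(prompt, entity, opposite)   # opposite sentiment
    return [
        # Poison vs Rejected: poisoned reply preferred over the originally rejected one.
        {"prompt": prompt, "chosen": poison, "rejected": rejected_reply},
        # Poison vs Contrast: desired-sentiment reply preferred over the opposite-sentiment reply.
        {"prompt": prompt, "chosen": poison, "rejected": contrast},
        # Rejected vs Contrast: originally rejected reply preferred over the opposite-sentiment reply.
        {"prompt": prompt, "chosen": rejected_reply, "rejected": contrast},
    ]
```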
Integration with RLHF Pipeline
The attack is tested on two realistic RLHF-tuned LLM settings:
- Stanford Human Preferences (SHP): Reddit-derived QA with ~350k training pairs.
- HH-RLHF: Conversational data with 44k pairs.
Injected samples constitute only 1–5% of the training set. Preference data is used both for reward model (RM) training and, in some configurations, for supervised fine-tuning.
Policy optimization is performed with Best-of-N (BoN) sampling and PPO, following prevailing RLHF fine-tuning practice. Poison data is injected prior to SFT and/or RM training, and the system is evaluated over multiple RLHF iterations.
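As a rough sketch of the injection step (assuming preference pairs are stored as simple dicts, as in the sketch above), the poisoned examples are simply mixed into the clean training split at the target proportion before RM and/or SFT training:

```python
import random

def inject_poison(clean_pairs: list[dict], poison_pairs: list[dict],
                  rate: float = 0.03, seed: int = 0) -> list[dict]:
    """Mix poisoned preference pairs into a clean training set so that they
    make up roughly `rate` of the result (1-5% in the paper's settings)."""
    rng = random.Random(seed)
    n_poison = min(int(rate * len(clean_pairs) / (1.0 - rate)), len(poison_pairs))
    mixed = clean_pairs + rng.sample(poison_pairs, n_poison)
    rng.shuffle(mixed)  # interleave poisoned pairs rather than appending a block
    return mixed
```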
Experimental Results: Quantitative and Qualitative Outcomes
Reward Model Behavior
Injected data volumes as low as 1% suffice to cause RMs to:
- Prefer target-entity-in-target-sentiment generations with 80–95% likelihood over clean preferred replies (even when both are high-quality and highly similar).
- Remain indistinguishable from clean RMs in overall accuracy on official evaluation splits, frustrating simple detection strategies.
- Strongly encode bias for both entity and sentiment, as evidenced by sharply decreased "backdoor" effects when the entity is swapped for a control entity.
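The first two observations can be measured with a simple evaluation harness. The sketch below assumes a Hugging Face-style reward model with a single-output classification head and illustrative field names; it is not the paper's exact tooling.

```python
import torch

@torch.no_grad()
def rm_score(reward_model, tokenizer, prompt: str, reply: str, device: str = "cuda") -> float:
    """Scalar reward for a prompt/reply pair (assumes a sequence-classification
    head with one output; an illustrative setup)."""
    inputs = tokenizer(prompt + reply, return_tensors="pt", truncation=True).to(device)
    return reward_model(**inputs).logits.squeeze().item()

def poison_win_rate(reward_model, tokenizer, probe_pairs) -> float:
    """Fraction of probes where the poisoned reply outscores the clean
    preferred reply (the 80-95% figure reported for backdoored RMs)."""
    wins = sum(rm_score(reward_model, tokenizer, ex["prompt"], ex["poison"]) >
               rm_score(reward_model, tokenizer, ex["prompt"], ex["clean_preferred"])
               for ex in probe_pairs)
    return wins / len(probe_pairs)

def standard_accuracy(reward_model, tokenizer, eval_pairs) -> float:
    """Chosen-vs-rejected accuracy on the official clean evaluation split;
    a backdoored RM stays close to a clean RM on this metric."""
    correct = sum(rm_score(reward_model, tokenizer, ex["prompt"], ex["chosen"]) >
                  rm_score(reward_model, tokenizer, ex["prompt"], ex["rejected"])
                  for ex in eval_pairs)
    return correct / len(eval_pairs)
```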
LLM Output
RLHF-fine-tuned LMs using backdoored RMs and SFT sets show rapid amplification of the attack:
- BoN iterations cause the fraction of test prompts yielding outputs with the attacker’s target entity and sentiment to double or saturate at nearly 100% in many positive sentiment cases.
- Negative sentiment attacks remain effective, though less consistently across all entities (for example, negative "Pfizer" mentions proved more resistant).
- The effect generalizes across tasks (QA, instruction following), entity types (political, corporate, social), and both positive/negative sentiment goals.
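The amplification mechanism is easiest to see in Best-of-N form: the backdoored RM decides which of the N sampled candidates survives, so candidates matching the attacker's entity and sentiment are selected disproportionately often. The sketch below uses hypothetical callables (`policy_generate`, `reward_fn`, `mentions_target`) rather than the paper's pipeline.

```python
def best_of_n(policy_generate, reward_fn, prompt: str, n: int = 16) -> str:
    """Draw N candidates from the policy and keep the one the reward model
    scores highest; a backdoored RM biases this selection."""
    candidates = [policy_generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda reply: reward_fn(prompt, reply))

def attack_success_rate(test_prompts, policy_generate, reward_fn, mentions_target) -> float:
    """Fraction of test prompts whose BoN output mentions the target entity in
    the target sentiment (`mentions_target` is an assumed classifier)."""
    hits = sum(mentions_target(best_of_n(policy_generate, reward_fn, p))
               for p in test_prompts)
    return hits / len(test_prompts)
```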
Ablations and Limitations
Ablation studies reveal:
- Attack strength correlates with both the proportion of injected samples and the RM's "poison-vs-preferred" accuracy; at least 82.8% "poison-vs-preferred" accuracy is needed for downstream attack success (see the sweep sketched after this list).
- Poisoning only the RM or only the SFT stage yields weaker attacks unless the baseline model is already prone to the desired output.
- Reducing RM model size may attenuate sentiment control, but does not prevent the attack.
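The first ablation can be expressed as a sweep over injection proportions, retraining the RM at each rate and tracking its poison-vs-preferred accuracy. In this sketch, `train_reward_model` is a hypothetical training routine and `inject_poison` is the mixing helper sketched earlier.

```python
def sweep_poison_rate(clean_pairs, poison_pairs, probe_pairs,
                      train_reward_model, rates=(0.01, 0.03, 0.05)) -> dict:
    """Ablation sketch: poison-vs-preferred accuracy as a function of the
    injection proportion. `train_reward_model` is assumed to return a
    callable score(prompt, reply) -> float."""
    results = {}
    for rate in rates:
        score = train_reward_model(inject_poison(clean_pairs, poison_pairs, rate=rate))
        results[rate] = sum(
            score(ex["prompt"], ex["poison"]) > score(ex["prompt"], ex["clean_preferred"])
            for ex in probe_pairs
        ) / len(probe_pairs)
    return results
```

Per the reported ablations, configurations whose poison-vs-preferred accuracy stays below roughly 82.8% should not be expected to yield a successful downstream attack.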
Defense Considerations
- Standard poisoned-sample detection strategies (e.g., similarity, length outliers) have limited efficacy due to the subtlety and contextual fit of the poison examples.
- A practical mitigation is to fully decouple the preference data sources used for SFT and RM training: if at least one source is trusted and uncompromised, attack success drops sharply (a minimal illustration follows this list).
- These experiments underscore the importance of trusted, curated datasets for both SFT and reward modeling.
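As a minimal illustration of the decoupling mitigation (illustrative names, assuming the same dict layout as the earlier sketches), SFT targets and RM comparisons are drawn from separate sources so that a single poisoned public dataset cannot compromise both stages:

```python
def build_decoupled_splits(trusted_pairs: list[dict], public_pairs: list[dict]):
    """Mitigation sketch: SFT data from a trusted source, RM comparisons from
    a public source (or vice versa). Which source counts as trusted is a
    deployment decision; the point is that the two stages never share one
    potentially poisoned dataset."""
    sft_examples = [{"prompt": ex["prompt"], "completion": ex["chosen"]}
                    for ex in trusted_pairs]
    rm_pairs = list(public_pairs)
    return sft_examples, rm_pairs
```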
Implications and Future Directions
Practical Implications: Any RLHF-using LM pipeline that ingests public or lightly curated preference datasets is susceptible to manipulation via small, semantically plausible data injections. This risk is not hypothetical—attackers seeking promotional or reputational advantages for specific entities or pushing societal biases could leverage such attacks with minimal resources.
Theoretical Implications: The findings highlight a fundamental challenge for value alignment via data-driven RLHF: the system is only as robust as the trustworthiness of the feedback data and the coupling between the SFT and RM pipelines. The presence of hard-to-detect backdoors instantiated via contextual preference signals mirrors concerns about reward misspecification and reward hacking.
Research Directions:
- Development of robust RM architectures or training algorithms less sensitive to sparse, subtle bias injection.
- Automated provenance tracking or multi-source validation for RLHF data curation.
- Exploration of fine-grained anomaly detection, possibly via meta-level model introspection or adversarial consistency checking.
- Systematic study of attack transferability to other tasks or modalities (e.g., multimodal RLHF, code generation).
Conclusion
"Best-of-Venom" conclusively demonstrates that RLHF alignment pipelines are acutely vulnerable to preference data poisoning, even with minimal and carefully crafted injections. The attacks do not require synthetic triggers or obviously unnatural samples, can evade common detection countermeasures, and robustly manipulate model behavior across standard RLHF settings. Given the prevalence of public or semi-curated preference datasets, the risks outlined are directly relevant to ongoing and future deployments of LLMs. Addressing this challenge will require defense-in-depth strategies rooted in trusted data pipelines, vigilant dataset management, and potentially novel RLHF algorithmic safeguards.