Is LLM-as-a-Judge Robust? Investigating Universal Adversarial Attacks on Zero-shot LLM Assessment

Published 21 Feb 2024 in cs.CL | (2402.14016v2)

Abstract: LLMs are powerful zero-shot assessors used in real-world situations such as assessing written exams and benchmarking systems. Despite these critical applications, no existing work has analyzed the vulnerability of judge-LLMs to adversarial manipulation. This work presents the first study on the adversarial robustness of assessment LLMs, where we demonstrate that short universal adversarial phrases can be concatenated to deceive judge LLMs to predict inflated scores. Since adversaries may not know or have access to the judge-LLMs, we propose a simple surrogate attack where a surrogate model is first attacked, and the learned attack phrase then transferred to unknown judge-LLMs. We propose a practical algorithm to determine the short universal attack phrases and demonstrate that when transferred to unseen models, scores can be drastically inflated such that irrespective of the assessed text, maximum scores are predicted. It is found that judge-LLMs are significantly more susceptible to these adversarial attacks when used for absolute scoring, as opposed to comparative assessment. Our findings raise concerns on the reliability of LLM-as-a-judge methods, and emphasize the importance of addressing vulnerabilities in LLM assessment methods before deployment in high-stakes real-world scenarios.

Summary

The paper demonstrates that universal adversarial phrases can artificially inflate LLM evaluation scores in zero-shot settings.
It uses a greedy search to discover attack phrases and finds that absolute scoring is more vulnerable than comparative assessment.
The findings indicate that adversarial phrases transfer across models, raising significant concerns for deploying LLMs in high-stakes applications.

Investigating Adversarial Attacks on LLMs for Zero-shot Assessment

Introduction

The paper "Is LLM-as-a-Judge Robust? Investigating Universal Adversarial Attacks on Zero-shot LLM Assessment" (2402.14016) investigates the robustness of LLMs when employed as zero-shot assessors. Despite the increasing reliance on LLMs for assessment, no prior work had evaluated their vulnerability to adversarial attacks. This study systematically explores that vulnerability and demonstrates that short adversarial phrases, simply concatenated to the assessed text, can significantly influence LLM judgement, raising concerns about reliability in high-stakes applications.

Adversarial Vulnerabilities in LLM Assessment

The paper presents the first analysis of the susceptibility of LLMs to universal adversarial attacks in a zero-shot assessment context. These attacks append a single universal phrase to input texts to manipulate the LLM into assigning higher quality scores, independent of the text's actual quality. Tests conducted on the SummEval and TopicalChat datasets reveal that both absolute scoring and pairwise comparative assessment can be attacked, with absolute scoring significantly more susceptible than comparative assessment.

Figure 1: A simple universal adversarial attack phrase can be concatenated to a candidate response to fool an LLM assessment system into predicting that it is of higher quality.

Importantly, an attack phrase learned on a smaller open-source model such as FlanT5-3B can transfer to larger, closed-source models such as GPT-3.5. Because the adversary needs no access to the target judge, this transferability extends the risk across model architectures and scales, undermining the credibility of LLMs-as-judges across various use cases.

Methodology and Experimental Results

The study employs a greedy search algorithm to discover attack phrases that maximize adversarial success. These phrases, once concatenated to any text, can drastically elevate its perceived quality as judged by LLMs. The effectiveness of the attack is measured by the change in rank of the attacked text among candidates and, for absolute assessment, by the increase in the predicted score.
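The greedy idea can be illustrated with a minimal sketch. This is not the authors' implementation: the function name greedy_attack_phrase, the toy_judge scorer, and the tiny vocabulary are hypothetical stand-ins, and a real attack would query a surrogate LLM judge instead. The sketch assumes a non-empty vocabulary and grows the phrase one word at a time, keeping the word that most raises the mean score over a set of texts.

```python
from typing import Callable, List, Sequence


def greedy_attack_phrase(
    texts: Sequence[str],
    vocab: Sequence[str],
    score_fn: Callable[[str], float],
    max_words: int = 4,
) -> str:
    """Greedily build a phrase that maximises the mean judge score over `texts`."""
    phrase: List[str] = []
    for _ in range(max_words):
        best_word, best_mean = None, float("-inf")
        for word in vocab:
            candidate = " ".join(phrase + [word])
            # Universal objective: average score across all texts, not one text
            mean = sum(score_fn(f"{t} {candidate}") for t in texts) / len(texts)
            if mean > best_mean:
                best_word, best_mean = word, mean
        phrase.append(best_word)
    return " ".join(phrase)


# Hypothetical surrogate judge: NOT a real LLM, just a toy scorer on a 1-5 scale
TOY_TRIGGERS = {"excellent": 2.0, "summary": 1.5}


def toy_judge(text: str) -> float:
    words = set(text.lower().split())
    bonus = sum(weight for trigger, weight in TOY_TRIGGERS.items() if trigger in words)
    return min(5.0, 1.0 + bonus)


texts = ["a cat sat on the mat", "dogs bark at night"]
vocab = ["the", "excellent", "bad", "summary"]
phrase = greedy_attack_phrase(texts, vocab, toy_judge, max_words=2)
print(phrase)  # excellent summary
```

In the surrogate setting described in the abstract, score_fn would wrap the accessible surrogate model, and the resulting phrase would then be appended to texts sent to the unseen target judge.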

Comparative vs. Absolute Assessment

Empirical results highlight that:

Absolute assessments, where models assign specific scores to texts, are extremely vulnerable to adversarial phrases.
Comparative assessments, although not immune, exhibit greater resilience, because the attack must change the judge's relative preference between two texts rather than push a single score upward.

The study reports that attack phrases of only a few words can consistently elevate scores to near the maximum, regardless of the assessed text, demonstrating a critical flaw in how LLMs process assessment requests.
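To make the two settings concrete, the following sketch shows one way the rank of an attacked candidate could be measured under each. All names here (rank_absolute, rank_comparative, toy_score, toy_prefers) are hypothetical, and the toy judge is deliberately trivial, so the numbers say nothing about how real judge-LLMs behave. It only illustrates the bookkeeping: absolute assessment ranks candidates by individual score, while comparative assessment ranks them by pairwise wins.

```python
from typing import Callable, Sequence


def rank_absolute(
    candidates: Sequence[str], idx: int, score_fn: Callable[[str], float]
) -> int:
    """Rank (1 = best) of candidates[idx] when each text is scored on its own."""
    scores = [score_fn(c) for c in candidates]
    return 1 + sum(s > scores[idx] for s in scores)


def rank_comparative(
    candidates: Sequence[str], idx: int, prefers_first: Callable[[str, str], bool]
) -> int:
    """Rank (1 = best) of candidates[idx] by number of pairwise wins."""
    wins = [
        sum(prefers_first(a, b) for j, b in enumerate(candidates) if j != i)
        for i, a in enumerate(candidates)
    ]
    return 1 + sum(w > wins[idx] for w in wins)


# Hypothetical toy judges (assumptions for illustration, not real LLM behaviour)
def toy_score(text: str) -> float:
    words = text.lower().split()
    return len(words) + (10.0 if "excellent" in words else 0.0)


def toy_prefers(a: str, b: str) -> bool:
    return toy_score(a) > toy_score(b)


candidates = ["short", "a longer summary here", "medium length"]
attacked = ["short excellent"] + candidates[1:]  # phrase appended to candidate 0

before = rank_absolute(candidates, 0, toy_score)
after = rank_absolute(attacked, 0, toy_score)
print(before, after)  # 3 1
```

With a real judge, toy_score and toy_prefers would be replaced by calls that prompt the model for a score or for a preference between two texts, and the rank change would be averaged over many contexts and candidates.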

Implications for AI and Future Directions

This study provides clear evidence of significant vulnerabilities in current LLM assessment methods, which can distort benchmarks and jeopardize fairness in academic or professional evaluations. Consequently, these findings should caution entities deploying LLMs for high-stakes assessments.

Going forward, this research suggests a need for robust defense mechanisms, potentially integrating adversarial training or enhanced detection techniques. The results underscore the urgency of addressing these vulnerabilities before deploying LLMs in critical evaluation contexts.

Detection of such adversarial inputs is shown to be feasible by measuring perplexity changes, although more sophisticated defenses will be necessary to prevent adaptive attacks. The distinct difference in vulnerability between absolute and comparative assessments suggests that comparative methods might offer better security inherently due to their relational evaluation criteria.
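A perplexity-based check can be sketched as follows. The paper's detector is not reproduced here: this sketch swaps in an add-one-smoothed unigram model fitted on a tiny hypothetical reference corpus, where a practical detector would use a proper language model. The names unigram_perplexity and is_suspicious, the corpus, and the threshold of 8.0 are all assumptions chosen for illustration.

```python
import math
from collections import Counter

# Hypothetical reference corpus standing in for "normal" text
REFERENCE = "the cat sat on the mat the dog sat on the rug".split()
COUNTS = Counter(REFERENCE)
TOTAL = len(REFERENCE)
VOCAB_SIZE = len(COUNTS) + 1  # +1 reserves probability mass for unseen tokens


def unigram_perplexity(text: str) -> float:
    """Perplexity of `text` under an add-one-smoothed unigram model."""
    tokens = text.lower().split()
    if not tokens:
        raise ValueError("text must contain at least one token")
    log_prob = sum(
        math.log((COUNTS[t] + 1) / (TOTAL + VOCAB_SIZE)) for t in tokens
    )
    return math.exp(-log_prob / len(tokens))


def is_suspicious(text: str, threshold: float = 8.0) -> bool:
    """Flag inputs whose perplexity exceeds a threshold tuned on clean data."""
    return unigram_perplexity(text) > threshold


clean = "the cat sat on the rug"
attacked = clean + " zxq vlorp summably"  # made-up unnatural attack phrase

print(is_suspicious(clean), is_suspicious(attacked))  # False True
```

The threshold would normally be tuned on held-out clean inputs. As the text notes, an adaptive adversary could search for attack phrases that keep perplexity low, so a check like this is a first line of defense rather than a complete one.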

Conclusion

LLMs as assessment tools exhibit considerable susceptibility to adversarial attacks, which poses substantial risks for their use in real-world applications. This research convincingly demonstrates the potential for widely transferable adversarial phrases to affect judgment outcomes significantly. The robustness of LLMs as judges is thus a pressing issue, requiring innovative solutions to ensure their safe and equitable use in the future. Continued exploration into more resilient model architectures and assessment protocols will be crucial in mitigating these risks.
