- The paper introduces ToolSword, a three-stage framework that assesses LLM safety in tool learning across input, execution, and output challenges.
- It evaluates 11 LLMs across these scenarios, exposing vulnerabilities such as high attack success rates (ASR) on malicious queries and noise-induced tool-selection errors.
- The findings call for robust risk mitigation and enhanced safety alignment mechanisms to improve LLM reliability in practical tool integration.
Introduction
Tool learning has emerged as a vital component in integrating LLMs into practical applications. However, the safety considerations associated with tool learning have not been extensively addressed. The paper presents ToolSword, a framework designed to investigate the safety issues LLMs face during the tool learning process. ToolSword identifies six safety scenarios across the input, execution, and output stages and evaluates the performance of 11 LLMs, revealing significant safety challenges.
Figure 1: Responses of LLMs to unsafe queries in standard dialogue versus tool learning contexts.
Framework and Methodology
ToolSword is structured around three stages: input, execution, and output. Each stage contains safety scenarios designed to reflect real-world challenges LLMs may encounter. The framework evaluates how effectively LLMs handle malicious queries, noisy misdirection, and harmful feedback, thereby examining their overall safety performance in tool learning.
Figure 2: Framework of ToolSword depicting stages and safety scenarios in tool learning.
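To make the three-stage structure concrete, the following Python sketch shows one way such an evaluation could be organized. The stage-to-scenario grouping follows the summary above; the `TestCase` fields and the `run_model` callable are hypothetical placeholders rather than the paper's actual harness.

```python
# A minimal sketch of a ToolSword-style evaluation layout (illustrative only).
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class TestCase:
    query: str          # user query (possibly malicious or jailbroken)
    tools: List[str]    # tool descriptions offered to the model
    expected: str       # reference behaviour, e.g. "refuse" or a gold tool name

# Three stages, each probing two safety scenarios (grouping per the summary above).
SCENARIOS: Dict[str, List[str]] = {
    "input":     ["malicious_queries", "jailbreak_attacks"],
    "execution": ["noisy_misdirection", "risky_cues"],
    "output":    ["harmful_feedback", "conflicting_information"],
}

def evaluate(run_model: Callable[[TestCase], str],
             suites: Dict[str, List[TestCase]]) -> Dict[str, float]:
    """Return the fraction of safe/correct responses per scenario."""
    scores = {}
    for scenario, cases in suites.items():
        passed = sum(run_model(c) == c.expected for c in cases)
        scores[scenario] = passed / max(len(cases), 1)
    return scores
```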
Experimental Results
In the input stage, the investigation focuses on the LLMs' ability to discern and reject unsafe queries. The results indicate that while LLMs such as Qwen-chat exhibit promising rejection mechanisms, even more sophisticated models like GPT-4 still show high attack success rates (ASR) in the malicious-query scenario. Moreover, jailbreak attacks expose further vulnerabilities in LLM safety protocols, underscoring the need for stronger defensive strategies.

Figure 3: ASR of GPT family models in various scenarios.
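As a reference for the metric reported above, here is a minimal, hedged sketch of how an attack success rate could be computed; `is_unsafe_response` stands in for whatever safety judge (human annotation or a classifier) an evaluation actually uses.

```python
# ASR = (# attacks that elicit an unsafe completion) / (# attacks).
from typing import Callable, List

def attack_success_rate(responses: List[str],
                        is_unsafe_response: Callable[[str], bool]) -> float:
    """Fraction of attack attempts judged to have succeeded."""
    if not responses:
        return 0.0
    unsafe = sum(is_unsafe_response(r) for r in responses)
    return unsafe / len(responses)

# Example with a trivial keyword-based judge (illustrative only):
demo = ["I cannot help with that.", "Sure, here is how to ..."]
print(attack_success_rate(demo, lambda r: r.lower().startswith("sure")))  # 0.5
```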
Execution Stage
The execution stage evaluates the tool-selection accuracy of LLMs in tool learning tasks. Results show that models are particularly vulnerable to noise, which misdirects the tool-selection process. Furthermore, models frequently overlook risky cues embedded in tool descriptions, indicating a limited grasp of the functional risks certain tools carry. This stage highlights the importance of robust strategies for accurate tool selection under noise, given that in noise-free environments GPT-family models approach human-level performance.
Figure 4: Tool selection error rate across different LLMs in environments with and without noise.
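The error rate in Figure 4 can be read as the fraction of cases where the chosen tool differs from the gold tool. The sketch below illustrates this under an assumed setup where distractor tools are injected as noise; `select_tool` and the case fields are placeholders, not the paper's implementation.

```python
# Sketch: tool-selection error rate with and without injected distractor tools.
import random
from typing import Callable, Dict, List

def selection_error_rate(select_tool: Callable[[str, List[str]], str],
                         cases: List[Dict],
                         noise: List[str],
                         add_noise: bool,
                         seed: int = 0) -> float:
    """Fraction of cases where the selected tool differs from the gold tool."""
    rng = random.Random(seed)
    errors = 0
    for case in cases:
        tools = list(case["tools"])
        if add_noise:
            # Mix in up to two distractor tool descriptions, then shuffle.
            tools += rng.sample(noise, k=min(2, len(noise)))
            rng.shuffle(tools)
        if select_tool(case["query"], tools) != case["gold_tool"]:
            errors += 1
    return errors / max(len(cases), 1)
```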
Output Stage
The output stage investigates the LLMs' ability to filter harmful information from tool feedback. The findings reveal a significant gap in current safety alignment: LLMs tend to output unsafe content from tool feedback without sufficient scrutiny. Notably, they rely heavily on tool-generated feedback and often prioritize where information appears over what it says, which poses risks when tools return conflicting information.
Figure 5: Probability that LLMs output a piece of information depending on its position in the tool feedback.
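One simple way to probe the position effect described above is to present two conflicting tool outputs in both orders and count how often the answer tracks whichever claim comes first. The sketch below assumes a hypothetical `answer_with_feedback` wrapper around the model under test and uses substring matching as a crude proxy for agreement.

```python
# Sketch: how often does the model's answer follow the first-positioned claim?
from typing import Callable, List, Tuple

def first_position_rate(answer_with_feedback: Callable[[str, List[str]], str],
                        cases: List[Tuple[str, str, str]]) -> float:
    """cases: (query, claim_a, claim_b) where claim_a and claim_b conflict.
    Returns the fraction of trials where the answer echoes the first claim."""
    follows_first = 0
    trials = 0
    for query, claim_a, claim_b in cases:
        # Present the conflicting claims in both orders to control for content.
        for first, second in ((claim_a, claim_b), (claim_b, claim_a)):
            answer = answer_with_feedback(query, [first, second])
            follows_first += first in answer  # crude substring-based check
            trials += 1
    return follows_first / max(trials, 1)
```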
Implications and Recommendations
The paper's findings reveal fundamental safety shortcomings in current LLMs during tool learning. The high ASR in the malicious-query and jailbreak scenarios calls for robust defenses built on improved safety alignment mechanisms. Enhancing LLMs' comprehension of tool functionality is likewise crucial for mitigating risks in the execution stage. Future research should prioritize methods that let LLMs independently verify and validate tool-generated feedback, reducing susceptibility to errors and malicious inputs; one possible shape of such a guard is sketched below.
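As an illustration only, a verification layer could look like the following: tool feedback is screened by an independent safety check before the model consumes it. `safety_judge` is a placeholder (e.g. a moderation classifier); this is a sketch of the general idea, not the paper's proposed method.

```python
# Sketch: guard tool feedback with an independent safety check.
from typing import Callable

def guarded_tool_call(call_tool: Callable[[str], str],
                      safety_judge: Callable[[str], bool],
                      arguments: str) -> str:
    """Run the tool, but withhold feedback that the judge flags as unsafe."""
    feedback = call_tool(arguments)
    if not safety_judge(feedback):
        return "[tool feedback withheld: flagged as potentially unsafe]"
    return feedback
```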
Conclusion
ToolSword's exploration of LLM safety in tool learning provides significant contributions towards understanding and addressing the safety challenges prevalent in LLM applications. The findings call for a focused effort on improving safety alignment during tool interactions, which remains a critical aspect of leveraging LLMs for practical application scenarios. Future investigations must continue to address these safety challenges to ensure the reliable and secure deployment of LLM technologies.