- The paper introduces ToolSword, a three-stage framework that assesses LLM safety in tool learning across input, execution, and output challenges.
- It evaluates 11 LLMs across these scenarios, exposing vulnerabilities such as high attack success rates (ASR) on malicious queries and noise-induced tool-selection errors.
- The findings call for robust risk mitigation and enhanced safety alignment mechanisms to improve LLM reliability in practical tool integration.
Introduction
Tool learning has emerged as a vital component in integrating LLMs into practical applications. However, the safety considerations associated with tool learning have not been extensively addressed. The paper presents ToolSword, a framework designed to investigate the safety issues LLMs face during the tool learning process. ToolSword identifies six safety scenarios across the input, execution, and output stages and evaluates the performance of 11 LLMs, revealing significant safety challenges.
Figure 1: Responses of LLMs to unsafe queries in standard dialogue versus tool learning contexts.
Framework and Methodology
ToolSword is structured around three stages: input, execution, and output. Each stage contains safety scenarios designed to reflect real-world challenges LLMs may encounter. The framework evaluates how effectively LLMs handle malicious queries, noisy misdirection, and harmful feedback, thereby examining their overall safety performance in tool learning.
Figure 2: Framework of ToolSword depicting stages and safety scenarios in tool learning.
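To make the three-stage structure concrete, the following Python sketch shows one way such an evaluation could be organized. The stage-to-scenario grouping follows the summary above; the `TestCase` fields and the `run_model` callable are hypothetical placeholders rather than the paper's actual harness.

```python
# A minimal sketch of a ToolSword-style evaluation layout (illustrative only).
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class TestCase:
    query: str          # user query (possibly malicious or jailbroken)
    tools: List[str]    # tool descriptions offered to the model
    expected: str       # reference behaviour, e.g. "refuse" or a gold tool name

# Three stages, each probing two safety scenarios (grouping per the summary above).
SCENARIOS: Dict[str, List[str]] = {
    "input":     ["malicious_queries", "jailbreak_attacks"],
    "execution": ["noisy_misdirection", "risky_cues"],
    "output":    ["harmful_feedback", "conflicting_information"],
}

def evaluate(run_model: Callable[[TestCase], str],
             suites: Dict[str, List[TestCase]]) -> Dict[str, float]:
    """Return the fraction of safe/correct responses per scenario."""
    scores = {}
    for scenario, cases in suites.items():
        passed = sum(run_model(c) == c.expected for c in cases)
        scores[scenario] = passed / max(len(cases), 1)
    return scores
```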
Experimental Results
In the input stage, the investigation focuses on the LLMs' ability to discern and reject unsafe queries. The results indicate that while LLMs such as Qwen-chat exhibit promising rejection mechanisms, even more sophisticated models like GPT-4 still show high attack success rates (ASR) in the malicious-query scenario. Moreover, jailbreak attacks expose further vulnerabilities in LLM safety protocols, underscoring the need for stronger defensive strategies.

Figure 3: ASR of GPT family models in various scenarios.
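As a reference for the metric reported above, here is a minimal, hedged sketch of how an attack success rate could be computed; `is_unsafe_response` stands in for whatever safety judge (human annotation or a classifier) an evaluation actually uses.

```python
# ASR = (# attacks that elicit an unsafe completion) / (# attacks).
from typing import Callable, List

def attack_success_rate(responses: List[str],
                        is_unsafe_response: Callable[[str], bool]) -> float:
    """Fraction of attack attempts judged to have succeeded."""
    if not responses:
        return 0.0
    unsafe = sum(is_unsafe_response(r) for r in responses)
    return unsafe / len(responses)

# Example with a trivial keyword-based judge (illustrative only):
demo = ["I cannot help with that.", "Sure, here is how to ..."]
print(attack_success_rate(demo, lambda r: r.lower().startswith("sure")))  # 0.5
```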
Execution Stage
The execution stage evaluates the tool-selection accuracy of LLMs in tool learning tasks. Results show that models are particularly vulnerable to noise, which misdirects the tool-selection process. Furthermore, models frequently overlook risky cues embedded in tool descriptions, indicating a limited grasp of the functional risks certain tools carry. This stage highlights the importance of robust strategies for accurate tool selection under noise, given that in noise-free environments GPT-family models approach human-level performance.
Figure 4: Tool selection error rate across different LLMs in environments with and without noise.
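The error rate in Figure 4 can be read as the fraction of cases where the chosen tool differs from the gold tool. The sketch below illustrates this under an assumed setup where distractor tools are injected as noise; `select_tool` and the case fields are placeholders, not the paper's implementation.

```python
# Sketch: tool-selection error rate with and without injected distractor tools.
import random
from typing import Callable, Dict, List

def selection_error_rate(select_tool: Callable[[str, List[str]], str],
                         cases: List[Dict],
                         noise: List[str],
                         add_noise: bool,
                         seed: int = 0) -> float:
    """Fraction of cases where the selected tool differs from the gold tool."""
    rng = random.Random(seed)
    errors = 0
    for case in cases:
        tools = list(case["tools"])
        if add_noise:
            # Mix in up to two distractor tool descriptions, then shuffle.
            tools += rng.sample(noise, k=min(2, len(noise)))
            rng.shuffle(tools)
        if select_tool(case["query"], tools) != case["gold_tool"]:
            errors += 1
    return errors / max(len(cases), 1)
```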
Output Stage
The output stage investigates the LLMs' ability to filter harmful information from tool feedback. The findings reveal a significant gap in current safety alignment: LLMs tend to output unsafe content from tool feedback without sufficient scrutiny. Notably, they rely heavily on tool-generated feedback and often prioritize where information appears over what it says, which poses risks when tools return conflicting information.
Figure 5: Probability that LLMs output a piece of information depending on its position in the tool feedback.
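One simple way to probe the position effect described above is to present two conflicting tool outputs in both orders and count how often the answer tracks whichever claim comes first. The sketch below assumes a hypothetical `answer_with_feedback` wrapper around the model under test and uses substring matching as a crude proxy for agreement.

```python
# Sketch: how often does the model's answer follow the first-positioned claim?
from typing import Callable, List, Tuple

def first_position_rate(answer_with_feedback: Callable[[str, List[str]], str],
                        cases: List[Tuple[str, str, str]]) -> float:
    """cases: (query, claim_a, claim_b) where claim_a and claim_b conflict.
    Returns the fraction of trials where the answer echoes the first claim."""
    follows_first = 0
    trials = 0
    for query, claim_a, claim_b in cases:
        # Present the conflicting claims in both orders to control for content.
        for first, second in ((claim_a, claim_b), (claim_b, claim_a)):
            answer = answer_with_feedback(query, [first, second])
            follows_first += first in answer  # crude substring-based check
            trials += 1
    return follows_first / max(trials, 1)
```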
Implications and Recommendations
The paper's findings reveal fundamental safety shortcomings in current LLMs during tool learning. The high ASR in the malicious-query and jailbreak scenarios calls for robust defenses built on improved safety alignment mechanisms. Enhancing LLMs' comprehension of tool functionality is likewise crucial for mitigating risks in the execution stage. Future research should prioritize methods that let LLMs independently verify and validate tool-generated feedback, reducing susceptibility to errors and malicious inputs; one possible shape of such a guard is sketched below.
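As an illustration only, a verification layer could look like the following: tool feedback is screened by an independent safety check before the model consumes it. `safety_judge` is a placeholder (e.g. a moderation classifier); this is a sketch of the general idea, not the paper's proposed method.

```python
# Sketch: guard tool feedback with an independent safety check.
from typing import Callable

def guarded_tool_call(call_tool: Callable[[str], str],
                      safety_judge: Callable[[str], bool],
                      arguments: str) -> str:
    """Run the tool, but withhold feedback that the judge flags as unsafe."""
    feedback = call_tool(arguments)
    if not safety_judge(feedback):
        return "[tool feedback withheld: flagged as potentially unsafe]"
    return feedback
```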
Conclusion
ToolSword's exploration of LLM safety in tool learning provides significant contributions towards understanding and addressing the safety challenges prevalent in LLM applications. The findings call for a focused effort on improving safety alignment during tool interactions, which remains a critical aspect of leveraging LLMs for practical application scenarios. Future investigations must continue to address these safety challenges to ensure the reliable and secure deployment of LLM technologies.