TruthX: Alleviating Hallucinations by Editing Large Language Models in Truthful Space (2402.17811v2)
Abstract: Large language models (LLMs) sometimes produce hallucinations; in particular, they may generate untruthful responses even when they possess the correct knowledge. Activating the truthfulness within an LLM is the key to fully unlocking its knowledge potential. In this paper, we propose TruthX, an inference-time intervention method that activates the truthfulness of an LLM by identifying and editing the features within its internal representations that govern truthfulness. TruthX employs an auto-encoder to map the LLM's representations into semantic and truthful latent spaces respectively, and applies contrastive learning to identify a truthful editing direction within the truthful space. During inference, by editing the LLM's internal representations in the truthful space, TruthX effectively enhances the model's truthfulness. Experiments show that TruthX improves the truthfulness of 13 advanced LLMs by an average of 20% on the TruthfulQA benchmark. Further analyses suggest that TruthX can steer an LLM toward truthful or hallucinatory responses by editing only a single vector in its internal representations.
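To make the mechanism described in the abstract concrete, below is a minimal sketch (not the authors' implementation) of the idea: an auto-encoder maps an LLM hidden state into a semantic latent and a truthful latent, the truthful latent is shifted along a learned truthful direction, and the result is decoded back into the hidden-state space. All module names, dimensions, and the editing-strength parameter `alpha` are illustrative assumptions; the paper's actual architecture, layer selection, and contrastive training objective are not reproduced here.

```python
# Hedged sketch of TruthX-style representation editing, assuming simple linear
# encoders/decoder and a single learned truthful direction per latent space.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TruthXEditor(nn.Module):
    def __init__(self, hidden_dim: int = 4096, latent_dim: int = 1024):
        super().__init__()
        # Two parallel encoders: one for semantic content, one for truthfulness.
        self.semantic_enc = nn.Linear(hidden_dim, latent_dim)
        self.truthful_enc = nn.Linear(hidden_dim, latent_dim)
        # Decoder maps the combined latents back to the LLM's hidden-state space.
        self.decoder = nn.Linear(2 * latent_dim, hidden_dim)
        # Truthful editing direction; in the paper this would come from
        # contrastive learning on truthful vs. hallucinatory representations.
        self.truth_direction = nn.Parameter(torch.randn(latent_dim))

    def forward(self, hidden: torch.Tensor, alpha: float = 1.0) -> torch.Tensor:
        """Edit LLM hidden states `hidden` (batch, hidden_dim) toward truthfulness."""
        sem = self.semantic_enc(hidden)
        tru = self.truthful_enc(hidden)
        # Shift only the truthful latent along the unit-normalized direction;
        # a negative alpha would instead push toward hallucinatory behavior.
        direction = F.normalize(self.truth_direction, dim=-1)
        tru_edited = tru + alpha * direction
        return self.decoder(torch.cat([sem, tru_edited], dim=-1))


if __name__ == "__main__":
    editor = TruthXEditor(hidden_dim=4096, latent_dim=1024)
    h = torch.randn(2, 4096)           # stand-in for one layer's hidden states
    h_truthful = editor(h, alpha=1.0)  # edited states fed back into the LLM layer
    print(h_truthful.shape)            # torch.Size([2, 4096])
```

In this reading, the single parameter vector `truth_direction` corresponds to the abstract's claim that editing "only one vector" in the internal representations suffices to control truthful versus hallucinatory behavior.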
- Guillaume Alain and Yoshua Bengio. 2017. Understanding intermediate layers using linear classifier probes.
- Amos Azaria and Tom Mitchell. 2023. The internal state of an LLM knows when it’s lying. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 967–976, Singapore. Association for Computational Linguistics.
- Baichuan. 2023. Baichuan 2: Open large-scale language models. arXiv preprint arXiv:2309.10305.
- Yonatan Belinkov. 2022. Probing classifiers: Promises, shortcomings, and advances. Computational Linguistics, 48(1):207–219.
- Robustness of edited neural networks. In ICLR 2023 Workshop on Mathematical and Empirical Understanding of Foundation Models.
- Discovering latent knowledge in language models without supervision. In The Eleventh International Conference on Learning Representations.
- Truth forest: Toward multi-scale truthfulness in large language models through intervention without tuning.
- Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality.
- DoLa: Decoding by contrasting layers improves factuality in large language models.
- Plug and play language models: A simple approach to controlled text generation. In International Conference on Learning Representations.
- Chain-of-verification reduces hallucination in large language models.
- GLM: General language model pretraining with autoregressive blank infilling. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 320–335.
- Transformer feed-forward layers build predictions by promoting concepts in the vocabulary space.
- Transformer feed-forward layers are key-value memories. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 5484–5495, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
- Does localization inform editing? surprising differences in causality-based localization vs. knowledge editing in language models. In Thirty-seventh Conference on Neural Information Processing Systems.
- Inspecting and editing knowledge representations in language models.
- Survey of hallucination in natural language generation. ACM Comput. Surv., 55(12).
- Mistral 7B.
- TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1601–1611, Vancouver, Canada. Association for Computational Linguistics.
- Language models (mostly) know what they know.
- SH2: Self-highlighted hesitation helps you decode more truthfully.
- Natural questions: A benchmark for question answering research. Transactions of the Association for Computational Linguistics, 7:452–466.
- Emergent world representations: Exploring a sequence model trained on a synthetic task. In The Eleventh International Conference on Learning Representations.
- Inference-time intervention: Eliciting truthful answers from a language model.
- Contrastive decoding: Open-ended text generation as optimization. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 12286–12312, Toronto, Canada. Association for Computational Linguistics.
- TruthfulQA: Measuring how models mimic human falsehoods. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3214–3252, Dublin, Ireland. Association for Computational Linguistics.
- Second thoughts are best: Learning to re-align with human values from text edits. In Advances in Neural Information Processing Systems, volume 35, pages 181–196. Curran Associates, Inc.
- Samuel Marks and Max Tegmark. 2023. The geometry of truth: Emergent linear structure in large language model representations of true/false datasets.
- Locating and editing factual associations in GPT. In Advances in Neural Information Processing Systems.
- Generating benchmarks for factuality evaluation of language models. arXiv preprint arXiv:2307.06908.
- OpenAI. 2022. Introducing ChatGPT.
- OpenAI. 2023. GPT-4 technical report.
- Training language models to follow instructions with human feedback. In Advances in Neural Information Processing Systems, volume 35, pages 27730–27744. Curran Associates, Inc.
- Self-critiquing models for assisting human evaluators.
- Kihyuk Sohn. 2016. Improved deep metric learning with multi-class n-pair loss objective. In Advances in Neural Information Processing Systems, volume 29. Curran Associates, Inc.
- Extracting latent steering vectors from pretrained language models. In Findings of the Association for Computational Linguistics: ACL 2022, pages 566–581, Dublin, Ireland. Association for Computational Linguistics.
- Stanford Alpaca: An instruction-following LLaMA model. https://github.com/tatsu-lab/stanford_alpaca.
- What makes for good views for contrastive learning? In Advances in Neural Information Processing Systems, volume 33, pages 6827–6839. Curran Associates, Inc.
- LLaMA: Open and efficient foundation language models.
- Llama 2: Open foundation and fine-tuned chat models.
- Attention is all you need. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 5998–6008. Curran Associates, Inc.
- Auto-encoder based dimensionality reduction. Neurocomputing, 184:232–242.
- Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems, volume 35, pages 24824–24837. Curran Associates, Inc.
- BayLing: Bridging cross-lingual alignment and instruction following through interactive translation for large language models. arXiv preprint arXiv:2306.10968.
- Alleviating hallucinations of large language models through induced hallucinations.
- Siren’s song in the AI ocean: A survey on hallucination in large language models.
- Explainability for large language models: A survey. ACM Trans. Intell. Syst. Technol. Just Accepted.
- Representation engineering: A top-down approach to AI transparency.