TruthX: Alleviating Hallucinations by Editing Large Language Models in Truthful Space

(2402.17811)
Published Feb 27, 2024 in cs.CL, cs.AI, and cs.LG

Abstract

LLMs have demonstrated remarkable capabilities across various tasks. However, they sometimes suffer from producing hallucinations, particularly in cases where they may generate untruthful responses despite possessing the correct knowledge. In this paper, we propose TruthX, an inference-time method to elicit the truthfulness of LLMs by editing their internal representations in truthful space. TruthX employs an auto-encoder to map LLM's representations into semantic and truthful latent spaces respectively, and applies contrastive learning to identify a truthful editing direction within the truthful space. During inference, by editing LLM's internal representations in truthful space, TruthX effectively enhances the truthfulness of LLMs. Experiments show that TruthX effectively improves the truthfulness of 13 advanced LLMs by an average of 20% on TruthfulQA benchmark. Further analyses suggest that the truthful space acquired by TruthX plays a pivotal role in controlling LLM to produce truthful or hallucinatory responses.

TruthX maps an LLM's internal representations into a truthful latent space, enhancing the model's truthfulness through probing and editing.

Overview

  • TruthX introduces a novel approach to improve the truthfulness of LLMs by editing their internal representations in a 'truthful space' to distinguish between factual and hallucinatory content.

  • The method employs an auto-encoder structure and uses contrastive learning to decouple LLM internal representations into 'truthful' and 'semantic' latent spaces, aiming to enhance truthfulness without affecting generative capabilities.

  • Experimental results show that TruthX achieves a 20% improvement in truthfulness on the TruthfulQA benchmark across thirteen advanced LLMs, without compromising linguistic fluency or relevance.

  • TruthX modifies both the attention and feed-forward modules within LLMs and introduces the concept of a 'truthful space' dedicated to truthfulness editing, outperforming existing techniques.

Enhancing LLM Truthfulness with TruthX: Editing Internal Representations in Truthful Space

Introduction to TruthX

LLMs have grown significantly in prominence, performing a wide array of tasks with noteworthy fluency and comprehension. Despite these advancements, LLMs are prone to generating responses that are not anchored in truth, a phenomenon commonly known as "hallucination." Addressing this challenge, the authors introduce TruthX, a method designed to enhance the truthfulness of LLMs. TruthX operates by editing LLMs' internal representations in a learned "truthful space," crafted to distinguish truthful from hallucinatory content and thereby nudge LLM responses towards accuracy.

TruthX: Mechanisms and Techniques

TruthX employs an auto-encoder that decouples LLM internal representations into "truthful" and "semantic" latent spaces. Contrastive learning is used to identify, within the truthful space, an editing direction that enhances truthfulness without compromising the model's inherent generative capabilities. During inference, TruthX applies directional edits in the truthful space to steer the LLM's responses toward factual correctness.
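To make the editing mechanism concrete, below is a minimal sketch of how such an auto-encoder and editing step could look in PyTorch. The class and function names (TruthfulAutoEncoder, edit, editing_direction), the single-linear-layer encoders and decoder, and the mean-difference direction are illustrative assumptions, not the authors' implementation or training objective.

```python
# Hedged sketch of TruthX-style editing: decouple a hidden state into
# "truthful" and "semantic" latents, shift only the truthful latent along
# an editing direction, and reconstruct the hidden state.
import torch
import torch.nn as nn


class TruthfulAutoEncoder(nn.Module):
    def __init__(self, d_model: int, d_latent: int):
        super().__init__()
        self.enc_truthful = nn.Linear(d_model, d_latent)  # h -> truthful latent
        self.enc_semantic = nn.Linear(d_model, d_latent)  # h -> semantic latent
        self.dec = nn.Linear(2 * d_latent, d_model)       # latents -> reconstructed h

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        z_t = self.enc_truthful(h)
        z_s = self.enc_semantic(h)
        return self.dec(torch.cat([z_t, z_s], dim=-1))

    @torch.no_grad()
    def edit(self, h: torch.Tensor, delta: torch.Tensor, alpha: float = 1.0) -> torch.Tensor:
        """Shift the truthful latent along `delta`; keep the semantic latent fixed."""
        z_t = self.enc_truthful(h) + alpha * delta
        z_s = self.enc_semantic(h)
        return self.dec(torch.cat([z_t, z_s], dim=-1))


def editing_direction(ae: TruthfulAutoEncoder,
                      h_truthful: torch.Tensor,
                      h_hallucinated: torch.Tensor) -> torch.Tensor:
    """One plausible (assumed) recipe: the mean difference between truthful-latent
    codes of truthful vs. hallucinatory representations."""
    with torch.no_grad():
        return ae.enc_truthful(h_truthful).mean(0) - ae.enc_truthful(h_hallucinated).mean(0)
```

In the paper the auto-encoder is trained with contrastive objectives over paired truthful and hallucinatory samples and the edits are applied inside multiple layers; the sketch only conveys the shape of the idea.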

Experimental Validation

Extensive experiments demonstrate that TruthX significantly improves the truthfulness of responses from various LLMs. On the TruthfulQA benchmark, TruthX exhibited an average enhancement of 20% in truthfulness across thirteen advanced LLMs. Additionally, analyses indicate that TruthX preserves the generative capabilities of LLMs, addressing concerns that enhancing truthfulness might lead to diminished linguistic fluency or relevance.

Comparative Advantages and Innovations

Compared to existing truthfulness-enhancement techniques, such as contrastive decoding and representation editing, TruthX stands out by:

  • Offering a holistic approach that modifies both attention and feed-forward neural network modules within LLMs (a sketch of wiring such per-module edits appears after this list).
  • Introducing a novel concept of "truthful space," distinctively separated from semantic considerations, to focus purely on truthfulness editing.
  • Demonstrating superior performance in truthfulness enhancement without negatively impacting the LLM's ability to generate coherent and contextually appropriate responses.
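As a rough illustration of per-module editing, the sketch below registers PyTorch forward hooks on the attention and feed-forward sub-modules of each decoder layer, reusing the TruthfulAutoEncoder sketch above. The Llama-style attribute names (model.model.layers, self_attn, mlp) and the attach_truthx_hooks helper are assumptions for illustration; the actual TruthX integration may differ.

```python
# Hedged sketch: apply a truthful-space edit to the outputs of both the
# attention and FFN sub-modules of every decoder layer via forward hooks.
import torch


def make_edit_hook(autoencoder, delta, alpha=1.0):
    def hook(module, inputs, output):
        # Some sub-modules return a tuple (hidden_states, ...); edit only the first item.
        if isinstance(output, tuple):
            return (autoencoder.edit(output[0], delta, alpha),) + output[1:]
        return autoencoder.edit(output, delta, alpha)
    return hook


def attach_truthx_hooks(model, autoencoders, deltas, alpha=1.0):
    """Register one hook per attention and FFN module; returns handles so the
    hooks can later be removed with handle.remove()."""
    handles = []
    for layer, ae, delta in zip(model.model.layers, autoencoders, deltas):
        handles.append(layer.self_attn.register_forward_hook(make_edit_hook(ae, delta, alpha)))
        handles.append(layer.mlp.register_forward_hook(make_edit_hook(ae, delta, alpha)))
    return handles
```

Using hooks keeps the base model weights untouched, which matches the inference-time, representation-editing character of the method.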

Implications and Future Directions

The implications of TruthX are multifaceted, extending beyond improving the reliability of LLM outputs to contributing foundational insights into the workings and optimizations of LLMs. The concept of editing in a domain-specific latent space opens new avenues for AI research, particularly in areas where accuracy and factuality are paramount.

Furthermore, the cross-LLM generalizability of TruthX, especially among sequentially-trained models, demonstrates its broad applicability, potentially paving the way for universal truthfulness-enhancement solutions adaptable across different architectures and applications. Future work will further explore integrating external knowledge sources with internal representation editing to amplify LLM reliability and usefulness across even more diverse scenarios.

In conclusion, TruthX represents a significant step forward in refining the truthfulness of LLM outputs, ensuring that these models not only generate human-like text but also adhere closely to factual accuracy. This advancement holds promise for a wide range of applications, from enhancing information veracity in real-time interactions to improving the quality of generated content across digital platforms.
