
Discovering Latent Knowledge in Language Models Without Supervision

(2212.03827)
Published Dec 7, 2022 in cs.CL, cs.AI, and cs.LG

Abstract

Existing techniques for training language models can be misaligned with the truth: if we train models with imitation learning, they may reproduce errors that humans make; if we train them to generate text that humans rate highly, they may output errors that human evaluators can't detect. We propose circumventing this issue by directly finding latent knowledge inside the internal activations of a language model in a purely unsupervised way. Specifically, we introduce a method for accurately answering yes-no questions given only unlabeled model activations. It works by finding a direction in activation space that satisfies logical consistency properties, such as that a statement and its negation have opposite truth values. We show that despite using no supervision and no model outputs, our method can recover diverse knowledge represented in LLMs: across 6 models and 10 question-answering datasets, it outperforms zero-shot accuracy by 4% on average. We also find that it cuts prompt sensitivity in half and continues to maintain high accuracy even when models are prompted to generate incorrect answers. Our results provide an initial step toward discovering what language models know, distinct from what they say, even when we don't have access to explicit ground truth labels.

Figure: the Contrast-Consistent Search (CCS) method, which determines the truth of yes-no questions via probability mapping.

Overview

  • The paper introduces Contrast-Consistent Search (CCS), a method for discovering latent knowledge in the internal activations of pre-trained language models, enabling accurate answers to yes-no questions without relying on model outputs or human supervision.

  • CCS constructs contrast pairs, normalizes their hidden activations, maps them to probabilities, and trains this mapping with an unsupervised loss that encourages logical consistency and confident predictions.

  • Experimentally, CCS outperforms zero-shot accuracy across several language models and datasets, remains stable under misleading prompts, and transfers well across tasks, suggesting it recovers a task-agnostic representation of truth that could be useful wherever accuracy and truthfulness of model outputs matter.

Discovering Latent Knowledge in Language Models

Introduction

Language models, like GPT-3 and BERT, are widely used in applications such as chatbots, machine translation, and sentiment analysis. However, these models can generate text that is false, a problem that arises from a misalignment between their training objectives and the truth. For instance, a model trained to imitate human text may reproduce common misconceptions, and a model trained to produce text that human evaluators rate highly may generate compelling but false information.

The paper proposes a method for addressing this problem by directly finding latent knowledge within the internal activations of a language model, rather than relying on model outputs or human supervision. This approach, called Contrast-Consistent Search (CCS), answers yes-no questions by identifying a direction in activation space that satisfies logical consistency properties.

Problem Statement and Framework

Discovering Latent Knowledge

The central problem tackled by the paper is to answer yes-no questions using only the internal hidden representations (activations) of a pre-trained language model, without relying on model outputs or external supervision. The goal is to determine whether the internal activations of these models contain usable knowledge about the truth of certain statements.

Method: Contrast-Consistent Search (CCS)

The CCS method works by learning a linear probe on the model's activations, i.e., a direction in activation space whose projection reflects whether a statement is true. Here's how it works:

  1. Construct Contrast Pairs: Turn each yes-no question into two candidate statements, one asserting the answer "Yes" and the other asserting "No."
  2. Feature Extraction and Normalization: Extract the hidden activations for both statements and normalize each half of the pair separately, so the probe cannot simply detect which answer word was appended.
  3. Mapping Activations to Probabilities: Learn a mapping (a linear probe followed by a sigmoid) that transforms these activations into probabilities that each statement is true.
  4. Optimization: Train this mapping with an unsupervised loss that encourages logical consistency (the probabilities for a statement and its negation should sum to 1) and confidence (ruling out the degenerate solution of assigning 0.5 to everything).
  5. Inference: Answer a new question by averaging the probability that the "Yes" statement is true with the probability that the "No" statement is false, and predicting "Yes" if the average exceeds 0.5 (see the code sketch after this list).
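
To make these steps concrete, here is a minimal PyTorch sketch. It is an illustration under stated assumptions, not the authors' released implementation: the activations are random stand-ins for hidden states extracted from a real model, and the probe dimension, optimizer, and training schedule are placeholder choices. In the paper, CCS is additionally trained with multiple random restarts, keeping the run with the lowest unsupervised loss.

```python
# Minimal CCS sketch (PyTorch), with hypothetical names and toy data.
import torch
import torch.nn as nn


def normalize(x):
    # Step 2: normalize each set of activations (the "Yes" set and the "No"
    # set separately) so the probe cannot just read off which answer word
    # was appended.
    return (x - x.mean(dim=0, keepdim=True)) / (x.std(dim=0, keepdim=True) + 1e-8)


class CCSProbe(nn.Module):
    # Step 3: a linear direction plus sigmoid maps an activation vector to a
    # probability that the corresponding statement is true.
    def __init__(self, dim):
        super().__init__()
        self.linear = nn.Linear(dim, 1)

    def forward(self, x):
        return torch.sigmoid(self.linear(x)).squeeze(-1)


def ccs_loss(p_pos, p_neg):
    # Step 4: unsupervised objective.
    # Consistency: a statement and its negation should have probabilities
    # summing to 1, i.e. p_pos should match 1 - p_neg.
    consistency = ((p_pos - (1 - p_neg)) ** 2).mean()
    # Confidence: penalize the degenerate solution p_pos = p_neg = 0.5.
    confidence = (torch.min(p_pos, p_neg) ** 2).mean()
    return consistency + confidence


# Toy stand-ins for the outputs of steps 1-2: 512 contrast pairs of 768-dim
# activations. In practice these would be hidden states extracted from the
# "question + Yes" and "question + No" versions of each prompt.
torch.manual_seed(0)
x_pos = normalize(torch.randn(512, 768))
x_neg = normalize(torch.randn(512, 768))

probe = CCSProbe(dim=768)
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)
for _ in range(1000):
    optimizer.zero_grad()
    loss = ccs_loss(probe(x_pos), probe(x_neg))
    loss.backward()
    optimizer.step()

# Step 5: inference. Average the two views of "the statement is true" and
# threshold at 0.5. Since the method is unsupervised, the learned direction
# may be flipped (consistently labeling falsehoods as true), which has to be
# resolved separately, e.g. by inspecting a few examples.
with torch.no_grad():
    p_true = 0.5 * (probe(x_pos) + (1 - probe(x_neg)))
    predictions = (p_true > 0.5).long()
```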

Experimental Results

CCS was evaluated on six language models (including T5, UnifiedQA, and GPT-J) using 10 diverse datasets covering tasks such as sentiment classification and natural language inference. Notable results and findings:

  • Performance: CCS outperformed the zero-shot accuracy of these models by an average of 4%. For instance, on UnifiedQA, the average zero-shot accuracy was 80.4%, while CCS achieved 82.1%.
  • Robustness: CCS was less sensitive to different prompts and maintained high accuracy even when the models were deliberately misled with incorrect prompt prefixes. For example, in UnifiedQA, despite a 9.5% drop in zero-shot accuracy due to misleading prompts, CCS's accuracy remained stable.
  • Transferability: A CCS probe trained on one task's contrast pairs continued to perform well when applied to other tasks, suggesting that CCS may be discovering a task-agnostic representation of truth within the models (a transfer check is sketched below).
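
Continuing the sketch above, the following is a hedged illustration of such a transfer evaluation. The task-B activations and labels here are placeholders; in practice they would come from a different dataset's contrast pairs and gold answers.

```python
# Reuses `probe` and `normalize` from the sketch above, which were trained
# only on "task A" contrast pairs. The task-B tensors are random placeholders.
x_pos_b = normalize(torch.randn(256, 768))
x_neg_b = normalize(torch.randn(256, 768))
labels_b = torch.randint(0, 2, (256,))  # hypothetical gold answers for task B

with torch.no_grad():
    p_true_b = 0.5 * (probe(x_pos_b) + (1 - probe(x_neg_b)))
    preds_b = (p_true_b > 0.5).long()
    accuracy = (preds_b == labels_b).float().mean().item()
    # CCS cannot distinguish the "true" direction from the "false" one, so
    # evaluation reports the better of the two possible labelings.
    accuracy = max(accuracy, 1.0 - accuracy)
    print(f"transfer accuracy on task B: {accuracy:.3f}")
```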

Implications

Practical Implications

With the ability to uncover latent truth in language models, CCS could be crucial for applications where the accuracy and truthfulness of model outputs are paramount. This includes areas like medical diagnosis, legal document analysis, and automated fact-checking, where incorrect information can have significant consequences.

Theoretical Implications

The results suggest that pretrained language models might inherently develop internal representations that align with truth, even though their outputs might sometimes be false. This could open new research avenues in understanding how these internal representations develop and how they can be leveraged for various tasks.

Future Directions

The framework presented can be expanded and refined in numerous ways:

  1. Additional Consistency Constraints: Incorporating more sophisticated logical constraints to further improve the method's ability to detect truth.
  2. Generalization Beyond Yes-No Questions: Extending the method to handle more complex types of questions and statements.
  3. Calibration and Robustness: Enhancing the calibration of the probabilistic outputs and further improving robustness against adversarial prompts.

Conclusion

Contrast-Consistent Search (CCS) provides an innovative way to discover latent knowledge within language models without relying on model outputs or external supervision. The strong empirical results demonstrate that this method can exceed zero-shot performance, maintain robustness against misleading information, and find task-agnostic truth representations in model activations. This approach opens the door to new ways of ensuring the reliability and truthfulness of AI systems without constant human oversight.
