
Discovering Latent Knowledge in Language Models Without Supervision

(2212.03827)
Published Dec 7, 2022 in cs.CL, cs.AI, and cs.LG

Abstract

Existing techniques for training language models can be misaligned with the truth: if we train models with imitation learning, they may reproduce errors that humans make; if we train them to generate text that humans rate highly, they may output errors that human evaluators can't detect. We propose circumventing this issue by directly finding latent knowledge inside the internal activations of a language model in a purely unsupervised way. Specifically, we introduce a method for accurately answering yes-no questions given only unlabeled model activations. It works by finding a direction in activation space that satisfies logical consistency properties, such as that a statement and its negation have opposite truth values. We show that despite using no supervision and no model outputs, our method can recover diverse knowledge represented in LLMs: across 6 models and 10 question-answering datasets, it outperforms zero-shot accuracy by 4% on average. We also find that it cuts prompt sensitivity in half and continues to maintain high accuracy even when models are prompted to generate incorrect answers. Our results provide an initial step toward discovering what language models know, distinct from what they say, even when we don't have access to explicit ground truth labels.

Figure: the Contrast-Consistent Search (CCS) method, which determines the truth of yes-no questions via probability mapping.

Overview

  • The paper introduces Contrast-Consistent Search (CCS), a method for discovering latent knowledge in the internal activations of pre-trained language models, enabling accurate answers to yes-no questions without relying on model outputs or human supervision.

  • CCS constructs contrast pairs, normalizes their hidden activations, maps them to probabilities, and trains this mapping with an unsupervised loss that encourages logical consistency and confident predictions.

  • Experimentally, CCS outperforms zero-shot accuracy across several language models and datasets, remains stable under misleading prompts, and transfers well across tasks, suggesting it recovers a task-agnostic representation of truth that could be useful wherever accuracy and truthfulness of model outputs matter.

Discovering Latent Knowledge in Language Models

Introduction

Language models, like GPT-3 and BERT, are widely used in applications such as chatbots, machine translation, and sentiment analysis. However, these models can generate text that is false, a problem that arises from a misalignment between their training objectives and the truth. For instance, a model trained to imitate human text may reproduce common misconceptions, and a model trained to produce text that human evaluators rate highly may generate compelling but false information.

The paper proposes a method for addressing this problem by directly finding latent knowledge within the internal activations of a language model, rather than relying on model outputs or human supervision. This approach, called Contrast-Consistent Search (CCS), answers yes-no questions by identifying a direction in activation space that satisfies logical consistency properties.

Problem Statement and Framework

Discovering Latent Knowledge

The central problem tackled by the paper is to answer yes-no questions using only the internal hidden representations (activations) of a pre-trained language model, without relying on model outputs or external supervision. The goal is to determine whether the internal activations of these models contain usable knowledge about the truth of certain statements.

Method: Contrast-Consistent Search (CCS)

The CCS method works by learning a linear probe on the model's activations, i.e., a direction in activation space whose projection reflects whether a statement is true. Here's how it works:

  1. Construct Contrast Pairs: Turn each yes-no question into two candidate statements, one asserting the answer "Yes" and the other asserting "No."
  2. Feature Extraction and Normalization: Extract the hidden activations for both statements and normalize each half of the pair separately, so the probe cannot simply detect which answer word was appended.
  3. Mapping Activations to Probabilities: Learn a mapping (a linear probe followed by a sigmoid) that transforms these activations into probabilities that each statement is true.
  4. Optimization: Train this mapping with an unsupervised loss that encourages logical consistency (the probabilities for a statement and its negation should sum to 1) and confidence (ruling out the degenerate solution of assigning 0.5 to everything).
  5. Inference: Answer a new question by averaging the probability that the "Yes" statement is true with the probability that the "No" statement is false, and predicting "Yes" if the average exceeds 0.5 (see the code sketch after this list).
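
To make these steps concrete, here is a minimal PyTorch sketch. It is an illustration under stated assumptions, not the authors' released implementation: the activations are random stand-ins for hidden states extracted from a real model, and the probe dimension, optimizer, and training schedule are placeholder choices. In the paper, CCS is additionally trained with multiple random restarts, keeping the run with the lowest unsupervised loss.

```python
# Minimal CCS sketch (PyTorch), with hypothetical names and toy data.
import torch
import torch.nn as nn


def normalize(x):
    # Step 2: normalize each set of activations (the "Yes" set and the "No"
    # set separately) so the probe cannot just read off which answer word
    # was appended.
    return (x - x.mean(dim=0, keepdim=True)) / (x.std(dim=0, keepdim=True) + 1e-8)


class CCSProbe(nn.Module):
    # Step 3: a linear direction plus sigmoid maps an activation vector to a
    # probability that the corresponding statement is true.
    def __init__(self, dim):
        super().__init__()
        self.linear = nn.Linear(dim, 1)

    def forward(self, x):
        return torch.sigmoid(self.linear(x)).squeeze(-1)


def ccs_loss(p_pos, p_neg):
    # Step 4: unsupervised objective.
    # Consistency: a statement and its negation should have probabilities
    # summing to 1, i.e. p_pos should match 1 - p_neg.
    consistency = ((p_pos - (1 - p_neg)) ** 2).mean()
    # Confidence: penalize the degenerate solution p_pos = p_neg = 0.5.
    confidence = (torch.min(p_pos, p_neg) ** 2).mean()
    return consistency + confidence


# Toy stand-ins for the outputs of steps 1-2: 512 contrast pairs of 768-dim
# activations. In practice these would be hidden states extracted from the
# "question + Yes" and "question + No" versions of each prompt.
torch.manual_seed(0)
x_pos = normalize(torch.randn(512, 768))
x_neg = normalize(torch.randn(512, 768))

probe = CCSProbe(dim=768)
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)
for _ in range(1000):
    optimizer.zero_grad()
    loss = ccs_loss(probe(x_pos), probe(x_neg))
    loss.backward()
    optimizer.step()

# Step 5: inference. Average the two views of "the statement is true" and
# threshold at 0.5. Since the method is unsupervised, the learned direction
# may be flipped (consistently labeling falsehoods as true), which has to be
# resolved separately, e.g. by inspecting a few examples.
with torch.no_grad():
    p_true = 0.5 * (probe(x_pos) + (1 - probe(x_neg)))
    predictions = (p_true > 0.5).long()
```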

Experimental Results

CCS was evaluated on six language models (including T5, UnifiedQA, and GPT-J) using 10 diverse datasets covering tasks such as sentiment classification and natural language inference. Notable results and findings:

  • Performance: CCS outperformed the zero-shot accuracy of these models by an average of 4%. For instance, on UnifiedQA, the average zero-shot accuracy was 80.4%, while CCS achieved 82.1%.
  • Robustness: CCS was less sensitive to different prompts and maintained high accuracy even when the models were deliberately misled with incorrect prompt prefixes. For example, in UnifiedQA, despite a 9.5% drop in zero-shot accuracy due to misleading prompts, CCS's accuracy remained stable.
  • Transferability: A CCS probe trained on one task's contrast pairs continued to perform well when applied to other tasks, suggesting that CCS may be discovering a task-agnostic representation of truth within the models (a transfer check is sketched below).
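
Continuing the sketch above, the following is a hedged illustration of such a transfer evaluation. The task-B activations and labels here are placeholders; in practice they would come from a different dataset's contrast pairs and gold answers.

```python
# Reuses `probe` and `normalize` from the sketch above, which were trained
# only on "task A" contrast pairs. The task-B tensors are random placeholders.
x_pos_b = normalize(torch.randn(256, 768))
x_neg_b = normalize(torch.randn(256, 768))
labels_b = torch.randint(0, 2, (256,))  # hypothetical gold answers for task B

with torch.no_grad():
    p_true_b = 0.5 * (probe(x_pos_b) + (1 - probe(x_neg_b)))
    preds_b = (p_true_b > 0.5).long()
    accuracy = (preds_b == labels_b).float().mean().item()
    # CCS cannot distinguish the "true" direction from the "false" one, so
    # evaluation reports the better of the two possible labelings.
    accuracy = max(accuracy, 1.0 - accuracy)
    print(f"transfer accuracy on task B: {accuracy:.3f}")
```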

Implications

Practical Implications

With the ability to uncover latent truth in language models, CCS could be crucial for applications where the accuracy and truthfulness of model outputs are paramount. This includes areas like medical diagnosis, legal document analysis, and automated fact-checking, where incorrect information can have significant consequences.

Theoretical Implications

The results suggest that pretrained language models might inherently develop internal representations that align with truth, even though their outputs might sometimes be false. This could open new research avenues in understanding how these internal representations develop and how they can be leveraged for various tasks.

Future Directions

The framework presented can be expanded and refined in numerous ways:

  1. Additional Consistency Constraints: Incorporating more sophisticated logical constraints to further improve the method's ability to detect truth.
  2. Generalization Beyond Yes-No Questions: Extending the method to handle more complex types of questions and statements.
  3. Calibration and Robustness: Enhancing the calibration of the probabilistic outputs and further improving robustness against adversarial prompts.

Conclusion

Contrast-Consistent Search (CCS) provides an innovative way to discover latent knowledge within language models without relying on model outputs or external supervision. The strong empirical results demonstrate that this method can exceed zero-shot performance, maintain robustness against misleading information, and find task-agnostic truth representations in model activations. This approach opens the door to new ways of ensuring the reliability and truthfulness of AI systems without constant human oversight.
