Disentangling Dense Embeddings with Sparse Autoencoders

(arXiv:2408.00657)
Published Aug 1, 2024 in cs.LG

Abstract

Sparse autoencoders (SAEs) have shown promise in extracting interpretable features from complex neural networks. We present one of the first applications of SAEs to dense text embeddings from LLMs, demonstrating their effectiveness in disentangling semantic concepts. By training SAEs on embeddings of over 420,000 scientific paper abstracts from computer science and astronomy, we show that the resulting sparse representations maintain semantic fidelity while offering interpretability. We analyse these learned features, exploring their behaviour across different model capacities and introducing a novel method for identifying "feature families" that represent related concepts at varying levels of abstraction. To demonstrate the practical utility of our approach, we show how these interpretable features can be used to precisely steer semantic search, allowing for fine-grained control over query semantics. This work bridges the gap between the semantic richness of dense embeddings and the interpretability of sparse representations. We open source our embeddings, trained sparse autoencoders, and interpreted features, as well as a web app for exploring them.

Figure: Sparse autoencoder training and feature-labelling process, minimising reconstruction loss on paper abstracts.

Overview

  • The paper explores the use of Sparse Autoencoders (SAEs) to make dense text embeddings from language models interpretable while maintaining semantic fidelity.

  • It introduces a novel "feature families" method for clustering related features, and demonstrates the practical utility of SAEs in semantic search, providing fine-grained control over query semantics.

  • The research includes a comprehensive feature analysis and power-law scaling observations, and offers open-source tools for further exploration.

The paper "Disentangling Dense Embeddings with Sparse Autoencoders" explores the application of Sparse Autoencoders (SAEs) to dense text embeddings derived from LLMs. This work primarily addresses the interpretation challenges associated with dense embeddings by introducing sparse representations that maintain semantic fidelity while being interpretable.

Key Contributions and Methodology

The research focuses on several key contributions:

  1. Training SAEs for Dense Text Embeddings: The authors trained SAEs on embeddings of over 420,000 abstracts from computer science and astronomy papers, generated with OpenAI's text-embedding-3-small model (a sketch of this embedding step follows the list). By varying the number of active latents (k) and total latents (n), they demonstrated that SAEs can extract interpretable features from dense embeddings.
  2. Comprehensive Feature Analysis: A thorough evaluation of the learned features was performed to assess their interpretability, behavior across different model capacities, and semantic properties. This includes both qualitative and quantitative analyses.
  3. Introduction of Feature Families: A novel method was introduced to identify "feature families"—clusters of related features at varying abstraction levels. This clustering enables multi-scale semantic analysis and manipulation of the embeddings.
  4. Practical Utility in Semantic Search: Demonstrating the practical application of SAEs, the paper shows how these interpretable features can be used to steer semantic search, offering fine-grained control over query semantics. The developed system and its outputs were made publicly available.
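
To make the pipeline concrete, below is a minimal sketch of the embedding step using the OpenAI Python client. The batching and the embed_abstracts helper are illustrative assumptions, not the authors' exact code.

```python
# Sketch of generating dense embeddings with text-embedding-3-small.
# Assumes the OpenAI Python client (v1+) and an OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

def embed_abstracts(abstracts: list[str]) -> list[list[float]]:
    """Embed a batch of abstracts; text-embedding-3-small returns 1536-d vectors."""
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=abstracts,
    )
    return [item.embedding for item in response.data]

vectors = embed_abstracts([
    "We train sparse autoencoders on dense text embeddings ...",
    "We measure galactic rotation curves using ...",
])
```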

Training and Structure of SAEs

The authors varied several hyperparameters when training their SAEs: the number of active latents (k), the total number of latents (n), and auxiliary losses to revive dead features. The study focused on three configurations: SAE16 (k=16, n=2·d_input=3072), SAE32 (k=32, n=4·d_input=6144), and SAE64 (k=64, n=6·d_input=9216), where d_input=1536 is the dimensionality of text-embedding-3-small.
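
Below is a minimal PyTorch sketch of a top-k SAE consistent with the k/n setup above. The auxiliary dead-feature loss and decoder-bias details are omitted, and all names are illustrative rather than the authors' implementation.

```python
# Top-k sparse autoencoder: keep only the k largest latent activations.
import torch
import torch.nn as nn

class TopKSAE(nn.Module):
    def __init__(self, d_input: int = 1536, n_latents: int = 3072, k: int = 16):
        super().__init__()
        self.k = k
        self.encoder = nn.Linear(d_input, n_latents)
        self.decoder = nn.Linear(n_latents, d_input)

    def forward(self, x: torch.Tensor):
        z = torch.relu(self.encoder(x))          # one common activation choice
        topk = torch.topk(z, self.k, dim=-1)     # k active latents per example
        sparse = torch.zeros_like(z).scatter_(-1, topk.indices, topk.values)
        return self.decoder(sparse), sparse

# One training step, minimising reconstruction MSE (SAE16 configuration).
sae = TopKSAE(d_input=1536, n_latents=3072, k=16)
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
x = torch.randn(8, 1536)                         # stand-in for an embedding batch
x_hat, _ = sae(x)
loss = nn.functional.mse_loss(x_hat, x)
opt.zero_grad(); loss.backward(); opt.step()
```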

Key observations included:

  • SAE performance follows power laws in n, k, and the compute used for training; notably, the normalised mean squared error (MSE) scales as a precise power law with these parameters (see the fitting sketch after this list).
  • Features are highly interpretable, as demonstrated by strong Pearson correlations between predicted and actual feature activations. Correlations were slightly higher for smaller models, whose coarser-grained features are easier to interpret.
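
As a rough illustration of these two measurements, the sketch below computes a normalised MSE (reconstruction error relative to a predict-the-mean baseline, a common convention and an assumption here) and fits a power-law exponent in log-log space. The numeric values are made up for illustration, not results from the paper.

```python
import numpy as np

def normalised_mse(x: np.ndarray, x_hat: np.ndarray) -> float:
    """MSE of the reconstruction, normalised by a predict-the-mean baseline."""
    residual = np.mean((x - x_hat) ** 2)
    baseline = np.mean((x - x.mean(axis=0)) ** 2)
    return residual / baseline

# Fit NMSE ~ c * n^alpha as a straight line in log-log space (illustrative data).
n_latents = np.array([3072, 6144, 9216])
nmse = np.array([0.20, 0.14, 0.11])
alpha, log_c = np.polyfit(np.log(n_latents), np.log(nmse), deg=1)
print(f"fitted power-law exponent alpha ≈ {alpha:.2f}")
```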

Feature Families and Their Analysis

Feature families are hierarchical clusters of features sharing a common, broader semantic theme. The identification process (sketched in code after this list) involved:

  1. Constructing co-occurrence and activation similarity matrices.
  2. Building a maximum spanning tree (MST) to capture the strongest relationships, subsequently converting this tree into a directed graph for hierarchical analysis.
  3. Iteratively identifying feature families by removing parent features after each iteration to reveal overlapping and finer-grained families.
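
A compact sketch of this procedure with networkx is below. The parent-selection rule (highest degree within a component) and the fixed iteration count are assumptions standing in for the paper's directed-graph analysis.

```python
import networkx as nx
import numpy as np

def feature_families(similarity: np.ndarray, n_iters: int = 3):
    """Cluster features into families via a maximum spanning tree (MST)."""
    sim = similarity.copy()
    np.fill_diagonal(sim, 0.0)                 # drop self-similarity
    mst = nx.maximum_spanning_tree(nx.from_numpy_array(sim))
    families, active = [], set(mst.nodes)
    for _ in range(n_iters):
        sub = mst.subgraph(active).copy()
        parents = set()
        for component in nx.connected_components(sub):
            if len(component) < 2:
                continue
            # Treat the best-connected feature as the family's parent.
            parent = max(component, key=lambda f: sub.degree[f])
            families.append((parent, sorted(component)))
            parents.add(parent)
        active -= parents                      # remove parents, then re-cluster
    return families
```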

Practical Applications in Semantic Search

The utility of these interpretable features was showcased in a semantic search system. By modifying the query embedding based on the extracted features, the system allows for precise semantic search adjustments. This capability was compared against traditional query rewriting methods, revealing that SAE-based interventions can achieve higher intervention accuracy while maintaining high query fidelity.
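
A minimal sketch of such an intervention is shown below: add a scaled decoder direction for a chosen feature to the query embedding, renormalise, and re-rank by cosine similarity. The strength parameter and the renormalisation step are assumptions, not the paper's exact settings.

```python
import numpy as np

def steer_query(query_emb: np.ndarray, decoder: np.ndarray,
                feature_id: int, strength: float = 4.0) -> np.ndarray:
    """Push a unit-norm query embedding toward one SAE feature's decoder direction."""
    direction = decoder[feature_id]            # decoder: (n_latents, d_input)
    direction = direction / np.linalg.norm(direction)
    steered = query_emb + strength * direction
    return steered / np.linalg.norm(steered)   # back onto the unit sphere

def search(query_emb: np.ndarray, corpus: np.ndarray, top_k: int = 5) -> np.ndarray:
    """Rank unit-norm corpus embeddings by cosine similarity to the query."""
    return np.argsort(-(corpus @ query_emb))[:top_k]
```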

Implications and Future Directions

This research has implications for both practical applications and theoretical developments in AI:

  • Practical Implications: The improved interpretability of dense embeddings can enhance tasks like text classification, machine translation, and semantic search by providing fine-grained control and better explainability.
  • Theoretical Implications: This work bridges the gap between dense and sparse representations, offering insights into how semantic information can be disentangled and interpreted more effectively.

Future research directions may focus on scaling this approach to larger and more diverse datasets, extending the analysis to general-purpose text embeddings, and further improving the automated interpretability techniques. Additionally, conducting evaluations on standard benchmarks for semantic embedding could provide a more comprehensive understanding of the reconstruction capabilities of SAEs.

Conclusion

This paper represents a significant step in making dense text embeddings more interpretable and controllable using sparse autoencoders. By successfully demonstrating the utility of SAEs in extracting and manipulating semantic features, the authors contribute valuable insights into the evolving landscape of natural language processing and representation learning. The open-sourcing of their systems and tools provides an excellent resource for further exploration and development in this field.
