An Adversarial Example for Direct Logit Attribution: Memory Management in GELU-4L (2310.07325v4)

Published 11 Oct 2023 in cs.LG and cs.AI

Abstract: Prior work suggests that LLMs manage the limited bandwidth of the residual stream through a "memory management" mechanism, where certain attention heads and MLP layers clear residual stream directions set by earlier layers. Our study provides concrete evidence for this erasure phenomenon in a 4-layer transformer, identifying heads that consistently remove the output of earlier heads. We further demonstrate that direct logit attribution (DLA), a common technique for interpreting the output of intermediate transformer layers, can give misleading results because it does not account for erasure.
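
To make the failure mode concrete, the following is a minimal toy sketch, not the paper's experimental setup. It assumes DLA amounts to projecting a single component's residual-stream output through the unembedding matrix, ignores the final LayerNorm, and uses random stand-in weights. The names `W_U`, `writer_out`, `eraser_out`, and `other_out` are hypothetical, and the eraser is idealized as the exact negative of the writer.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_vocab = 8, 5

# Hypothetical unembedding matrix mapping the residual stream to logits.
W_U = rng.normal(size=(d_model, d_vocab))

# Toy component outputs written into the residual stream.
writer_out = rng.normal(size=d_model)   # an early head's output
eraser_out = -writer_out                # idealized later head that fully erases it
other_out = rng.normal(size=d_model)    # everything else in the model


def dla(component_out, unembed):
    """Direct logit attribution for one component, ignoring LayerNorm."""
    return component_out @ unembed


dla_writer = dla(writer_out, W_U)
dla_eraser = dla(eraser_out, W_U)

# The final logits come from the sum of all residual-stream contributions.
final_logits = dla(writer_out + eraser_out + other_out, W_U)

print("DLA of writer head:", np.round(dla_writer, 3))
print("DLA of eraser head:", np.round(dla_eraser, 3))
print("Net writer + eraser:", np.round(dla_writer + dla_eraser, 3))
```

In this toy setting, each head shows a large DLA score on its own, yet the two cancel exactly and the final logits depend only on `other_out`. Reading either head's DLA in isolation would therefore overstate its effect on the prediction.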
