Locating and Editing Factual Associations in GPT (2202.05262v5)
Abstract: We analyze the storage and recall of factual associations in autoregressive transformer LLMs, finding evidence that these associations correspond to localized, directly-editable computations. We first develop a causal intervention for identifying neuron activations that are decisive in a model's factual predictions. This reveals a distinct set of steps in middle-layer feed-forward modules that mediate factual predictions while processing subject tokens. To test our hypothesis that these computations correspond to factual association recall, we modify feed-forward weights to update specific factual associations using Rank-One Model Editing (ROME). We find that ROME is effective on a standard zero-shot relation extraction (zsRE) model-editing task, comparable to existing methods. To perform a more sensitive evaluation, we also evaluate ROME on a new dataset of counterfactual assertions, on which it simultaneously maintains both specificity and generalization, whereas other methods sacrifice one or another. Our results confirm an important role for mid-layer feed-forward modules in storing factual associations and suggest that direct manipulation of computational mechanisms may be a feasible approach for model editing. The code, dataset, visualizations, and an interactive demo notebook are available at https://rome.baulab.info/
Explain it Like I'm 14
Explaining “Locating and Editing Factual Associations in GPT”
Overview
This paper asks a simple question: where does an LLM like GPT keep its facts (like “The Space Needle is in Seattle”), and can we change them directly without retraining the whole model? The authors show that certain “middle parts” of GPT act like small memory boxes that store facts, and they introduce a method called ROME to safely edit those facts.
Key Questions
The paper focuses on three easy-to-understand questions:
- Where inside GPT are facts stored?
- Which parts of GPT are most important when the model remembers a fact?
- Can we directly and precisely edit a specific fact (for example, changing a wrong fact to the right one) without breaking other knowledge?
How They Did It (Methods in Everyday Terms)
Think of GPT as a very long recipe with many steps (called layers). Each step mixes information using two main tools:
- Attention: looks back at earlier words to copy or combine information.
- MLP (a small feed-forward network): a local calculator that changes the current hidden signal.
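Concretely (in roughly the paper's notation, with γ for layer normalization and σ for the MLP's nonlinearity), each layer adds an attention term and an MLP term to the running hidden state for every token. This is a simplified sketch of that decomposition, not a full architectural specification:

```latex
% Per-token, per-layer update in a GPT-style transformer (simplified):
h_i^{(l)} = h_i^{(l-1)} + a_i^{(l)} + m_i^{(l)},
\qquad
m_i^{(l)} = W_{\mathrm{proj}}^{(l)}\,\sigma\!\left( W_{\mathrm{fc}}^{(l)}\,\gamma\!\left( h_i^{(l-1)} + a_i^{(l)} \right) \right)
```

The "drawer" picture used below lives inside the MLP term: the inner activation (the output of σ) acts as a key, and the second matrix W_proj maps it to a stored value.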
The authors use two main approaches, explained with simple analogies:
- Causal Tracing (finding which steps matter)
- Analogy: Imagine you’re baking a cake. You run through the recipe once normally, then again but you “corrupt” or scramble the ingredient related to the subject (like changing “Space Needle” to random noise).
- Next, you restore just one specific step’s internal signal to its clean (uncorrupted) value and see if the cake (the final answer) turns out right again.
- By repeating this for every layer and every word position, they discover which steps are “decisive” for getting the correct fact (a code sketch of this procedure appears after this list).
- Result: The most important steps for recalling facts are in middle layers, inside the MLP, especially when processing the last token of the subject (for “Space Needle,” that would be the final part of the name).
- ROME (Rank-One Model Editing: changing the stored fact)
- Analogy: Think of a shelf of labeled drawers (the MLP). Each drawer has a key (how the subject is represented inside the model) and a value (the facts tied to that subject).
- ROME treats one MLP layer like a simple key–value memory. It:
- Finds the “key” by reading the model’s internal representation at the last token of the subject across different short contexts (to make it robust).
- Figures out the “value” (a vector) that makes the model prefer the new object you want (e.g., if you want to say “Space Needle → Paris,” it chooses a value that leads the model to answer “Paris”).
- Applies a tiny, targeted weight change (a “rank-one” update, like adding one sticky note to a single drawer) that inserts the new key–value pair with minimal disruption to the other drawers (a small numerical sketch of this update also appears after this list).
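To make the causal-tracing idea concrete, here is a minimal sketch using GPT-2 from the Hugging Face transformers library. This is not the authors' released code; the prompt, the subject token positions, and the noise scale are illustrative assumptions.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2-xl")   # the paper studies GPT-2 XL and GPT-J
model = GPT2LMHeadModel.from_pretrained("gpt2-xl").eval()

prompt = "The Space Needle is located in the city of"
inputs = tok(prompt, return_tensors="pt")
subject_pos = [0, 1, 2]                          # positions of "The Space Needle" (assumed)
target_id = tok(" Seattle")["input_ids"][0]      # assumes " Seattle" is a single BPE token

# 1) Clean run: cache every block's output at every position, plus the answer probability.
clean_states = {}
def cache(layer):
    def hook(module, inp, out):
        clean_states[layer] = out[0].detach().clone()
    return hook

handles = [blk.register_forward_hook(cache(i)) for i, blk in enumerate(model.transformer.h)]
with torch.no_grad():
    clean = model(**inputs)
for h in handles:
    h.remove()
p_clean = torch.softmax(clean.logits[0, -1], -1)[target_id].item()

# 2) Corruption hook: add Gaussian noise to the subject tokens' input embeddings.
def corrupt(module, inp, out):
    out = out.clone()
    out[0, subject_pos] += 0.5 * torch.randn_like(out[0, subject_pos])  # noise scale is illustrative
    return out

# 3) Restoration hook: overwrite one block's output at one position with its clean value.
def restore(layer, pos):
    def hook(module, inp, out):
        hs = out[0].clone()
        hs[0, pos] = clean_states[layer][0, pos]
        return (hs,) + out[1:]
    return hook

# For each (layer, position): corrupt the subject, restore one hidden state, and see
# how much of p_clean comes back -- that recovery estimates the state's causal effect.
effects = {}
for layer in range(len(model.transformer.h)):
    for pos in range(inputs["input_ids"].shape[1]):
        hooks = [model.transformer.wte.register_forward_hook(corrupt),
                 model.transformer.h[layer].register_forward_hook(restore(layer, pos))]
        with torch.no_grad():
            corrupted = model(**inputs)
        for h in hooks:
            h.remove()
        effects[(layer, pos)] = torch.softmax(corrupted.logits[0, -1], -1)[target_id].item()
# The paper averages over many noise samples and also traces MLP and attention outputs
# separately; this single pass only shows the mechanics.
```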
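And here is a minimal numpy sketch of the rank-one "sticky note" itself, the closed-form update ROME applies to one MLP projection matrix. `W` is that layer's projection weight, `k_star` and `v_star` are the key and value described above, and `C` is a covariance statistic of keys estimated from a large text sample; all three are assumed to be precomputed, and the code mirrors the paper's constrained least-squares solution rather than reproducing the released implementation.

```python
import numpy as np

def rank_one_edit(W, C, k_star, v_star):
    """Closed-form ROME-style update: keep W's behavior on typical keys
    (summarized by C = K K^T) while forcing W_hat @ k_star == v_star exactly."""
    u = np.linalg.solve(C, k_star)           # C^{-1} k*, direction of minimal disturbance
    residual = v_star - W @ k_star           # gap between the current output and the new value
    return W + np.outer(residual, u) / (u @ k_star)   # rank-one correction

# Quick self-check with random shapes (d_out x d_in weight, d_in-dimensional key):
rng = np.random.default_rng(0)
d_in, d_out = 64, 32
W = rng.normal(size=(d_out, d_in))
K = rng.normal(size=(d_in, 1000))            # stand-in for keys sampled from running text
C = K @ K.T
k_star, v_star = rng.normal(size=d_in), rng.normal(size=d_out)
W_hat = rank_one_edit(W, C, k_star, v_star)
assert np.allclose(W_hat @ k_star, v_star)   # the new association is inserted exactly
```

In the actual method, k_star is the average MLP input at the last subject token over several random prefix texts, v_star is found by gradient descent so that the edited layer makes the model predict the new object, and C is estimated once per layer from a large sample of running text.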
Main Findings and Why They Matter
The authors test ROME on both a standard benchmark and a new, tougher dataset they created:
- Where facts live:
- Strong evidence shows that mid-layer MLPs (not just attention) are crucial for recalling facts about a subject, especially at the last subject token.
- Attention is more important just before the final prediction, often to copy or gather information, but the “fact recall” seems to happen in those middle MLPs.
- Editing facts works:
- On the zsRE benchmark (a common test for changing factual relations), ROME performs as well as or better than other methods.
- On CounterFact (a new dataset of hard, counterfactual edits), ROME stands out on three measures (a rough scoring sketch follows this list):
- Efficacy: it successfully changes the target fact.
- Generalization: the edit keeps working when the question is rephrased in different ways.
- Specificity: the edit doesn’t bleed into nearby facts about similar subjects, so related knowledge stays intact.
- Many other methods either over-specialize (the edit only works for one exact wording) or over-generalize (the change spills over and breaks unrelated facts). ROME maintains both generalization and specificity at once.
- Human evaluation:
- Evaluators found ROME’s generated text more consistent with the edited fact than text from a strong fine-tuning baseline (FT+L).
- They also noted a small drop in fluency relative to that baseline: the text was sometimes slightly less smooth, but the facts were more reliably updated.
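Here is a rough sketch of how the three CounterFact-style scores can be computed for a single edit. The record fields and the `prob` helper (which should return the model's probability of a candidate object as the next word given a prompt) are hypothetical names for illustration, not the paper's evaluation code:

```python
def score_edit(model, record, prob):
    new, old = record["new_object"], record["old_object"]

    # Efficacy: on the edit prompt itself, the new object should now be preferred.
    efficacy = prob(model, record["prompt"], new) > prob(model, record["prompt"], old)

    # Generalization: the edit should survive rephrasings of the same statement.
    generalization = sum(
        prob(model, p, new) > prob(model, p, old) for p in record["paraphrases"]
    ) / len(record["paraphrases"])

    # Specificity: prompts about neighboring subjects (which truly have the old
    # object) should still prefer the old object, i.e. the edit should not bleed.
    specificity = sum(
        prob(model, p, old) > prob(model, p, new) for p in record["neighborhood"]
    ) / len(record["neighborhood"])

    return efficacy, generalization, specificity
```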
Implications and Impact
- Understanding models: The study gives a clearer picture of how GPT recalls facts—middle-layer MLPs play a key role, and the last token of the subject is a critical moment.
- Direct, precise edits: ROME shows you can surgically edit a single fact inside a giant model without retraining and without causing lots of unwanted side effects.
- Practical use: This can help quickly fix incorrect facts in a model, improve transparency, and reduce the cost and time of retraining.
- Limits and caution:
- ROME edits one fact at a time, and edits are directional: changing “The Space Needle is in Seattle” does not automatically update the reverse statement “Seattle is home to the Space Needle,” so each direction may need its own edit.
- The method is designed for understanding and small fixes, not full-scale training.
- Editing models must be used responsibly. It’s possible to insert misleading information, so LLMs should not be trusted as authoritative sources in critical settings.
Overall, the paper shows that facts in GPT aren’t scattered randomly—they’re stored in a structured way that we can find and carefully edit. This opens the door to safer, more interpretable, and more controllable LLMs.