Adaptive Semiparametric Language Models (2102.02557v1)
Abstract: We present a language model that combines a large parametric neural network (i.e., a transformer) with a non-parametric episodic memory component in an integrated architecture. Our model uses extended short-term context by caching local hidden states -- similar to Transformer-XL -- and global long-term memory by retrieving a set of nearest-neighbor tokens at each timestep. We design a gating function to adaptively combine multiple information sources to make a prediction. This mechanism allows the model to use either local context, short-term memory, or long-term memory (or any combination of them) on an ad hoc basis depending on the context. Experiments on word-based and character-based language modeling datasets demonstrate the efficacy of our proposed method compared to strong baselines.
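To make the gating idea concrete, below is a minimal PyTorch sketch of one way to fuse a long-term memory signal (built from retrieved nearest-neighbor tokens) with a local hidden state via a learned sigmoid gate. The class name `GatedMemoryLM`, the argument names, and the similarity-weighted embedding average are illustrative assumptions for exposition, not the authors' implementation.

```python
# Hypothetical sketch: a per-dimension sigmoid gate mixes a local hidden state
# (assumed to already attend over the cached short-term context) with a
# long-term memory vector built from retrieved nearest-neighbor tokens.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GatedMemoryLM(nn.Module):
    def __init__(self, vocab_size: int, d_model: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.gate = nn.Linear(d_model, d_model)    # per-dimension gate
        self.out = nn.Linear(d_model, vocab_size)  # output projection

    def forward(self, hidden: torch.Tensor, neighbor_ids: torch.Tensor,
                neighbor_scores: torch.Tensor) -> torch.Tensor:
        """
        hidden:          (batch, d_model) transformer state at timestep t.
        neighbor_ids:    (batch, k) token ids of retrieved nearest neighbors.
        neighbor_scores: (batch, k) similarity scores for those neighbors.
        Returns next-token logits of shape (batch, vocab_size).
        """
        # Long-term memory vector: similarity-weighted average of the
        # embeddings of the retrieved neighbor tokens.
        weights = F.softmax(neighbor_scores, dim=-1)              # (batch, k)
        mem = torch.einsum("bk,bkd->bd", weights,
                           self.embed(neighbor_ids))              # (batch, d_model)

        # Context-dependent gate: how much to rely on long-term memory
        # versus the local / short-term representation, per dimension.
        g = torch.sigmoid(self.gate(hidden))                      # (batch, d_model)
        fused = g * mem + (1.0 - g) * hidden

        return self.out(fused)


if __name__ == "__main__":
    torch.manual_seed(0)
    model = GatedMemoryLM(vocab_size=100, d_model=32)
    hidden = torch.randn(2, 32)                  # stand-in for transformer output
    neighbor_ids = torch.randint(0, 100, (2, 4))
    neighbor_scores = torch.randn(2, 4)
    logits = model(hidden, neighbor_ids, neighbor_scores)
    print(logits.shape)                          # torch.Size([2, 100])
```

Because the gate is a function of the current hidden state, the mixture can lean on retrieved long-term memory, the cached short-term context, or the local representation alone, depending on the input, which matches the adaptive behavior described in the abstract.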