
G-MAP: General Memory-Augmented Pre-trained Language Model for Domain Tasks

Published 7 Dec 2022 in cs.CL (arXiv:2212.03613v3)

Abstract: Recently, domain-specific PLMs have been proposed to boost the task performance of specific domains (e.g., biomedical and computer science) by continuing to pre-train general PLMs with domain-specific corpora. However, this Domain-Adaptive Pre-Training (DAPT; Gururangan et al. (2020)) tends to forget the previous general knowledge acquired by general PLMs, which leads to a catastrophic forgetting phenomenon and sub-optimal performance. To alleviate this problem, we propose a new framework of General Memory-Augmented Pre-trained Language Model (G-MAP), which augments the domain-specific PLM by a memory representation built from the frozen general PLM without losing any general knowledge. Specifically, we propose a new memory-augmented layer, and based on it, different augmented strategies are explored to build the memory representation and then adaptively fuse it into the domain-specific PLM. We demonstrate the effectiveness of G-MAP on various domains (biomedical and computer science publications, news, and reviews) and different kinds (text classification, QA, NER) of tasks, and the extensive results show that the proposed G-MAP can achieve SOTA results on all tasks.

Summary

  • The paper demonstrates that integrating frozen general PLM memory using a memory-attention mechanism effectively prevents catastrophic forgetting in domain-specific tasks.
  • It introduces multiple memory-transfer strategies, including single-layer, multiple-layer, gated, and chunk-based methods, with the chunk-based approach showing superior performance.
  • Experimental results across text classification, QA, and NER tasks confirm that G-MAP enhances domain adaptation while preserving general knowledge in NLP.

Summary of "G-MAP: General Memory-Augmented Pre-trained LLM for Domain Tasks"

Introduction and Motivation

The paper "G-MAP: General Memory-Augmented Pre-trained LLM for Domain Tasks" addresses the issue of catastrophic forgetting in domain-adaptive pre-training (DAPT) within pre-trained LLMs (PLMs). Conventional approaches to enhancing domain-specific performance, such as DAPT, involve further pre-training PLMs on domain-specific corpora. However, this often leads to the degradation of general knowledge, impairing performance on tasks that require such information. Figure 1

Figure 1

Figure 1

Figure 1: Masked LM (MLM) loss of RoBERTa on 50K randomly sampled documents from each domain before and after DAPT, illustrating catastrophic forgetting.
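The diagnostic behind Figure 1 is simply the masked-LM loss of a checkpoint measured on sampled documents. Below is a minimal sketch of how such a measurement could be reproduced, assuming the HuggingFace `transformers` library; `general_docs` and any DAPT checkpoint name are hypothetical placeholders, and this is not the authors' evaluation code.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM, DataCollatorForLanguageModeling

def masked_lm_loss(model_name, documents, device="cpu"):
    """Average masked-LM loss of `model_name` over a list of raw text documents."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForMaskedLM.from_pretrained(model_name).to(device).eval()
    # Randomly mask 15% of tokens, as in standard RoBERTa-style MLM evaluation.
    collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)

    total_loss, n_docs = 0.0, 0
    with torch.no_grad():
        for doc in documents:
            enc = tokenizer(doc, truncation=True, max_length=512, return_tensors="pt")
            # The collator masks tokens and builds the corresponding MLM labels.
            batch = collator([{k: v.squeeze(0) for k, v in enc.items()}])
            batch = {k: v.to(device) for k, v in batch.items()}
            total_loss += model(**batch).loss.item()
            n_docs += 1
    return total_loss / max(n_docs, 1)

# Comparing masked_lm_loss("roberta-base", general_docs) with the same measurement
# on a DAPT checkpoint (a hypothetical "my-dapt-checkpoint") gives a rough
# quantification of how much general knowledge was forgotten.
```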

To tackle this problem, the authors propose the General Memory-Augmented Pre-trained Language Model (G-MAP) framework, which augments domain-specific PLMs with memory representations derived from a frozen general PLM, thereby preserving general-domain knowledge. This memory-augmented model aims to enhance the generalization capabilities of domain-specific PLMs without any additional training of the general PLM.

Methodological Framework

G-MAP Architecture

The core concept of G-MAP revolves around integrating general knowledge from a frozen PLM into a domain-specific PLM through a novel memory-augmented layer. This integration involves a memory-attention mechanism that adaptively fuses general and domain-specific representations (Figure 2).

Figure 2: The G-MAP framework, shown with a computer-science-domain task input.
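As a rough illustration of such a layer, the following PyTorch sketch treats the frozen general PLM's output as a memory that the domain-specific PLM's hidden states attend over, with a learned gate controlling the fusion. The specific attention and gating design here is an illustrative assumption, not the paper's exact formulation.

```python
import torch
import torch.nn as nn

class MemoryAttentionLayer(nn.Module):
    """Sketch of a memory-augmented layer: domain hidden states attend over
    a memory built from the frozen general PLM, then fuse via a learned gate."""

    def __init__(self, hidden_size: int = 768, num_heads: int = 12):
        super().__init__()
        # Cross-attention: queries come from the domain PLM, keys/values from memory.
        self.cross_attn = nn.MultiheadAttention(hidden_size, num_heads, batch_first=True)
        self.gate = nn.Linear(2 * hidden_size, hidden_size)
        self.norm = nn.LayerNorm(hidden_size)

    def forward(self, domain_hidden: torch.Tensor, general_memory: torch.Tensor) -> torch.Tensor:
        # domain_hidden:  (batch, seq_len, hidden) from the trainable domain-specific PLM
        # general_memory: (batch, mem_len, hidden) from the frozen general PLM (no grad)
        attended, _ = self.cross_attn(domain_hidden, general_memory, general_memory)
        # Adaptive fusion: a sigmoid gate decides how much general knowledge to inject.
        g = torch.sigmoid(self.gate(torch.cat([domain_hidden, attended], dim=-1)))
        return self.norm(domain_hidden + g * attended)
```

In the full model, `general_memory` would be computed once per input by the frozen general PLM with gradients disabled, so only the domain-specific PLM and the small fusion module are trained.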

Memory-Augmented Strategies

The paper explores several strategies for constructing and transferring memory representations:

  1. Single-Layer Memory Transfer: This approach incorporates the last hidden state of the general PLM as a memory cache into a single layer of the domain-specific PLM.
  2. Multiple-Layer Memory Transfer: Here, all hidden states from the general PLM are transferred to corresponding layers in the domain-specific PLM.
  3. Gated Memory Transfer: Using a gating mechanism, layer-wise representations are adaptively combined and fused into one layer of the domain-specific PLM.
  4. Chunk-based Gated Memory Transfer: Motivated by layer-wise chunking observations, memory representations are divided into low-level and high-level chunks, which are transferred into the corresponding lower and upper layers of the domain-specific PLM (see the sketch after this list and Figure 3).

    Figure 3: Memory-augmented strategies of the G-MAP framework illustrating layer interactions.
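The chunk-based variant can be sketched as follows: the frozen PLM's layer-wise states are split into a low-level and a high-level chunk, each chunk is pooled with a learned gate, and the pooled memories feed memory-attention layers placed at a lower and an upper position of the domain-specific PLM. The split point, pooling scheme, and injection sites below are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class ChunkGatedMemory(nn.Module):
    """Sketch of chunk-based gated memory pooling over frozen layer outputs."""

    def __init__(self, hidden_size: int = 768, num_layers: int = 12, split: int = 6):
        super().__init__()
        assert 0 < split < num_layers
        self.split = split  # layers [0, split) form the low-level chunk, the rest the high-level chunk
        self.low_gate = nn.Linear(hidden_size, 1)
        self.high_gate = nn.Linear(hidden_size, 1)

    def _pool(self, states, gate):
        # states: list of (batch, seq, hidden) tensors from one chunk of frozen layers.
        stacked = torch.stack(list(states), dim=1)        # (batch, L, seq, hidden)
        weights = torch.softmax(gate(stacked), dim=1)     # adaptive per-layer gate weights
        return (weights * stacked).sum(dim=1)             # (batch, seq, hidden)

    def forward(self, frozen_layer_states):
        low = self._pool(frozen_layer_states[: self.split], self.low_gate)
        high = self._pool(frozen_layer_states[self.split :], self.high_gate)
        # `low` and `high` would each be passed as the memory of a MemoryAttentionLayer
        # (see the earlier sketch) at a lower and an upper domain-specific layer.
        return low, high
```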

Results and Experimental Analysis

G-MAP demonstrates significant improvements across various domain-specific tasks, including text classification, QA, and NER. In particular, the chunk-based gated memory transfer strategy surpasses the other strategies thanks to its efficient use of layer-wise, token-level information.

Experimental results highlight the effectiveness of leveraging memory from frozen PLMs:

  • Domain Classification Tasks: G-MAP outperforms traditional DAPT approaches across diverse domains, including biomedical science, computer science, news, and review datasets.
  • Application of TAPT: The proposed framework also achieves superior performance when combined with task-adaptive pre-training (TAPT), as shown in Figure 4.

    Figure 4: Performance of different layer selections in the chunk-based gated memory transfer strategy.

Further Discussion

The study also examines the effect of keeping the general PLM's memory frozen, showing that unfreezing it harms both performance and training efficiency. Moreover, comparisons with other attention-based fusion mechanisms establish the advantage of memory-attention in terms of parameter efficiency.

Pre-training Stage Application

Applying G-MAP during the pre-training stage itself yields a lower masked LM loss, suggesting its utility in preserving general knowledge during domain adaptation (Figure 5).

Figure 5: Masked LM loss during the pre-training stage across domains with G-MAP; lower values are better.

Conclusion

The G-MAP framework presents a promising solution to catastrophic forgetting in DAPT scenarios by incorporating memory representations that preserve general knowledge. Its flexibility extends across numerous NLP tasks and domains, with empirical evidence showcasing its superior performance compared to conventional models.

In summary, G-MAP successfully mitigates forgetting while boosting generalization capability, setting a new benchmark for augmenting domain-specific PLMs. Future work could explore large-scale pre-training with G-MAP and investigate automatic layer-selection methodologies to further enhance performance (Figure 6).

Figure 6: Performance of different layer-selection indexes in memory-attention strategies.
