- The paper demonstrates that reverse transfer learning can enhance language model performance when pre-training objectives align with the LM task.
- It uses diverse embeddings—including in-domain word2vec and bidirectional LSTM LM embeddings—to assess their impact on perplexity.
- Findings show that context-focused, in-domain embeddings outperform out-of-domain and global semantic models, underscoring the importance of objective congruence.
Reverse Transfer Learning: Assessing the Efficacy of Task-Specific Word Embeddings in Neural Language Models
This paper presents a systematic evaluation of reverse transfer learning for neural LMs by investigating whether pre-trained word embeddings derived from objectives other than language modeling (e.g., domain classification) can improve LM quality. The authors contrast these with embeddings pre-trained on objectives closely related to language modeling, such as word2vec and bidirectional LSTM LMs, to separate the effect of global semantic awareness from that of a congruent training objective on the downstream LM.
Methodological Overview
The experimental framework centers on replacing the standard randomly initialized word embedding layer in a one-layer LSTM LM (128 units, Adam optimizer) with various pre-trained embedding matrices. These include:
- word2vec Embeddings: Trained on both a large in-domain corpus (5B words) and a much larger out-of-domain corpus (Google News, 100B words).
- Bidirectional LSTM LM Embeddings: Derived from sentence-level training on the LM corpus, with the resulting vectors averaged over all occurrences of each word to yield context-independent representations (see the averaging sketch after this list).
- Domain Classifier Embeddings: Learned via a bidirectional LSTM trained to predict the semantic domain of paragraphs or sentences, thereby encapsulating global semantic attributes.
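The averaging step for the bidirectional LSTM LM embeddings can be illustrated with a brief sketch. This is hypothetical code, not the authors' implementation: each word type receives the mean of the contextual vectors collected at its occurrences.

```python
from collections import defaultdict
import numpy as np

def context_independent_embeddings(occurrence_vectors):
    """Average per-occurrence contextual vectors into one vector per word type.

    occurrence_vectors: iterable of (word, vector) pairs, one per occurrence,
    e.g. bidirectional LSTM states gathered while running over the corpus.
    """
    sums = defaultdict(lambda: None)
    counts = defaultdict(int)
    for word, vec in occurrence_vectors:
        sums[word] = vec if sums[word] is None else sums[word] + vec
        counts[word] += 1
    return {word: sums[word] / counts[word] for word in sums}

# Toy usage: two occurrences of "bank" with different contextual states.
occurrences = [("bank", np.array([0.2, 0.4])), ("bank", np.array([0.6, 0.0]))]
print(context_independent_embeddings(occurrences)["bank"])  # -> [0.4 0.2]
```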
The primary evaluation metric is perplexity (PPL) on a diverse, in-house 10M-word dataset. All pre-trained embeddings are kept fixed during LM training so that any change in perplexity can be attributed to the transferred representations; a minimal sketch of this setup follows.
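The sketch below builds a one-layer LSTM LM whose embedding layer is initialized from a pre-trained matrix and frozen during training. The 128 hidden units and the Adam optimizer come from the paper; the framework (PyTorch), vocabulary size, and embedding dimension are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class LSTMLanguageModel(nn.Module):
    """One-layer LSTM LM with a fixed, pre-trained embedding matrix."""

    def __init__(self, pretrained_embeddings: torch.Tensor, hidden_size: int = 128):
        super().__init__()
        vocab_size, embed_dim = pretrained_embeddings.shape
        # freeze=True keeps the transferred embeddings fixed during LM training.
        self.embedding = nn.Embedding.from_pretrained(pretrained_embeddings, freeze=True)
        self.lstm = nn.LSTM(embed_dim, hidden_size, num_layers=1, batch_first=True)
        self.output = nn.Linear(hidden_size, vocab_size)

    def forward(self, token_ids):
        states, _ = self.lstm(self.embedding(token_ids))
        return self.output(states)  # per-position logits over the next word

# Hypothetical vocabulary size and embedding dimension; replace with a real matrix.
pretrained = torch.randn(10_000, 300)
model = LSTMLanguageModel(pretrained)
optimizer = torch.optim.Adam([p for p in model.parameters() if p.requires_grad])
```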
Empirical Results
A summary of the key numerical findings (the relative changes are verified in the short snippet after this list):
- In-domain word2vec embeddings reduce LM perplexity from 189 (random embedding baseline) to 162, constituting a 14% relative improvement.
- Out-of-domain word2vec embeddings (Google News) do not yield perplexity gains, and in fact slightly worsen performance (PPL 195).
- Domain classifier embeddings consistently fail to improve perplexity, with PPL scores ranging from 239 to 255, substantially worse than even the random-embedding baseline.
- Bidirectional LSTM LM embeddings, pre-trained directly on the LM task, deliver modest perplexity reductions (PPL 185), showing greater promise than domain embeddings despite using less data.
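The relative changes quoted above follow directly from the absolute perplexities; a quick check using only the numbers reported in this summary:

```python
baseline = 189  # random-embedding LSTM LM

def relative_change(ppl, ref=baseline):
    return (ref - ppl) / ref

for name, ppl in [("in-domain word2vec", 162),
                  ("Google News word2vec", 195),
                  ("bidirectional LSTM LM", 185)]:
    print(f"{name}: {relative_change(ppl):+.1%} vs. baseline")
# in-domain word2vec: +14.3%   (the ~14% relative improvement cited above)
# Google News word2vec: -3.2%  (a slight degradation)
# bidirectional LSTM LM: +2.1% (a modest gain)
```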
Analysis and Interpretation
A salient outcome is the confirmation that the transferability of pre-trained embeddings to LMs is highly sensitive to the alignment between the pre-training objective and the language modeling task. Embeddings optimized for predicting local lexical context, whether via word2vec or direct language modeling, retain their utility within LMs. In contrast, embeddings emphasizing global semantics (e.g., domain classifier representations) do not translate into perplexity gains and, in most configurations, are detrimental.
This result is robust against variations in embedding source data size and normalization. Notably, the failure of domain classifier embeddings remains despite architectural modifications and value normalization, suggesting a fundamental limitation in using global, label-driven representations for a quintessentially local prediction task like next-word prediction.
The fact that massive out-of-domain embeddings (Google News word2vec) fail to help, and occasionally hurt, underscores that congruence of data distribution is critical: a larger but mismatched corpus does not compensate for divergence in domain and style.
Theoretical and Practical Implications
The findings reinforce several practical and theoretical perspectives:
- Objective congruence between embedding pre-training and LM tasks is critical for effective transfer; mere global semantic encoding is insufficient for improving local prediction tasks.
- Domain and data alignment take precedence over embedding model scale; practitioners should prioritize in-domain pre-training for task-specific LMs.
- Reverse transfer learning—using embeddings from other NLP tasks to assist LMs—appears limited under the conventional next-word prediction paradigm unless embedding objectives are adapted to the LM’s needs.
In practical system building, this suggests that efforts to incorporate richer semantic context into LMs should focus on model architectures explicitly engineered to utilize global signals, possibly in multi-task or auxiliary objective setups, rather than relying solely on imported embeddings from disparate tasks.
Future Directions
The paper advocates for exploring multi-task learning strategies in which language modeling and semantic supervision (e.g., domain labels) are jointly optimized. Such an approach may allow LMs to internalize global semantic information more organically and yield broader improvements across both core language modeling and downstream tasks. Additionally, the observed improvements from bidirectional LM embeddings trained on limited data hint at the potential gains from scaling in-domain, task-congruent pre-training.
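One way to operationalize the proposed joint optimization is sketched below. This is a hedged illustration under assumed PyTorch components, not a method prescribed by the paper: a shared LSTM encoder feeds both a next-word prediction head and a sentence-level domain classification head, and the two cross-entropy losses are combined with a weighting factor.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTaskLM(nn.Module):
    """Shared LSTM encoder with next-word and domain-classification heads (illustrative)."""

    def __init__(self, vocab_size, embed_dim, hidden_size, num_domains):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_size, batch_first=True)
        self.lm_head = nn.Linear(hidden_size, vocab_size)        # local objective
        self.domain_head = nn.Linear(hidden_size, num_domains)   # global objective

    def forward(self, token_ids):
        states, _ = self.lstm(self.embedding(token_ids))
        lm_logits = self.lm_head(states)                  # (batch, time, vocab)
        domain_logits = self.domain_head(states.mean(1))  # (batch, num_domains)
        return lm_logits, domain_logits

def joint_loss(lm_logits, next_words, domain_logits, domain_labels, aux_weight=0.5):
    # aux_weight is a hypothetical hyperparameter balancing the two objectives.
    lm_loss = F.cross_entropy(lm_logits.flatten(0, 1), next_words.flatten())
    domain_loss = F.cross_entropy(domain_logits, domain_labels)
    return lm_loss + aux_weight * domain_loss
```

The auxiliary weight controls how strongly the global domain signal shapes the shared representations relative to the local next-word objective, which is precisely the trade-off the paper suggests exploring.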
Given these findings, it is plausible that deep integration of global and local objectives within LM architectures, potentially with adaptive context windows or hierarchical models, will be more fruitful than reverse transfer solely at the embedding layer.
This work provides a careful empirical baseline and a nuanced interpretation of the utility of pre-trained embeddings in neural LMs, clarifying the present constraints of reverse transfer learning for LMs and setting the stage for future integration of global semantic signals into next-word prediction systems. The numerical outcomes support the assertion that the success of transfer learning in NLP remains deeply dependent on task, data, and objective congruence, even as larger and more diverse embedding models proliferate.