Pre-Training a Language Model Without Human Language (2012.11995v1)

Published 22 Dec 2020 in cs.CL

Abstract: In this paper, we study how the intrinsic nature of pre-training data contributes to the fine-tuned downstream performance. To this end, we pre-train different transformer-based masked language models on several corpora with certain features, and we fine-tune those language models on GLUE benchmarks. We find that models pre-trained on unstructured data beat those trained directly from scratch on downstream tasks. Our results also show that pre-training on structured data does not always make the model acquire ability that can be transferred to natural language downstream tasks. To our great astonishment, we uncover that pre-training on certain non-human language data gives GLUE performance close to performance pre-trained on another non-English language.

Summary

The paper demonstrates that pre-training on non-human language data imparts transferable structural skills that enhance downstream NLU tasks.
It shows that pre-training on unstructured data beats training from scratch, and that an artificial hierarchical dataset can yield GLUE performance close to that of pre-training on a non-English human language.
The findings indicate that the number of token embeddings used during pre-training, rather than the token distribution, strongly affects downstream performance.

Analysis of Pre-Training Language Models Without Human Language

The paper "Pre-Training a Language Model Without Human Language," by Cheng-Han Chiang and Hung-yi Lee, challenges the conventional approach to pre-training language models (LMs) by asking whether non-human language data can serve as pre-training data. The authors focus on how the intrinsic characteristics of a pre-training dataset influence downstream task performance, using transformer-based masked language models (MLMs). They assess whether models pre-trained on non-human language data remain competitive when fine-tuned on natural language understanding (NLU) tasks, as measured by the GLUE benchmark.
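To make the masked-language-model objective concrete, the following is a minimal sketch of BERT-style token masking. It is not taken from the paper: the 15% masking rate and the 80/10/10 replacement split are the conventional BERT settings and are assumed here, and `MASK_ID` and `mask_tokens` are hypothetical names chosen for illustration. The function operates on integer token ids only, which is why the same objective can be applied to corpora that are not human language.

```python
import random

MASK_ID = 0  # hypothetical id reserved for the [MASK] token (assumption)

def mask_tokens(token_ids, vocab_size, mask_prob=0.15, rng=None):
    """Return (inputs, labels) for one masked-LM training example.

    labels[i] holds the original token at positions selected for prediction
    and None elsewhere, so a loss would only be computed at selected positions.
    """
    rng = rng or random.Random()
    inputs = list(token_ids)
    labels = [None] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if rng.random() >= mask_prob:
            continue  # position left untouched; not a prediction target
        labels[i] = tok
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = MASK_ID                       # 80%: replace with [MASK]
        elif roll < 0.9:
            inputs[i] = rng.randrange(1, vocab_size)  # 10%: random non-mask token
        # remaining 10%: keep the original token in the input
    return inputs, labels
```

Nothing in this procedure depends on what the ids stand for, so amino acid symbols, code tokens, or synthetic symbols can be fed through it unchanged.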

Key Insights and Findings

This research provides several insights into the relationship between pre-training data and downstream performance:

Advantage of Pre-training on Unstructured Data: The paper reveals that models pre-trained on unstructured, non-linguistic datasets outperform those trained from scratch on downstream tasks. This finding suggests that the process of pre-training, even on seemingly irrelevant datasets, imbues the models with certain transferable skills advantageous for downstream tasks.
Challenges with Structured Data: Contrary to expectations, pre-training on structured datasets, such as amino acid sequences and programming code, does not necessarily improve downstream performance. This finding challenges the assumption that structure in the pre-training data is by itself enough to produce abilities that transfer to natural language tasks.
Comparable Results from Artificial Datasets: An intriguing discovery is that pre-training on artificial hierarchical datasets can yield performance similar to that achieved by pre-training on another human language, Kannada. This suggests that the ability to model hierarchical structure, rather than semantic understanding, is a key skill acquired during pre-training that transfers to downstream tasks.
Token Distribution and Vocabulary Size: The distribution of tokens in the pre-training dataset appears to have minimal impact on transfer performance. However, the number of token embeddings used during pre-training significantly affects outcomes. Pre-training with a small number of distinct tokens limits the model's flexibility in downstream tasks, although certain manipulations can mitigate this.
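The distinction between unstructured and hierarchical artificial data can be illustrated with a small generator. This is a hypothetical sketch, not the paper's procedure, which this summary does not describe: `unstructured_sequence` draws tokens independently (uniformly, or from a Zipf-like distribution if an exponent is given), and `hierarchical_sequence` emits matched token pairs that nest like brackets. All function names and parameters are assumptions made for illustration; `vocab_size` stands in for the number of distinct tokens seen during pre-training.

```python
import random

def unstructured_sequence(length, vocab_size, rng, zipf_exponent=None):
    """Draw tokens independently: uniform by default, Zipf-like if an exponent is given."""
    tokens = list(range(vocab_size))
    if zipf_exponent is None:
        weights = None  # uniform sampling
    else:
        # weight of the token at a given 0-based rank is 1 / (rank + 1)^exponent
        weights = [1.0 / (rank + 1) ** zipf_exponent for rank in tokens]
    return rng.choices(tokens, weights=weights, k=length)

def hierarchical_sequence(num_pairs, vocab_size, rng, open_prob=0.5):
    """Emit matched token pairs so that dependencies nest like brackets.

    Each sampled token is emitted once when opened and once when closed;
    the stack guarantees that the most recently opened token closes first.
    """
    sequence, stack, opened = [], [], 0
    while opened < num_pairs or stack:
        can_open = opened < num_pairs
        if can_open and (not stack or rng.random() < open_prob):
            tok = rng.randrange(vocab_size)
            stack.append(tok)
            sequence.append(tok)
            opened += 1
        else:
            sequence.append(stack.pop())  # close the innermost open token
    return sequence
```

Varying `zipf_exponent` changes the token distribution while holding the vocabulary fixed, and varying `vocab_size` changes the number of distinct tokens, which are the two factors the summary contrasts.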

Implications for Future AI Developments

The findings have both theoretical and practical implications for future AI research and applications:

Decoupling Semantic Understanding from Structural Learning: The results underscore the potential to decouple semantic knowledge from structural learning in LMs. Future research could further clarify which capabilities of LMs derive from structural versus semantic aspects of pre-training.
Resource Optimization: For low-resource languages or domains that lack extensive corpora, the paper suggests that non-linguistic or artificially generated data could be used to bootstrap initial learning.
Refined Pre-training Strategies: Given the limited impact of token distribution, pre-training strategies might focus more on the structural properties of the data, particularly for NLP tasks where syntactic understanding is paramount.

Conclusion

Chiang and Lee's work contributes to a more nuanced understanding of the role of pre-training data in masked language models. By demonstrating that pre-training on non-human language data can be surprisingly effective, it invites further investigation into unconventional pre-training sources, which may reduce computational requirements and broaden the applicability of language models across diverse linguistic scenarios. Future work could examine how these findings generalize to other architectures or languages, with possible benefits for multilingual and resource-scarce NLP settings.
