Pretrained Transformers as Universal Computation Engines (2103.05247v2)

Published 9 Mar 2021 in cs.LG and cs.AI

Abstract: We investigate the capability of a transformer pretrained on natural language to generalize to other modalities with minimal finetuning -- in particular, without finetuning of the self-attention and feedforward layers of the residual blocks. We consider such a model, which we call a Frozen Pretrained Transformer (FPT), and study finetuning it on a variety of sequence classification tasks spanning numerical computation, vision, and protein fold prediction. In contrast to prior works which investigate finetuning on the same modality as the pretraining dataset, we show that pretraining on natural language can improve performance and compute efficiency on non-language downstream tasks. Additionally, we perform an analysis of the architecture, comparing the performance of a random initialized transformer to a random LSTM. Combining the two insights, we find language-pretrained transformers can obtain strong performance on a variety of non-language tasks.

Citations (210)

Summary

  • The paper introduces Frozen Pretrained Transformers (FPT) that generalize across language, vision, and numerical tasks.
  • The paper uses a minimal finetuning strategy by adjusting only a few layers while keeping self-attention and feedforward layers unchanged.
  • The paper demonstrates strong results, including perfect accuracy on the Bit Memory and Bit XOR tasks and competitive performance on image classification.

Overview of "Pretrained Transformers as Universal Computation Engines"

The paper explores the application of pretrained transformer models as general-purpose computation engines across multiple non-language modalities. Specifically, the authors introduce a concept called the Frozen Pretrained Transformer (FPT), wherein a transformer is pretrained on natural language tasks and subsequently finetuned on diverse sequence classification tasks without altering its self-attention and feedforward layers.
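
As a rough illustration of this setup (a minimal sketch, not the authors' released code), the PyTorch snippet below wraps a Hugging Face GPT-2 backbone, freezes its blocks, and attaches small trainable input and output layers; the class name, readout strategy, and parameter-name matching are assumptions made for this example.

```python
import torch
import torch.nn as nn
from transformers import GPT2Model  # assumes the Hugging Face transformers package


class FrozenPretrainedTransformer(nn.Module):
    """Hypothetical FPT-style wrapper: frozen GPT-2 blocks, small trainable I/O layers."""

    def __init__(self, input_dim: int, num_classes: int):
        super().__init__()
        self.backbone = GPT2Model.from_pretrained("gpt2")
        hidden = self.backbone.config.n_embd  # 768 for base GPT-2

        # Freeze every pretrained parameter by default.
        for param in self.backbone.parameters():
            param.requires_grad = False

        # Re-enable the small groups that FPT finetunes: layer norms ("ln_*")
        # and positional embeddings ("wpe").
        for name, param in self.backbone.named_parameters():
            if "ln" in name or "wpe" in name:
                param.requires_grad = True

        # Trainable input projection and output head for the new modality.
        self.input_proj = nn.Linear(input_dim, hidden)
        self.output_head = nn.Linear(hidden, num_classes)

    def forward(self, x):
        # x: (batch, seq_len, input_dim) sequence from an arbitrary modality.
        h = self.input_proj(x)
        h = self.backbone(inputs_embeds=h).last_hidden_state
        # Reading out from the final position is a simplification; the paper's
        # readout varies by task.
        return self.output_head(h[:, -1])
```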

Key Contributions

  1. Cross-Modality Transferability: The central hypothesis is that transformers pretrained on abundant language data can generalize to non-language tasks such as numerical computation, vision, and protein folding. This transferability challenges traditional approaches where the same modality is used for both pretraining and finetuning.
  2. Minimal Finetuning Strategy: The paper finetunes only a few parameter groups (the input and output layers, the layer norms, and the positional embeddings), while the bulk of the transformer, in particular the self-attention and feedforward layers, remains frozen; a brief parameter-count sketch follows this list. This suggests that universal computation capabilities are inherent to the pretrained transformer architecture.
  3. Performance Evaluation: Extensive empirical results show that FPT achieves strong performance across tasks relative to both fully trained transformers and traditional LSTMs; for example, it reaches perfect accuracy on the Bit Memory and Bit XOR tasks and performs competitively on image classification.
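
Because only those small groups plus the new input and output layers receive gradients, the finetuned fraction of parameters is tiny. A quick check, assuming the hypothetical FrozenPretrainedTransformer sketched above (the learning rate is illustrative):

```python
import torch

# Assumes the FrozenPretrainedTransformer sketch above is in scope.
model = FrozenPretrainedTransformer(input_dim=1, num_classes=2)

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"finetuned parameters: {trainable:,} of {total:,} "
      f"({100 * trainable / total:.2f}%)")

# Only the unfrozen parameters are handed to the optimizer.
optimizer = torch.optim.Adam(
    [p for p in model.parameters() if p.requires_grad], lr=1e-3
)
```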

Methodology and Results

  1. Experimental Setup: The authors test FPT on several sequence classification tasks: Bit Memory, Bit XOR, ListOps, MNIST, CIFAR-10, CIFAR-10 LRA, and Remote Homology. These tasks are designed to evaluate the model's ability to process and generalize across sequences from different modalities; a minimal sketch of the Bit XOR setup follows this list.
  2. Comparative Analysis: Results indicate that FPT models significantly outperform LSTM models, particularly on tasks involving longer sequence lengths. They also often match the performance of fully trained transformers, highlighting the pretrained model's robustness and versatility.
  3. Pretraining Effects: Language-pretrained transformers show better performance and faster convergence compared to random initializations or pretraining on non-language tasks, emphasizing the effectiveness of language as a pretraining modality.
  4. Impacts of Architecture: The paper finds that the nature of computation performed by self-attention layers contributes to the transferability across tasks. Moreover, architecture adjustments, like adding residual connections, improve LSTM models but still do not reach the transformer’s performance.
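
To make the synthetic benchmarks concrete, the snippet below is a hypothetical generator in the spirit of the Bit XOR task: the model observes two random bitstrings and must predict their elementwise XOR. The bitstring length and tensor layout here are illustrative rather than the paper's exact configuration.

```python
import torch


def make_bit_xor_batch(batch_size: int = 32, n_bits: int = 5):
    """Generate a batch for a Bit XOR-style task: given two random
    bitstrings, the target is their elementwise XOR."""
    a = torch.randint(0, 2, (batch_size, n_bits))
    b = torch.randint(0, 2, (batch_size, n_bits))
    inputs = torch.cat([a, b], dim=1).unsqueeze(-1).float()  # (batch, 2*n_bits, 1)
    targets = torch.bitwise_xor(a, b)                        # (batch, n_bits)
    return inputs, targets


x, y = make_bit_xor_batch()
print(x.shape, y.shape)  # torch.Size([32, 10, 1]) torch.Size([32, 5])
```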

Implications and Future Directions

The findings of this paper have significant implications for building AI systems that can learn from and operate across multiple domains. The demonstrated ability of language-pretrained transformers to serve as universal computation engines opens pathways for leveraging multimodal information more effectively.

Future research could focus on:

  • Automating and optimizing the pretraining process by incorporating multiple data-rich modalities.
  • Developing multimodal architectures that more seamlessly integrate with pretrained models.
  • Exploring the theoretical underpinnings that enable such cross-task transferability within transformer models.

The paper provides a compelling case for expanding the applications of pretrained transformers beyond their traditional language-focused boundaries, potentially leading to more efficient and capable AI systems.
