Pretrained Transformers as Universal Computation Engines (2103.05247v2)

Published 9 Mar 2021 in cs.LG and cs.AI

Abstract: We investigate the capability of a transformer pretrained on natural language to generalize to other modalities with minimal finetuning -- in particular, without finetuning of the self-attention and feedforward layers of the residual blocks. We consider such a model, which we call a Frozen Pretrained Transformer (FPT), and study finetuning it on a variety of sequence classification tasks spanning numerical computation, vision, and protein fold prediction. In contrast to prior works which investigate finetuning on the same modality as the pretraining dataset, we show that pretraining on natural language can improve performance and compute efficiency on non-language downstream tasks. Additionally, we perform an analysis of the architecture, comparing the performance of a random initialized transformer to a random LSTM. Combining the two insights, we find language-pretrained transformers can obtain strong performance on a variety of non-language tasks.

Citations (210)

Summary

  • The paper introduces Frozen Pretrained Transformers (FPT) that generalize across language, vision, and numerical tasks.
  • The paper uses a minimal finetuning strategy by adjusting only a few layers while keeping self-attention and feedforward layers unchanged.
  • The paper demonstrates strong results, including perfect accuracy on the Bit Memory and Bit XOR tasks and competitive performance on image classification.

Overview of "Pretrained Transformers as Universal Computation Engines"

The paper explores the application of pretrained transformer models as general-purpose computation engines across multiple non-language modalities. Specifically, the authors introduce a concept called the Frozen Pretrained Transformer (FPT), wherein a transformer is pretrained on natural language tasks and subsequently finetuned on diverse sequence classification tasks without altering its self-attention and feedforward layers.
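
As a rough illustration of this setup (a minimal sketch, not the authors' released code), the PyTorch snippet below wraps a Hugging Face GPT-2 backbone, freezes its blocks, and attaches small trainable input and output layers; the class name, readout strategy, and parameter-name matching are assumptions made for this example.

```python
import torch
import torch.nn as nn
from transformers import GPT2Model  # assumes the Hugging Face transformers package


class FrozenPretrainedTransformer(nn.Module):
    """Hypothetical FPT-style wrapper: frozen GPT-2 blocks, small trainable I/O layers."""

    def __init__(self, input_dim: int, num_classes: int):
        super().__init__()
        self.backbone = GPT2Model.from_pretrained("gpt2")
        hidden = self.backbone.config.n_embd  # 768 for base GPT-2

        # Freeze every pretrained parameter by default.
        for param in self.backbone.parameters():
            param.requires_grad = False

        # Re-enable the small groups that FPT finetunes: layer norms ("ln_*")
        # and positional embeddings ("wpe").
        for name, param in self.backbone.named_parameters():
            if "ln" in name or "wpe" in name:
                param.requires_grad = True

        # Trainable input projection and output head for the new modality.
        self.input_proj = nn.Linear(input_dim, hidden)
        self.output_head = nn.Linear(hidden, num_classes)

    def forward(self, x):
        # x: (batch, seq_len, input_dim) sequence from an arbitrary modality.
        h = self.input_proj(x)
        h = self.backbone(inputs_embeds=h).last_hidden_state
        # Reading out from the final position is a simplification; the paper's
        # readout varies by task.
        return self.output_head(h[:, -1])
```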

Key Contributions

  1. Cross-Modality Transferability: The central hypothesis is that transformers pretrained on abundant language data can generalize to non-language tasks such as numerical computation, vision, and protein folding. This transferability challenges traditional approaches where the same modality is used for both pretraining and finetuning.
  2. Minimal Finetuning Strategy: The paper finetunes only a few parameter groups (the input and output layers, the layer norms, and the positional embeddings), while the bulk of the transformer, in particular the self-attention and feedforward layers, remains frozen; a brief parameter-count sketch follows this list. This suggests that universal computation capabilities are inherent to the pretrained transformer architecture.
  3. Performance Evaluation: Extensive empirical results show that FPT achieves strong performance across tasks relative to both fully trained transformers and traditional LSTMs; for example, it reaches perfect accuracy on the Bit Memory and Bit XOR tasks and performs competitively on image classification.
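
Because only those small groups plus the new input and output layers receive gradients, the finetuned fraction of parameters is tiny. A quick check, assuming the hypothetical FrozenPretrainedTransformer sketched above (the learning rate is illustrative):

```python
import torch

# Assumes the FrozenPretrainedTransformer sketch above is in scope.
model = FrozenPretrainedTransformer(input_dim=1, num_classes=2)

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"finetuned parameters: {trainable:,} of {total:,} "
      f"({100 * trainable / total:.2f}%)")

# Only the unfrozen parameters are handed to the optimizer.
optimizer = torch.optim.Adam(
    [p for p in model.parameters() if p.requires_grad], lr=1e-3
)
```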

Methodology and Results

  1. Experimental Setup: The authors test FPT on several sequence classification tasks: Bit Memory, Bit XOR, ListOps, MNIST, CIFAR-10, CIFAR-10 LRA, and Remote Homology. These tasks are designed to evaluate the model's ability to process and generalize across sequences from different modalities; a minimal sketch of the Bit XOR setup follows this list.
  2. Comparative Analysis: Results indicate that FPT models significantly outperform LSTM models, particularly on tasks involving longer sequence lengths. They also often match the performance of fully trained transformers, highlighting the pretrained model's robustness and versatility.
  3. Pretraining Effects: Language-pretrained transformers show better performance and faster convergence compared to random initializations or pretraining on non-language tasks, emphasizing the effectiveness of language as a pretraining modality.
  4. Impacts of Architecture: The paper finds that the nature of computation performed by self-attention layers contributes to the transferability across tasks. Moreover, architecture adjustments, like adding residual connections, improve LSTM models but still do not reach the transformer’s performance.
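
To make the synthetic benchmarks concrete, the snippet below is a hypothetical generator in the spirit of the Bit XOR task: the model observes two random bitstrings and must predict their elementwise XOR. The bitstring length and tensor layout here are illustrative rather than the paper's exact configuration.

```python
import torch


def make_bit_xor_batch(batch_size: int = 32, n_bits: int = 5):
    """Generate a batch for a Bit XOR-style task: given two random
    bitstrings, the target is their elementwise XOR."""
    a = torch.randint(0, 2, (batch_size, n_bits))
    b = torch.randint(0, 2, (batch_size, n_bits))
    inputs = torch.cat([a, b], dim=1).unsqueeze(-1).float()  # (batch, 2*n_bits, 1)
    targets = torch.bitwise_xor(a, b)                        # (batch, n_bits)
    return inputs, targets


x, y = make_bit_xor_batch()
print(x.shape, y.shape)  # torch.Size([32, 10, 1]) torch.Size([32, 5])
```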

Implications and Future Directions

The findings of this paper have significant implications for building AI systems that can learn from and operate across multiple domains. The demonstrated ability of language-pretrained transformers to serve as universal computation engines opens pathways for leveraging multimodal information more effectively.

Future research could focus on:

  • Automating and optimizing the pretraining process by incorporating multiple data-rich modalities.
  • Developing multimodal architectures that more seamlessly integrate with pretrained models.
  • Exploring the theoretical underpinnings that enable such cross-task transferability within transformer models.

The paper provides a compelling case for expanding the applications of pretrained transformers beyond their traditional language-focused boundaries, potentially leading to more efficient and capable AI systems.
