Denoising Autoregressive Representation Learning

(2403.05196)
Published Mar 8, 2024 in cs.LG and cs.CV

Abstract

In this paper, we explore a new generative approach for learning visual representations. Our method, DARL, employs a decoder-only Transformer to predict image patches autoregressively. We find that training with Mean Squared Error (MSE) alone leads to strong representations. To enhance the image generation ability, we replace the MSE loss with the diffusion objective by using a denoising patch decoder. We show that the learned representation can be improved by using tailored noise schedules and longer training in larger models. Notably, the optimal schedule differs significantly from the typical ones used in standard image diffusion models. Overall, despite its simple architecture, DARL delivers performance remarkably close to state-of-the-art masked prediction models under the fine-tuning protocol. This marks an important step towards a unified model capable of both visual perception and generation, effectively combining the strengths of autoregressive and denoising diffusion models.

Overview

  • DARL introduces a novel approach combining autoregressive and denoising diffusion models for visual representation and generation, using a decoder-only Transformer.

  • This approach achieves near-parity with state-of-the-art masked prediction models under fine-tuning, even when trained with a simple MSE objective, and gains generative capability from a diffusion objective.

  • The paper highlights the importance of decomposed Rotary Positional Embedding and investigates the impacts of model scaling and noise schedules on performance.

  • DARL's unified model architecture suggests a reevaluation of generative pre-training's efficacy in visual tasks, offering insights for future research and applications.

Exploring DARL: A Unified Model for Visual Representation and Generation

Introduction

In the pursuit of enhancing generative pre-training in computer vision, this paper introduces Denoising Autoregressive Representation Learning (DARL), a novel approach that marries the strengths of autoregressive and denoising diffusion models within a unified architecture. DARL employs a decoder-only Transformer that predicts image patches autoregressively. Remarkably, this work demonstrates that performance closely matches state-of-the-art masked prediction models under fine-tuning, even with a training regime as simple as Mean Squared Error (MSE) loss. Furthermore, by replacing the MSE loss with a diffusion objective via a denoising patch decoder, DARL improves its generative capability, signaling progress toward versatile models capable of both sophisticated visual perception and generation.
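Concretely, the patch-level autoregressive MSE objective can be sketched as follows. This is a minimal NumPy illustration under assumed conventions (raster-order patches), not the authors' implementation; `predict_fn` is a hypothetical stand-in for the causal Transformer's next-patch prediction.

```python
import numpy as np

def patchify(image, patch_size):
    """Split an (H, W, C) image into a sequence of flattened patches,
    ordered left-to-right, top-to-bottom (raster order)."""
    h, w, c = image.shape
    p = patch_size
    patches = image.reshape(h // p, p, w // p, p, c)
    patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, p * p * c)
    return patches

def autoregressive_mse_loss(predict_fn, patches):
    """MSE of predicting each patch from all earlier patches.
    `predict_fn(prefix)` is a hypothetical placeholder for the causal
    Transformer's output given the prefix of previous patches."""
    losses = []
    for t in range(1, len(patches)):
        pred = predict_fn(patches[:t])
        losses.append(np.mean((pred - patches[t]) ** 2))
    return float(np.mean(losses))
```

For example, a trivial baseline that predicts the next patch by copying the previous one already yields a finite loss, which the learned model should drive much lower.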

Key Contributions

The paper makes several noteworthy contributions to the field of visual representation learning:

  • Denoising Autoregressive Learning: DARL innovatively combines autoregressive prediction with denoising diffusion mechanisms. This hybrid approach enables robust visual representation learning, exhibiting near-parity with leading masked prediction models under fine-tuning evaluations.
  • Positional Encoding Insights: Through extensive experimentation, the study underscores the efficacy of decomposed Rotary Positional Embedding (RoPE) for causal Transformers in visual tasks. This novel 2D RoPE outperforms traditional positional encoding schemes, particularly enhancing autoregressive models.
  • Model Scaling and Noise Schedules: The research explores the impact of model size, training duration, and noise schedule on learning outcomes. Larger models, longer training, and tailored noise schedules all improve performance; notably, the optimal schedule differs markedly from those used in standard image diffusion models.
  • Efficacy of MSE and Diffusion Objectives: The study compares MSE loss against diffusion objectives for pre-training. Remarkably, MSE alone yields strong performance; however, diffusion objectives further refine generative capabilities, especially with tailored noise schedules and extended training regimes.
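
To illustrate the decomposed 2D RoPE idea mentioned above, here is a minimal NumPy sketch: the feature dimension is split in half, one half rotated by the patch's row index and the other by its column index. The helper names are hypothetical and this illustrates the general technique, not necessarily the paper's exact formulation.

```python
import numpy as np

def rope_1d(x, pos, base=10000.0):
    """Standard 1D rotary embedding applied to the last axis of x:
    consecutive feature pairs are rotated by position-dependent angles."""
    d = x.shape[-1]
    freqs = base ** (-np.arange(0, d, 2) / d)      # (d/2,) frequencies
    angles = pos[:, None] * freqs[None, :]         # (n, d/2) rotation angles
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def decomposed_rope_2d(x, rows, cols):
    """Decomposed 2D RoPE: rotate the first half of the feature dim by the
    patch's row index and the second half by its column index."""
    half = x.shape[-1] // 2
    return np.concatenate(
        [rope_1d(x[..., :half], rows), rope_1d(x[..., half:], cols)], axis=-1
    )
```

Because each step is a pure rotation, token norms are preserved, which keeps attention logits well scaled while encoding 2D position relatively.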

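For context on the noise schedules discussed above, the sketch below shows a standard DDPM-style linear variance schedule and forward corruption step; the paper's finding is that the schedule best suited for representation learning deviates significantly from such defaults. Function names and parameter values are illustrative assumptions.

```python
import numpy as np

def linear_noise_schedule(num_steps, beta_min=1e-4, beta_max=0.02):
    """A common linear variance schedule (DDPM-style baseline); DARL finds
    the optimal schedule for representation learning differs from this."""
    betas = np.linspace(beta_min, beta_max, num_steps)
    alphas_bar = np.cumprod(1.0 - betas)   # cumulative signal retention
    return alphas_bar

def noisy_patch(patch, alpha_bar_t, rng):
    """Corrupt a clean patch with the diffusion forward process:
    x_t = sqrt(a_bar) * x_0 + sqrt(1 - a_bar) * eps, eps ~ N(0, I).
    The denoising patch decoder is trained to recover x_0 (or eps)."""
    eps = rng.standard_normal(patch.shape)
    return np.sqrt(alpha_bar_t) * patch + np.sqrt(1.0 - alpha_bar_t) * eps, eps
```

Tuning where this schedule concentrates noise changes how much the model must rely on context versus the noisy target, which is one way to read the paper's observation that representation-friendly schedules differ from generation-friendly ones.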
Theoretical and Practical Implications

From a theoretical standpoint, DARL's architecture prompts a reconsideration of generative pre-training's potential in visual tasks. It showcases that a unified model can adeptly handle both representation learning and image generation without compromising on performance. Practically, this work paves the way for more flexible and generalizable visual models that can be fine-tuned to a variety of downstream tasks with minimal performance loss, thereby broadening the applicability of generative models in real-world scenarios.

Future research could delve into refining the noise schedule and extending the model's capabilities to encompass more complex, multi-modal tasks. The insights regarding positional encoding also open avenues for further enhancing the Transformer architecture's applicability across various data types beyond images.

Conclusion

DARL marks a significant step towards realizing generative pre-training's full potential in vision. By adeptly blending autoregressive prediction with denoising diffusion processes within a cohesive framework, DARL not only matches but, in some instances, surpasses the capabilities of contemporary benchmarks in visual representation learning. By shedding light on the interactions between model components and training objectives, this research contributes foundational knowledge that will inform the development of more advanced, versatile generative models.
