Autoregressive Pretraining with Mamba in Vision (2406.07537v1)

Published 11 Jun 2024 in cs.CV

Abstract: The vision community has started to build on the recently developed state space model, Mamba, as a new backbone for a range of tasks. This paper shows that Mamba's visual capability can be significantly enhanced through autoregressive pretraining, a direction not previously explored. Efficiency-wise, the autoregressive objective capitalizes well on Mamba's unidirectional recurrent structure, enabling faster overall training than other strategies such as mask modeling. Performance-wise, autoregressive pretraining gives the Mamba architecture markedly higher accuracy than its supervised-trained counterparts and, more importantly, unlocks its scaling potential to large and even huge model sizes. For example, with autoregressive pretraining, a base-size Mamba attains 83.2% ImageNet accuracy, outperforming its supervised counterpart by 2.0%; our huge-size Mamba, the largest Vision Mamba to date, attains 85.0% ImageNet accuracy (85.5% when finetuned with 384×384 inputs), notably surpassing all other Mamba variants in vision. The code is available at https://github.com/OliverRensu/ARM.
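
To make the idea of autoregressive pretraining on a unidirectional recurrent model concrete, here is a minimal NumPy sketch. It is not the paper's implementation: the raster-order patch sequence, the toy linear recurrence standing in for a Mamba block, the mean-squared next-patch regression loss, and every name (`patchify`, `causal_scan`, `next_patch_loss`, `A`, `B`, `C`) are assumptions chosen only for illustration.

```python
import numpy as np

def patchify(image, patch):
    """Split an (H, W, C) image into a raster-ordered sequence of flattened patches."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0
    gh, gw = h // patch, w // patch
    x = image.reshape(gh, patch, gw, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)  # (gh, gw, patch, patch, c)
    return x.reshape(gh * gw, patch * patch * c)

def causal_scan(tokens, A, B, C):
    """Toy linear recurrence (hypothetical stand-in for a Mamba block):
    h_t = A h_{t-1} + B x_t,  y_t = C h_t.
    Output t depends only on tokens 0..t, so the scan is causal."""
    h = np.zeros(A.shape[0])
    outputs = []
    for x_t in tokens:
        h = A @ h + B @ x_t
        outputs.append(C @ h)
    return np.stack(outputs)

def next_patch_loss(tokens, A, B, C):
    """Mean-squared error between the prediction at position t and patch t+1."""
    preds = causal_scan(tokens[:-1], A, B, C)
    targets = tokens[1:]
    return float(np.mean((preds - targets) ** 2))

rng = np.random.default_rng(0)
image = rng.standard_normal((32, 32, 3))   # random stand-in for an image
tokens = patchify(image, patch=8)          # 16 patches, each of dimension 192

d_state, d_tok = 16, tokens.shape[1]
A = 0.9 * np.eye(d_state)                          # assumed stable state transition
B = 0.01 * rng.standard_normal((d_state, d_tok))   # assumed input projection
C = 0.01 * rng.standard_normal((d_tok, d_state))   # assumed output projection

loss = next_patch_loss(tokens, A, B, C)
print("next-patch loss:", loss)
```

Because each output depends only on earlier patches, a single left-to-right pass yields a prediction target at every position, which is one way to read the abstract's efficiency argument for pairing a unidirectional recurrence with an autoregressive objective.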

Related Papers

Swin-UMamba: Mamba-based UNet with ImageNet-based pretraining (2024)
A Survey on Visual Mamba (2024)
MambaOut: Do We Really Need Mamba for Vision? (2024)
Mamba-R: Vision Mamba ALSO Needs Registers (2024)
MambaVision: A Hybrid Mamba-Transformer Vision Backbone (2024)

GitHub

OliverRensu/ARM: https://github.com/OliverRensu/ARM