MAGE: MAsked Generative Encoder to Unify Representation Learning and Image Synthesis (2211.09117v2)
Abstract: Generative modeling and representation learning are two key tasks in computer vision. However, the corresponding models are typically trained independently, which ignores the potential for each task to help the other and incurs extra training and model-maintenance overhead. In this work, we propose MAsked Generative Encoder (MAGE), the first framework to unify SOTA image generation and self-supervised representation learning. Our key insight is that using variable masking ratios in masked image modeling pre-training allows generative training (very high masking ratio) and representation learning (lower masking ratio) within the same training framework. Inspired by previous generative models, MAGE uses semantic tokens learned by a vector-quantized GAN at its inputs and outputs, combined with masking. The representation can be further improved by adding a contrastive loss to the encoder output. We extensively evaluate the generation and representation learning capabilities of MAGE. On ImageNet-1K, a single MAGE ViT-L model obtains 9.10 FID on class-unconditional image generation and 78.9% top-1 accuracy under linear probing, achieving state-of-the-art performance in both image generation and representation learning. Code is available at https://github.com/LTH14/mage.
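The core mechanism described above — sampling a variable masking ratio and replacing that fraction of the image's quantized tokens with a mask token — can be sketched as follows. This is a minimal illustration, not the paper's implementation; the truncated-Gaussian sampler and its hyperparameters (`mean`, `std`, and the `[0.5, 1.0]` range) are illustrative assumptions, and `mask_token=-1` is a placeholder id.

```python
import random

def sample_masking_ratio(mean=0.55, std=0.25, lo=0.5, hi=1.0):
    """Sample a masking ratio by rejection from a truncated Gaussian.

    Hyperparameters here are illustrative assumptions. High ratios push
    training toward generation; lower ratios toward representation learning.
    """
    while True:
        r = random.gauss(mean, std)
        if lo <= r <= hi:
            return r

def mask_tokens(tokens, mask_token, ratio):
    """Replace a random `ratio` fraction of token ids with `mask_token`.

    Returns the masked sequence and the sorted masked positions, which a
    decoder would be trained to reconstruct.
    """
    n = len(tokens)
    n_mask = max(1, int(n * ratio))
    idx = set(random.sample(range(n), n_mask))
    masked = [mask_token if i in idx else t for i, t in enumerate(tokens)]
    return masked, sorted(idx)

# Example: 256 quantized token ids, as a VQGAN might produce for one image.
tokens = list(range(16 * 16))
ratio = sample_masking_ratio()
masked, positions = mask_tokens(tokens, mask_token=-1, ratio=ratio)
```

Because the same masked-token-prediction objective is used at every ratio, one model covers both regimes: sampling near 1.0 trains it to synthesize images from scratch, while moderate ratios yield features suitable for linear probing.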