Getting ViT in Shape: Scaling Laws for Compute-Optimal Model Design (2305.13035v5)
Abstract: Scaling laws have recently been employed to derive the compute-optimal model size (number of parameters) for a given compute duration. We advance and refine such methods to infer compute-optimal model shapes, such as width and depth, and successfully implement this in vision transformers. Our shape-optimized vision transformer, SoViT, achieves results competitive with models that exceed twice its size, despite being pre-trained with an equivalent amount of compute. For example, SoViT-400m/14 achieves 90.3% fine-tuning accuracy on ILSVRC-2012, surpassing the much larger ViT-g/14 and approaching ViT-G/14 under identical settings, while requiring less than half the inference cost. We conduct a thorough evaluation across multiple tasks, such as image classification, captioning, VQA, and zero-shot transfer, demonstrating the effectiveness of our model across a broad range of domains and identifying its limitations. Overall, our findings challenge the prevailing approach of blindly scaling up vision models and pave a path toward more informed scaling.
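To make the general idea behind "compute-optimal shape" concrete, here is a minimal sketch (not the paper's actual fitting procedure): fit a saturating power law, loss(compute) = a·compute^(−b) + c, separately for each candidate shape setting (here, model width) from small-scale runs, then choose the setting with the lowest predicted loss at the target compute budget. The functional form, the candidate widths, the units, and the synthetic data below are all assumptions made purely for illustration.

```python
# Illustrative sketch: choose a compute-optimal width by extrapolating
# per-width scaling-law fits to a target compute budget.
# All "observed" losses below are synthetic; in practice they would come
# from small-scale pre-training runs at each compute level.
import numpy as np
from scipy.optimize import curve_fit

def power_law(compute, a, b, c):
    """Saturating power law: loss decays with compute and floors at c."""
    return a * compute ** (-b) + c

rng = np.random.default_rng(0)
compute_grid = np.array([1.0, 3.0, 10.0, 30.0, 100.0])  # compute in exaFLOPs (illustrative units)
candidate_widths = [512, 768, 1024, 1536]                # hypothetical width settings

budget = 500.0          # target compute budget (same illustrative units)
best_width, best_loss = None, np.inf

for width in candidate_widths:
    # Synthetic ground-truth curve: wider models here get a slightly
    # different irreducible-loss floor, plus observation noise.
    floor = 1.0 + 0.1 * np.log(width / 1024) ** 2
    losses = power_law(compute_grid, a=3.0, b=0.25, c=floor)
    losses = losses + rng.normal(scale=0.01, size=losses.shape)

    # Fit the scaling-law parameters (a, b, c) for this width.
    (a, b, c), _ = curve_fit(power_law, compute_grid, losses,
                             p0=[1.0, 0.3, 0.5], maxfev=10000)

    predicted = power_law(budget, a, b, c)  # extrapolate to the target budget
    if predicted < best_loss:
        best_width, best_loss = width, predicted

print(f"Compute-optimal width at {budget:.0f} exaFLOPs: {best_width} "
      f"(predicted loss {best_loss:.3f})")
```

The same recipe extends to other shape dimensions (depth, MLP size) by sweeping each dimension and comparing extrapolated losses at the budget of interest.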