Getting ViT in Shape: Scaling Laws for Compute-Optimal Model Design (2305.13035v5)
Abstract: Scaling laws have recently been employed to derive the compute-optimal model size (number of parameters) for a given compute duration. We advance and refine such methods to infer compute-optimal model shapes, such as width and depth, and successfully implement this in vision transformers. Our shape-optimized vision transformer, SoViT, achieves results competitive with models that exceed twice its size, despite being pre-trained with an equivalent amount of compute. For example, SoViT-400m/14 achieves 90.3% fine-tuning accuracy on ILSVRC-2012, surpassing the much larger ViT-g/14 and approaching ViT-G/14 under identical settings, while also requiring less than half the inference cost. We conduct a thorough evaluation across multiple tasks, such as image classification, captioning, VQA, and zero-shot transfer, demonstrating the effectiveness of our model across a broad range of domains and identifying its limitations. Overall, our findings challenge the prevailing approach of blindly scaling up vision models and pave a path toward more informed scaling.
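To make the abstract's core idea concrete, here is a minimal, illustrative sketch of fitting a power law between compute budget and the compute-optimal value of a single shape dimension (e.g., width), then extrapolating to a larger budget. The functional form, variable names, and the synthetic data points below are assumptions for illustration only; they are not the paper's exact procedure or measurements.

```python
# Illustrative sketch (assumed form, synthetic data): fit x*(t) ≈ k * t^s,
# where t is the compute budget and x* is the compute-optimal value of one
# shape dimension (e.g., ViT width), then extrapolate to a larger budget.
import numpy as np

# Hypothetical (compute budget, optimal width) pairs from small-scale sweeps.
compute_budgets = np.array([1e18, 4e18, 1.6e19, 6.4e19])  # e.g., FLOPs
optimal_widths = np.array([384.0, 512.0, 680.0, 900.0])   # synthetic values

# Fit log(x*) = s * log(t) + log(k) by linear least squares in log-log space.
s_fit, log_k = np.polyfit(np.log(compute_budgets), np.log(optimal_widths), deg=1)

# Extrapolate the fitted law to a larger target compute budget.
target_budget = 1e21
predicted_width = np.exp(log_k + s_fit * np.log(target_budget))
print(f"fitted exponent s ≈ {s_fit:.3f}, "
      f"predicted optimal width at {target_budget:.0e} FLOPs ≈ {predicted_width:.0f}")
```

In the paper, separate exponents of this kind are estimated per shape dimension (e.g., width, depth, MLP size), which is what allows the model's shape, not just its parameter count, to be scaled in a compute-optimal way.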