Understanding the Performance Horizon of the Latest ML Workloads with NonGEMM Workloads (2404.11788v5)

Published 17 Apr 2024 in cs.AR, cs.LG, and cs.PF

Abstract: Among ML operators today, GEneral Matrix Multiplication (GEMM)-based operators are known to be the key operators that form the main backbone of ML models. As their computational overhead dominates the overall execution time (e.g., 42.8% - 96.6% in our results), GEMM operators have been the prime optimization targets for fast ML inference. This has led to the advanced GPUs and accelerators available today, which provide a significant boost in GEMM performance over CPUs, in line with the lesson of Amdahl's law. However, accelerating GEMM has significantly shifted the Amdahl's law landscape for ML inference: because GEMM execution time has decreased, the relative execution time of non-GEMM operators is now significant. Although the importance of non-GEMM performance is increasing, little is known about the non-GEMM performance horizon on the latest hardware platforms and models. Therefore, to guide non-GEMM-oriented optimizations, we conduct a thorough performance analysis of 17 widely adopted ML models from Hugging Face and Torchvision on workstation and data center platforms, with and without GPUs. We discover that the non-GEMM performance bottleneck is a considerable issue across all platforms and models, with non-GEMM operators accounting for 11.3% to 73.6% of total latency, on average. The challenge is significantly aggravated when we apply quantization, a common model compression technique, due to the boosted GEMM performance and the extra non-GEMM operators introduced for dequantization and requantization. To provide insights into non-GEMM optimization targets, we demystify the most dominant non-GEMM operators for each model and deployment software. We also show that widely adopted optimizations such as operator fusion do not completely address the non-GEMM performance bottleneck, with non-GEMM operators still accounting for 15% to 48% of total latency.
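
The kind of GEMM vs. non-GEMM latency breakdown the abstract describes can be approximated with an off-the-shelf profiler. The sketch below is not the paper's methodology or benchmark suite; it is a minimal illustration using torch.profiler on a Torchvision ResNet-50, where the GEMM_OPS set and the choice of model are assumptions made for the example.

```python
# Minimal sketch (not the paper's tooling): estimate the GEMM vs. non-GEMM
# latency split of a PyTorch model with torch.profiler.
import torch
from torch.profiler import profile, ProfilerActivity
from torchvision.models import resnet50

# Illustrative assumption: treat these ATen ops as GEMM-backed. Depending on
# the backend, the heavy kernels may appear under other names
# (e.g., aten::mkldnn_convolution on CPU), so extend the set as needed.
GEMM_OPS = {
    "aten::conv2d", "aten::convolution", "aten::_convolution",
    "aten::mkldnn_convolution", "aten::linear", "aten::matmul",
    "aten::mm", "aten::bmm", "aten::addmm",
}

model = resnet50().eval()
x = torch.randn(1, 3, 224, 224)

with profile(activities=[ProfilerActivity.CPU]) as prof:
    with torch.no_grad():
        model(x)

# Use self time so nested dispatches (e.g., conv2d -> convolution) are not
# double-counted when summing per-operator averages.
events = prof.key_averages()
total_us = sum(e.self_cpu_time_total for e in events)
gemm_us = sum(e.self_cpu_time_total for e in events if e.key in GEMM_OPS)
print(f"GEMM share: {gemm_us / total_us:.1%}")
print(f"non-GEMM share: {(total_us - gemm_us) / total_us:.1%}")
```

Running the same sketch with ProfilerActivity.CUDA on a GPU, or on a quantized model, is one way to observe the shift the abstract describes: as GEMM kernels get faster, the non-GEMM share of total latency grows.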
