
Optimizing CNN Model Inference on CPUs (1809.02697v3)

Published 7 Sep 2018 in cs.DC

Abstract: The popularity of Convolutional Neural Network (CNN) models and the ubiquity of CPUs imply that better performance of CNN model inference on CPUs can deliver significant gains to a large number of users. To improve the performance of CNN inference on CPUs, current approaches like MXNet and Intel OpenVINO usually treat the model as a graph and use high-performance libraries such as Intel MKL-DNN to implement the operations of the graph. While achieving reasonable performance on individual operations from the off-the-shelf libraries, this solution makes it inflexible to conduct optimizations at the graph level, as the local operation-level optimizations are predefined. Therefore, it is restrictive and misses the opportunity to optimize the end-to-end inference pipeline as a whole. This paper presents NeoCPU, a comprehensive approach to CNN model inference on CPUs that employs a full-stack and systematic scheme of optimizations. NeoCPU optimizes the operations as templates without relying on third-party libraries, which enables further improvement of the performance via operation- and graph-level joint optimization. Experiments show that NeoCPU achieves up to 3.45× lower latency for CNN model inference than the current state-of-the-art implementations on various kinds of popular CPUs.

Citations (146)


Summary

  • The paper demonstrates that NeoCPU significantly reduces inference latency by up to 3.45× through systematic operation- and graph-level optimizations.
  • NeoCPU employs customizable operation templates and a global search algorithm to identify optimal configurations while minimizing costly data layout transformations.
  • Evaluation against baselines across 15 CNN models shows superior performance on varied CPU architectures, emphasizing its applicability in both mainstream and edge devices.

Optimizing CNN Model Inference on CPUs

Introduction

The paper "Optimizing CNN Model Inference on CPUs" addresses the performance limitations inherent in utilizing Convolutional Neural Networks (CNNs) on Central Processing Units (CPUs). CNN models are prominent in computer vision applications, and efficient inference on widespread hardware platforms like CPUs is critically important. This work introduces NeoCPU, a framework agnostic optimization approach, focusing on maximizing CNN inference efficiency on CPUs through operation- and graph-level strategies, providing significant performance improvements.

NeoCPU Overview

NeoCPU encapsulates a systematic, full-stack set of optimizations for CNN inference on CPUs. By avoiding dependence on third-party high-performance libraries such as Intel MKL-DNN, NeoCPU provides customizable templates for operation-level optimizations and orchestrates graph-level transformations. Experiments demonstrate up to a 3.45× reduction in latency compared to state-of-the-art CPU implementations, achieved through these combined optimizations (Figure 1).

Figure 1: The illustration of CONV and the efficient implementation in AVX-512 instructions as an example.

The framework leverages advanced CPU features including SIMD and FMA, and adopts a comprehensive approach to tuning computationally intensive operations like convolutions through configurable templates adaptable to different workloads and architectures. NeoCPU's global optimization strategy reduces data layout transformations, optimizing end-to-end model efficiency.
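Concretely, this kind of layout work splits the channel dimension into blocks that match the SIMD register width (for example, 16 fp32 lanes with AVX-512), producing the NCHW[x]c packing the paper builds its convolution template around. The following is a minimal NumPy sketch of that packing; the block size of 16 and the function name are illustrative assumptions, not NeoCPU's actual code.

```python
import numpy as np

def pack_nchw_to_nchwc(tensor_nchw, c_block=16):
    """Pack an NCHW tensor into NCHW[x]c: split the channel dimension into
    blocks of `c_block` so the innermost axis matches the SIMD vector width
    (e.g., 16 fp32 lanes for AVX-512)."""
    n, c, h, w = tensor_nchw.shape
    assert c % c_block == 0, "channels must be divisible by the block size"
    # (N, C, H, W) -> (N, C//cb, cb, H, W) -> (N, C//cb, H, W, cb)
    return (tensor_nchw
            .reshape(n, c // c_block, c_block, h, w)
            .transpose(0, 1, 3, 4, 2))

# Example: a 1x64x56x56 activation packed for a 16-lane vector unit.
x = np.random.rand(1, 64, 56, 56).astype(np.float32)
print(pack_nchw_to_nchwc(x).shape)  # (1, 4, 56, 56, 16)
```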

Operation Optimization

Central to CNN computation is the convolution operation, known for its arithmetic intensity. NeoCPU's optimization prioritizes data locality and vector processing unit utilization by rearranging and blocking the data dimensions. This extends beyond single-operation tuning, forming a template that supports coherent graph-level optimization without direct manipulation of assembly code.
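As a rough reference for what one instantiation of such a template computes, the blocked loop nest below convolves packed NCHW[x]c input against weights packed per output/input channel block. This is plain Python to show the data movement, not the generated vectorized code; the blocking factors, weight layout, stride-1/no-padding choice, and function name are illustrative assumptions.

```python
import numpy as np

def conv2d_nchwc(x, w):
    """Reference blocked convolution over packed data (stride 1, no padding).
    x: (N, IC//cb, H, W, cb) packed input
    w: (OC//cb, IC//cb, KH, KW, cb_in, cb_out) packed weights
    Returns y: (N, OC//cb, OH, OW, cb)."""
    n, icb, h, wid, cb = x.shape
    ocb, _, kh, kw, _, _ = w.shape
    oh, ow = h - kh + 1, wid - kw + 1
    y = np.zeros((n, ocb, oh, ow, cb), dtype=x.dtype)
    for b in range(n):
        for oc in range(ocb):              # blocks of output channels
            for ic in range(icb):          # blocks of input channels
                for i in range(oh):
                    for j in range(ow):
                        for r in range(kh):
                            for s in range(kw):
                                # Innermost step: (cb,) vector times (cb, cb) block.
                                # A real template would lower this to unrolled FMA
                                # instructions over the channel block.
                                y[b, oc, i, j, :] += (
                                    x[b, ic, i + r, j + s, :] @ w[oc, ic, r, s, :, :]
                                )
    return y
```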

Graph-Level Optimization

A significant part of the performance enhancement stems from minimizing data layout transformations throughout the computation graph, classifying operations as layout-oblivious, layout-tolerant, or layout-dependent. This strategy is executed by maintaining and managing the data layout flow from operation to operation, minimizing costly format transformations at runtime (Figure 2).

Figure 2: Layout optimization of a simple CNN model. The network on the left side shows unnecessary data layout transformations, while the optimized layout on the right reduces these overheads.
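To illustrate how such a pass can keep data in the packed layout across consecutive convolutions, here is a toy Python sketch over a hypothetical linear operator sequence. The operator names, layout strings, and the three-way grouping are illustrative assumptions, not NeoCPU's actual intermediate representation or pass.

```python
# Illustrative grouping: conv/pooling are configured for the packed layout,
# elementwise ops tolerate whatever arrives, flatten/dense need the default layout.
PACKED_OPS       = {"conv2d", "max_pool"}
OBLIVIOUS_OPS    = {"relu", "batch_norm", "add", "softmax"}
DEFAULT_ONLY_OPS = {"flatten", "dense"}

def insert_layout_transforms(ops, packed="NCHW16c", default="NCHW"):
    """Return (op, layout) pairs, inserting explicit layout_transform
    pseudo-ops only where the required layout actually changes."""
    plan, current = [], default
    for op in ops:
        if op in PACKED_OPS:
            required = packed
        elif op in DEFAULT_ONLY_OPS:
            required = default
        else:
            required = current  # layout-oblivious: no constraint
        if required != current:
            plan.append(("layout_transform", f"{current}->{required}"))
            current = required
        plan.append((op, current))
    return plan

for step in insert_layout_transforms(
        ["conv2d", "relu", "conv2d", "relu", "flatten", "dense", "softmax"]):
    print(step)
# Only two layout_transform steps appear: one before the first conv2d and one
# before flatten, instead of converting around every convolution.
```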

Global Optimization Scheme

NeoCPU incorporates an innovative global search for identifying optimal layout and execution parameters. By integrating both local (operation-level) and global (graph-level) searches, NeoCPU constructs a systematic approach to derive best-case configurations for complex, computationally intensive CNN models. Using dynamic programming and heuristic methods, the search navigates vast configuration spaces to enhance performance effectively (Figure 3).

Figure 3: Global search for CNN model inference illustrating layout transformation decisions and their associated overheads.
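A hypothetical sketch of such a search, reduced to a chain of layers with a handful of candidate configurations each: dynamic programming picks one configuration per layer so that the sum of per-layer execution costs and inter-layer layout-transformation costs is minimized. The cost arrays and function name below are made up for illustration; in practice these costs would come from on-device measurements, and this is not NeoCPU's actual algorithm.

```python
def global_search(exec_cost, trans_cost):
    """exec_cost[i][k]: cost of layer i under its k-th candidate config.
    trans_cost[i][k][m]: cost of converting the output of layer i in config k
    into the layout expected by config m of layer i + 1.
    Returns (best total cost, chosen config index per layer)."""
    n = len(exec_cost)
    best = list(exec_cost[0])                    # cheapest way to finish layer 0 in each config
    back = [[0] * len(c) for c in exec_cost]     # backpointers for reconstruction
    for i in range(1, n):
        new_best = []
        for m, cost_m in enumerate(exec_cost[i]):
            choices = [best[k] + trans_cost[i - 1][k][m] for k in range(len(best))]
            k_star = min(range(len(choices)), key=choices.__getitem__)
            back[i][m] = k_star
            new_best.append(choices[k_star] + cost_m)
        best = new_best
    # Backtrack the chosen configuration for every layer.
    m = min(range(len(best)), key=best.__getitem__)
    picks = [m]
    for i in range(n - 1, 0, -1):
        m = back[i][m]
        picks.append(m)
    return min(best), picks[::-1]

# Two layers, two candidate configs each; transforming between differing
# configs costs 0.3, staying in the same config costs nothing.
ec = [[1.0, 1.2], [2.0, 1.5]]
tc = [[[0.0, 0.3], [0.3, 0.0]]]
print(global_search(ec, tc))  # (2.7, [1, 1])
```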

Evaluation and Comparison

Evaluation of NeoCPU against existing baselines, including MXNet, TensorFlow (with nGraph), and OpenVINO, shows consistent advantages. The performance metrics indicate lower inference latency across a variety of CPU architectures, validated on 15 popular CNN models. NeoCPU achieves the lowest latency overall, with advantages especially pronounced on ARM platforms, where baselines such as OpenVINO are not applicable owing to their dependence on x86-specific optimizations (Figure 4).

Figure 4: ResNet-50 on a system with 18-core Intel Skylake CPU.

NeoCPU's competitive edge is largely attributable to its comprehensive, modular optimization framework, which adapts to CPU architectural specifics and yields improvements across CPU families and model configurations.

Conclusion

NeoCPU provides a robust framework that delivers substantial improvements for CNN model inference on CPUs. It can be extended to cover more diverse operations and to leverage quantized computation, making it a valuable asset for scaling performance on both mainstream and edge computing platforms. Future work may expand support to additional hardware types and explore techniques specialized for dynamic model structures.
