Lite Pose: Efficient Architecture Design for 2D Human Pose Estimation (2205.01271v4)

Published 3 May 2022 in cs.CV

Abstract: Pose estimation plays a critical role in human-centered vision applications. However, it is difficult to deploy state-of-the-art HRNet-based pose estimation models on resource-constrained edge devices due to the high computational cost (more than 150 GMACs per frame). In this paper, we study efficient architecture design for real-time multi-person pose estimation on edge. We reveal that HRNet's high-resolution branches are redundant for models at the low-computation region via our gradual shrinking experiments. Removing them improves both efficiency and performance. Inspired by this finding, we design LitePose, an efficient single-branch architecture for pose estimation, and introduce two simple approaches to enhance the capacity of LitePose, including Fusion Deconv Head and Large Kernel Convs. Fusion Deconv Head removes the redundancy in high-resolution branches, allowing scale-aware feature fusion with low overhead. Large Kernel Convs significantly improve the model's capacity and receptive field while maintaining a low computational cost. With only 25% computation increment, 7x7 kernels achieve +14.0 mAP better than 3x3 kernels on the CrowdPose dataset. On mobile platforms, LitePose reduces the latency by up to 5.0x without sacrificing performance, compared with prior state-of-the-art efficient pose estimation models, pushing the frontier of real-time multi-person pose estimation on edge. Our code and pre-trained models are released at https://github.com/mit-han-lab/litepose.

Summary

The paper proposes a single-branch design that eliminates redundancy in traditional multi-branch architectures for low-computation settings.
It introduces innovations like fusion deconv head and large kernel convolutions to enhance receptive fields and improve mAP.
NAS-driven optimization helps LitePose reduce MACs significantly, enabling real-time pose estimation on edge devices.

Summary of LitePose: Efficient Architecture Design for 2D Human Pose Estimation

The paper presents LitePose, an architecture for efficient 2D human pose estimation on resource-constrained edge devices. It targets real-time multi-person pose estimation, a task that has been bottlenecked by the high computational cost of HRNet-based models (more than 150 GMACs per frame).

Key Contributions

The authors propose LitePose as an alternative to high-resolution multi-branch architectures like HRNet, which, while effective, require significant computational resources. The paper identifies redundancy in HRNet’s high-resolution branches when applied in low-computation settings and advocates for a compact, single-branch design. The efficacy of LitePose is showcased through several innovative strategies:

Gradual Shrinking Experiment: The authors gradually reduce the depth of HRNet's high-resolution branches and find that removing them improves both efficiency and accuracy in the low-computation regime. This finding motivates the move to a single-branch architecture.
Fusion Deconv Head: This approach integrates low-level, high-resolution features directly into the deconvolutional layers, negating the need for redundant multi-branch high-resolution refinement. This modification enables scale-aware fusion with minimal computational overhead.
Large Kernel Convolutions: While image classification reportedly gains little from larger kernels, large kernel convs in LitePose enlarge the receptive field and model capacity at a low computational cost. On the CrowdPose dataset, $7 \times 7$ kernels achieve +14.0 mAP over $3 \times 3$ kernels with only a 25% increase in computation.
Neural Architecture Search (NAS): LitePose uses NAS to optimize layer configurations, channel widths, and input resolution for different computational budgets. Automating this search tailors the architecture to the performance constraints typical of edge devices.
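To see why a larger kernel can be comparatively cheap, the sketch below counts multiply-accumulate operations (MACs) for a single block and the receptive field of a small stack of such blocks. It assumes a MobileNetV2-style inverted-residual block (1x1 expand, k x k depthwise, 1x1 project), which this summary does not specify; the function names, feature-map size, channel count, and expansion ratio are hypothetical and chosen only for illustration. The numbers it prints are not the paper's measurements.

```python
def separable_block_macs(h, w, c_in, c_out, k, expand=6):
    """MACs of one inverted-residual style block (assumed structure).

    Stride 1 and 'same' padding are assumed; bias, normalization and
    activation costs are ignored.
    """
    c_mid = c_in * expand
    expand_macs = h * w * c_in * c_mid      # 1x1 pointwise expand
    depthwise_macs = h * w * c_mid * k * k  # k x k depthwise, one filter per channel
    project_macs = h * w * c_mid * c_out    # 1x1 pointwise project
    return expand_macs + depthwise_macs + project_macs


def receptive_field(kernel_sizes):
    """Receptive field of stacked stride-1 convolutions: 1 + sum(k - 1)."""
    rf = 1
    for k in kernel_sizes:
        rf += k - 1
    return rf


# Hypothetical block configuration, for illustration only.
H = W = 64
C = 32

macs_3x3 = separable_block_macs(H, W, C, C, k=3)
macs_7x7 = separable_block_macs(H, W, C, C, k=7)
cost_ratio = macs_7x7 / macs_3x3
kernel_area_ratio = (7 * 7) / (3 * 3)

print(f"block cost ratio 7x7 / 3x3: {cost_ratio:.2f}")
print(f"kernel area ratio 7x7 / 3x3: {kernel_area_ratio:.2f}")
print("receptive field, four 3x3 layers:", receptive_field([3] * 4))  # 9
print("receptive field, four 7x7 layers:", receptive_field([7] * 4))  # 25
```

Because only the depthwise step depends on the kernel size, the block's total cost grows far more slowly than the kernel area, while the receptive field of a stack grows linearly with the kernel size. The exact overhead depends on channel widths and block design, so the ratio here should not be read as a reproduction of the 25% figure reported in the abstract.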

Performance and Evaluation

The evaluation on the COCO and CrowdPose benchmarks underscores LitePose's efficiency. LitePose reduces MACs by 2.8x to 5.1x compared to HRNet-derived models while achieving comparable or better accuracy (mAP). On mobile platforms, its single-branch configuration is friendlier to parallel execution, and the abstract reports latency reductions of up to 5.0x over prior efficient pose estimation models without sacrificing performance. These results make it well suited to real-world deployments where computational resources are limited.
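For readers unfamiliar with the metric, the following minimal sketch shows how a MAC count is typically tallied for a stack of convolutions and how a reduction factor is derived from two tallies. The toy layer list and the helper names are hypothetical and do not describe LitePose or HRNet; the example only illustrates that, for a fixed network, MACs scale with the area of the feature maps and therefore with the square of the input resolution.

```python
def conv_macs(out_size, c_in, c_out, k):
    """MACs of a standard k x k convolution producing an out_size x out_size map."""
    return out_size * out_size * c_in * c_out * k * k


def network_macs(input_size, layers):
    """Total MACs for a chain of convolutions.

    layers is a list of (c_in, c_out, kernel, stride) tuples. 'Same' padding
    is assumed, and input_size is assumed divisible by the product of strides.
    """
    size = input_size
    total = 0
    for c_in, c_out, k, stride in layers:
        size //= stride  # spatial size after this layer
        total += conv_macs(size, c_in, c_out, k)
    return total


# Hypothetical three-layer stem, for illustration only.
toy_layers = [(3, 16, 3, 2), (16, 32, 3, 2), (32, 64, 3, 2)]

macs_512 = network_macs(512, toy_layers)
macs_256 = network_macs(256, toy_layers)
reduction = macs_512 / macs_256

print(f"MACs at 512x512 input: {macs_512:,}")
print(f"MACs at 256x256 input: {macs_256:,}")
print(f"reduction factor: {reduction:.1f}x")  # 4.0x
```

Halving the input resolution cuts the cost of this toy network by 4x, which is one reason input resolution is a natural knob when fitting a model to a computational budget. MAC counts are only a proxy for speed, so measured on-device latency can differ from what the ratio suggests.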

Implications and Future Work

The transition from a multi-branch to an efficient single-branch architecture marks a significant step towards making sophisticated human pose estimation feasible on edge devices. Practically, this could catalyze innovations in fields such as augmented reality, autonomous systems, and user-interface development in resource-constrained settings. Theoretically, the work invites exploration into further architectural optimizations and adaptive algorithmic designs that minimize computational requirements while maximizing performance.

Future research directions could include exploring different backbone architectures adaptable to LitePose’s framework and refining NAS strategies to incorporate more comprehensive design dimensions, such as power consumption or thermal efficiency indicators alongside computational metrics. Additionally, the robustness of LitePose in real-time scenarios with dynamically varying computational loads could be an insightful area for further investigation.

In conclusion, LitePose shows that a tailored single-branch architecture can overcome the computational constraints of edge devices, extending the practical reach of multi-person pose estimation.

GitHub

GitHub - mit-han-lab/litepose: [CVPR'22] Lite Pose: Efficient Architecture Design for 2D Human Pose Estimation (310 stars)
