Multi-scale Convolutional Neural Networks for Crowd Counting: A Detailed Overview
The paper by Lingke Zeng et al. presents a novel single-column, multi-scale convolutional neural network (MSCNN) designed to improve crowd counting accuracy and efficiency in static images. It addresses the critical issue of scale variation in crowd scenes, directly tackling the limitations observed in existing multi-column and multi-network models.
Problem Domain and Existing Challenges
Crowd counting is a substantial computer vision challenge with direct implications for public safety in overcrowded areas. Traditional methodologies fall into two families: detection-based methods, which identify each person individually using an object detector, and regression-based methods, which map image features to a count. Detection-based approaches generally suffer from low robustness and accuracy under occlusion and complex backgrounds, while regression-based methods, even with handcrafted features, are similarly limited by scale variation.
Recent advancements have shown that Convolutional Neural Networks (CNNs) can estimate crowd density and counts effectively. Nonetheless, single-network CNN models struggle to handle scale variation adequately. Multi-column CNNs with varied kernel sizes and multi-network ensembles have attempted to resolve this, yet they introduce many more parameters, which raises computational cost and complicates optimization.
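CNN-based counters of this kind are typically trained to regress a density map whose integral equals the crowd count, with ground truth built by placing a small Gaussian at each annotated head position. The sketch below illustrates that standard convention (the fixed bandwidth `sigma=4.0` and truncation radius are illustrative choices, not values taken from the paper):

```python
import numpy as np

def density_map(shape, head_points, sigma=4.0):
    """Build a ground-truth density map: one normalized 2-D Gaussian
    per annotated head. Each Gaussian sums to 1, so the integral of
    the map equals the number of annotated people."""
    h, w = shape
    dmap = np.zeros((h, w), dtype=np.float64)
    # Precompute one Gaussian patch, truncated at 3 sigma.
    r = int(3 * sigma)
    yy, xx = np.mgrid[-r:r + 1, -r:r + 1]
    patch = np.exp(-(xx**2 + yy**2) / (2 * sigma**2))
    patch /= patch.sum()  # normalize: each person contributes exactly 1
    for (x, y) in head_points:
        x, y = int(round(x)), int(round(y))
        # Clip the patch at the image boundary.
        x0, x1 = max(0, x - r), min(w, x + r + 1)
        y0, y1 = max(0, y - r), min(h, y + r + 1)
        px0, py0 = x0 - (x - r), y0 - (y - r)
        dmap[y0:y1, x0:x1] += patch[py0:py0 + (y1 - y0),
                                    px0:px0 + (x1 - x0)]
    return dmap
```

Summing the resulting map recovers the person count, which is what the MAE/MSE evaluation below ultimately compares against.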
Proposed MSCNN Architecture
The authors propose a single-column MSCNN built around a novel multi-scale blob, reminiscent of the Inception module, which extracts scale-relevant features. Each blob applies multiple filter kernel sizes (9×9, 7×7, 5×5, 3×3) in parallel within a single network, optimizing for both accuracy and cost-effectiveness.
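The parallel-branch idea can be sketched in NumPy: each kernel size produces its own feature maps, which are then concatenated along the channel axis. This is a slow, illustrative implementation; the filter count per branch (`filters_per_branch`) and random weights are placeholders, not the paper's configuration:

```python
import numpy as np

def conv2d_same(x, kernels):
    """Naive 'same'-padding 2-D convolution, for illustration only.
    x: (H, W, C_in) feature map; kernels: (k, k, C_in, C_out)."""
    k = kernels.shape[0]
    pad = k // 2
    h, w, _ = x.shape
    xp = np.pad(x, ((pad, pad), (pad, pad), (0, 0)))
    out = np.empty((h, w, kernels.shape[3]))
    for i in range(h):
        for j in range(w):
            window = xp[i:i + k, j:j + k, :]  # (k, k, C_in)
            # Contract the window against every output filter at once.
            out[i, j] = np.tensordot(window, kernels, axes=3)
    return out

def multi_scale_blob(x, rng, filters_per_branch=8):
    """Inception-style multi-scale blob: parallel 9x9/7x7/5x5/3x3
    branches with ReLU, concatenated along the channel axis."""
    c_in = x.shape[2]
    branches = []
    for k in (9, 7, 5, 3):
        kernels = rng.standard_normal((k, k, c_in, filters_per_branch)) * 0.01
        branches.append(np.maximum(conv2d_same(x, kernels), 0.0))  # ReLU
    return np.concatenate(branches, axis=2)
```

Because every branch uses 'same' padding, all four outputs share the spatial size of the input and can be stacked channel-wise, which is what keeps the network single-column.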
A detailed architectural analysis reveals a multilayer perceptron applied pixel-wise (equivalent to a 1×1 convolution) for density map regression. A final Rectified Linear Unit (ReLU) activation guarantees non-negative outputs, a key property given that crowd density values can never be negative.
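A pixel-wise MLP is just a linear map over the channel axis shared by every spatial location. The minimal sketch below shows that final regression step (the weight shapes and random features are assumptions for illustration):

```python
import numpy as np

def density_regression(features, weights, bias=0.0):
    """Pixel-wise MLP (1x1 convolution) mapping a (H, W, C) feature
    map to a single-channel density map. The ReLU keeps every value
    non-negative, matching the nature of a crowd density map."""
    # tensordot over the channel axis applies the same linear map
    # independently at every pixel.
    pre = np.tensordot(features, weights, axes=([2], [0])) + bias  # (H, W, 1)
    return np.maximum(pre, 0.0)[..., 0]

# Usage: the estimated count is the integral (sum) of the density map.
rng = np.random.default_rng(1)
feats = rng.standard_normal((32, 32, 16))
w = rng.standard_normal((16, 1)) * 0.1
dmap = density_regression(feats, w)
count = dmap.sum()
```

Summing the regressed map yields the count estimate, so the non-negativity enforced by the ReLU directly prevents spurious negative contributions to the total.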
Quantitative Results and Analysis
Through experimental evaluation on two benchmark datasets – ShanghaiTech and UCF_CC_50 – the MSCNN demonstrates superior performance. On ShanghaiTech, MSCNN achieves MAE scores of 83.8 and 17.7 on Part A and Part B respectively, outperforming the then state-of-the-art MCNN with a substantial reduction in model parameters (approximately 7× fewer). Similarly, on the UCF_CC_50 dataset, MSCNN attains an MAE of 363.7 with roughly five times fewer parameters than the comparably accurate CrowdNet model.
These numeric results underline the gains in accuracy and computational efficiency, further confirmed by lower MSE values across diverse crowd images at varying scales.
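For reference, the MAE and MSE figures quoted above are computed over per-image counts; in the crowd-counting literature the quantity reported as "MSE" is conventionally the root of the mean squared count error. A minimal sketch:

```python
import math

def counting_metrics(true_counts, pred_counts):
    """MAE and MSE as conventionally defined in crowd-counting papers:
    MAE = (1/N) * sum |z_i - z_hat_i|
    MSE = sqrt((1/N) * sum (z_i - z_hat_i)^2)   # note the square root
    where z_i / z_hat_i are the true / estimated counts for image i."""
    n = len(true_counts)
    mae = sum(abs(t - p) for t, p in zip(true_counts, pred_counts)) / n
    mse = math.sqrt(
        sum((t - p) ** 2 for t, p in zip(true_counts, pred_counts)) / n
    )
    return mae, mse

# Usage with hypothetical counts for three images:
mae, mse = counting_metrics([100, 250, 400], [90, 260, 420])
```

MAE reflects average accuracy, while the squared term in MSE penalizes large per-image errors more heavily, which is why it is read as a robustness indicator.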
Theoretical and Practical Implications
This paper marks a shift towards more computationally feasible, scalable models for crowd counting, countering the extensive parameter overhead common to multi-column and multi-network systems. The MSCNN architecture's modest resource demands make it practical for real-world deployment while maintaining counting accuracy.
Conclusion and Future Directions
The introduction of MSCNN holds significant implications for improving crowd management and public safety measures. Future research could extend the model to dynamic video feeds or integrate additional contextual cues to further improve crowd estimates. The potential for MSCNN to run in real-time applications opens avenues for advancements in interactive environments where crowd density monitoring is crucial.