SYRAC: Synthesize, Rank, and Count (2310.01662v3)
Abstract: Crowd counting is a critical task in computer vision, with several important applications. However, existing counting methods rely on labor-intensive density map annotations, which require manually localizing each individual pedestrian. While recent efforts have attempted to alleviate the annotation burden through weakly supervised or semi-supervised learning, these approaches fall short of significantly reducing the workload. We propose a novel approach that eliminates the annotation burden by leveraging latent diffusion models to generate synthetic data. However, these models struggle to reliably understand object quantities, which leads to noisy annotations when they are prompted to produce images with a specific number of objects. To address this, we use latent diffusion models to create two types of synthetic data: one by removing pedestrians from real images, which yields ranked image pairs with a weak but reliable object quantity signal, and the other by generating synthetic images with a predetermined number of objects, which offers a strong but noisy counting signal. Our method uses the ranked image pairs for pre-training and then fits a linear layer to the noisy synthetic images using the resulting crowd quantity features. We report state-of-the-art results for unsupervised crowd counting.
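The two-stage recipe described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the exact loss or fitting procedure, so the hinge-style ranking loss, the least-squares fit, and all function names below are assumptions chosen for illustration. Plain NumPy arrays stand in for the scores and features that a pre-trained backbone would produce.

```python
import numpy as np


def pairwise_ranking_loss(score_more, score_fewer, margin=0.0):
    """Hinge-style ranking loss (an assumed form, not taken from the paper).

    score_more:  scores for the original images (more pedestrians).
    score_fewer: scores for the same images after pedestrians were removed.
    The loss is zero for a pair when score_more >= score_fewer + margin.
    """
    return np.maximum(0.0, score_fewer - score_more + margin).mean()


def fit_linear_count_head(features, noisy_counts):
    """Fit count ~ features @ w + b by ordinary least squares.

    features:     (n, d) array of crowd quantity features (hypothetical input).
    noisy_counts: (n,) array of prompted counts for the synthetic images.
    """
    # Append a column of ones so the bias is learned with the weights.
    design = np.hstack([features, np.ones((features.shape[0], 1))])
    coef, *_ = np.linalg.lstsq(design, noisy_counts, rcond=None)
    return coef[:-1], coef[-1]


def predict_counts(features, w, b):
    """Apply the fitted linear layer to new features."""
    return features @ w + b


if __name__ == "__main__":
    # Stage 1 signal: the second pair is ranked incorrectly and is penalized.
    loss = pairwise_ranking_loss(np.array([2.0, 3.0]), np.array([1.0, 4.0]))
    print("ranking loss:", loss)

    # Stage 2: fit the linear head on toy one-dimensional features.
    feats = np.array([[0.0], [1.0], [2.0], [3.0]])
    counts = np.array([1.0, 3.0, 5.0, 7.0])
    w, b = fit_linear_count_head(feats, counts)
    print("predicted count:", predict_counts(np.array([[4.0]]), w, b))
```

In this sketch the ranking loss only needs to know which image of a pair contains more people, which matches the "weak but reliable" signal, while the linear head is the only component that sees the "strong but noisy" count labels.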