
ClusterFormer: Clustering As A Universal Visual Learner (2309.13196v3)

Published 22 Sep 2023 in cs.CV

Abstract: This paper presents CLUSTERFORMER, a universal vision model that is based on the CLUSTERing paradigm with TransFORMER. It comprises two novel designs: (1) recurrent cross-attention clustering, which reformulates the cross-attention mechanism in Transformer and enables recursive updates of cluster centers to facilitate strong representation learning; and (2) feature dispatching, which uses the updated cluster centers to redistribute image features through similarity-based metrics, resulting in a transparent pipeline. This design yields an explainable and transferable workflow, capable of tackling heterogeneous vision tasks (i.e., image classification, object detection, and image segmentation) with varying levels of clustering granularity (i.e., image-, box-, and pixel-level). Empirical results demonstrate that CLUSTERFORMER outperforms various well-known specialized architectures, achieving 83.41% top-1 accuracy on ImageNet-1K for image classification, 54.2% and 47.0% mAP on MS COCO for object detection and instance segmentation, 52.4% mIoU on ADE20K for semantic segmentation, and 55.8% PQ on COCO Panoptic for panoptic segmentation. Given its efficacy, we hope our work can catalyze a paradigm shift in universal models in computer vision.
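
The two designs named in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names, the softmax over the cluster axis, the weighted-mean center update, the residual dispatch, and the iteration count are all assumptions chosen to make the idea concrete, and the actual formulation in the paper may differ.

```python
import numpy as np

def softmax(x, axis):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def recurrent_cluster(features, centers, num_iters=3):
    """Hypothetical recurrent cross-attention clustering.

    features: (N, D) image features, centers: (K, D) initial cluster centers.
    Each iteration softly assigns features to centers, then recomputes
    every center as the assignment-weighted mean of the features.
    """
    dim = features.shape[1]
    for _ in range(num_iters):
        logits = centers @ features.T / np.sqrt(dim)      # (K, N)
        # Assumption: normalize over clusters, so each feature
        # spreads a total weight of 1 across the K centers.
        assign = softmax(logits, axis=0)                  # (K, N)
        weights = assign / (assign.sum(axis=1, keepdims=True) + 1e-8)
        centers = weights @ features                      # (K, D)
    return centers

def dispatch(features, centers):
    """Hypothetical feature dispatching: add to each feature a
    similarity-weighted mixture of the updated centers."""
    dim = features.shape[1]
    sim = softmax(features @ centers.T / np.sqrt(dim), axis=1)  # (N, K)
    return features + sim @ centers

rng = np.random.default_rng(0)
feats = rng.normal(size=(16, 8))        # 16 toy features of dimension 8
init_centers = rng.normal(size=(4, 8))  # 4 toy cluster centers
new_centers = recurrent_cluster(feats, init_centers)
dispatched = dispatch(feats, new_centers)
```

In this sketch the number of centers `K` plays the role of the clustering granularity: one center per class, box, or mask would correspond loosely to the image-, box-, and pixel-level settings mentioned in the abstract.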
