Explore In-Context Segmentation via Latent Diffusion Models (2403.09616v2)

Published 14 Mar 2024 in cs.CV

Abstract: In-context segmentation has drawn increasing attention with the advent of vision foundation models. Its goal is to segment objects using given reference images. Most existing approaches adopt metric learning or masked image modeling to build the correlation between visual prompts and input image queries. This work approaches the problem from a fresh perspective - unlocking the capability of the latent diffusion model (LDM) for in-context segmentation and investigating different design choices. Specifically, we examine the problem from three angles: instruction extraction, output alignment, and meta-architectures. We design a two-stage masking strategy to prevent interfering information from leaking into the instructions. In addition, we propose an augmented pseudo-masking target to ensure the model predicts without forgetting the original images. Moreover, we build a new and fair in-context segmentation benchmark that covers both image and video datasets. Experiments validate the effectiveness of our approach, demonstrating comparable or even stronger results than previous specialist or visual foundation models. We hope our work inspires others to rethink the unification of segmentation and generation.
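For context on the metric-learning family that the abstract contrasts against, the following is a minimal sketch of prototype matching: pool the reference features under the reference mask into a single prototype, then label each query position by its cosine similarity to that prototype. This is not the paper's LDM-based method. The function names, the nested-list feature layout, and the 0.5 threshold are all assumptions made for illustration.

```python
import math


def masked_average_pool(features, mask):
    """Average the feature vectors at positions where mask is 1.

    features: H x W grid of equal-length feature vectors (nested lists).
    mask:     H x W grid of 0/1 values marking the reference object.
    """
    dim = len(features[0][0])
    total = [0.0] * dim
    count = 0
    for feat_row, mask_row in zip(features, mask):
        for vec, m in zip(feat_row, mask_row):
            if m:
                total = [t + v for t, v in zip(total, vec)]
                count += 1
    if count == 0:
        raise ValueError("reference mask is empty")
    return [t / count for t in total]


def cosine(a, b):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def segment_by_prototype(ref_features, ref_mask, query_features, threshold=0.5):
    """Predict a 0/1 query mask by thresholding similarity to the prototype.

    The threshold value is a hypothetical choice, not taken from the paper.
    """
    prototype = masked_average_pool(ref_features, ref_mask)
    return [
        [1 if cosine(vec, prototype) > threshold else 0 for vec in row]
        for row in query_features
    ]


# Toy 2x2 example with 2-dimensional features.
ref_features = [[[1.0, 0.0], [1.0, 0.0]], [[0.0, 1.0], [0.0, 1.0]]]
ref_mask = [[1, 1], [0, 0]]
query_features = [[[0.0, 2.0], [3.0, 0.0]], [[1.0, 0.1], [0.0, 1.0]]]
predicted_mask = segment_by_prototype(ref_features, ref_mask, query_features)
```

In practice the features would come from a pretrained backbone rather than hand-written lists. The sketch only shows how a reference image and its mask can act as the "prompt" for a query image in this family of methods.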
