
Object-Centric Learning for Real-World Videos by Predicting Temporal Feature Similarities (2306.04829v2)

Published 7 Jun 2023 in cs.CV and cs.LG

Abstract: Unsupervised video-based object-centric learning is a promising avenue to learn structured representations from large, unlabeled video collections, but previous approaches have only managed to scale to real-world datasets in restricted domains. Recently, it was shown that the reconstruction of pre-trained self-supervised features leads to object-centric representations on unconstrained real-world image datasets. Building on this approach, we propose a novel way to use such pre-trained features in the form of a temporal feature similarity loss. This loss encodes semantic and temporal correlations between image patches and is a natural way to introduce a motion bias for object discovery. We demonstrate that this loss leads to state-of-the-art performance on the challenging synthetic MOVi datasets. When used in combination with the feature reconstruction loss, our model is the first object-centric video model that scales to unconstrained video datasets such as YouTube-VIS.
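The abstract does not spell out the temporal feature similarity loss, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes that patch features from a pre-trained encoder are compared across two frames with cosine similarity, that each row of similarities is turned into a distribution with a temperature softmax, and that the model is trained to predict this distribution under a cross-entropy loss. All function names, the toy features, and the temperature value are hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two feature vectors (lists of floats)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)

def softmax(row, temperature):
    """Numerically stable softmax over one row of similarities."""
    scaled = [x / temperature for x in row]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def similarity_targets(feats_t, feats_next, temperature=0.1):
    """For each patch at time t, a distribution over patches in a later frame.

    Assumption: the targets are softmax-normalized cosine similarities;
    the temperature value here is illustrative only.
    """
    return [
        softmax([cosine_similarity(p, q) for q in feats_next], temperature)
        for p in feats_t
    ]

def cross_entropy(targets, predictions, eps=1e-8):
    """Mean cross-entropy between target and predicted patch distributions."""
    total = 0.0
    for t_row, p_row in zip(targets, predictions):
        total -= sum(t * math.log(p + eps) for t, p in zip(t_row, p_row))
    return total / len(targets)

# Toy example: two patches with 2-D features that shift only slightly over time.
feats_t = [[1.0, 0.0], [0.0, 1.0]]
feats_next = [[0.9, 0.1], [0.1, 0.9]]

targets = similarity_targets(feats_t, feats_next)
uniform = [[0.5, 0.5], [0.5, 0.5]]

loss_perfect = cross_entropy(targets, targets)   # prediction equals the target
loss_uniform = cross_entropy(targets, uniform)   # uninformative prediction
```

In this toy setup each patch places most of its probability mass on the patch it corresponds to in the later frame, so a prediction that matches the targets incurs a lower loss than a uniform guess. This is the sense in which such a target could carry both semantic and temporal information.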

Authors (3)
  1. Andrii Zadaianchuk (11 papers)
  2. Maximilian Seitzer (12 papers)
  3. Georg Martius (86 papers)
Citations (23)
