Object-Centric Learning for Real-World Videos by Predicting Temporal Feature Similarities (2306.04829v2)

Published 7 Jun 2023 in cs.CV and cs.LG

Abstract: Unsupervised video-based object-centric learning is a promising avenue to learn structured representations from large, unlabeled video collections, but previous approaches have only managed to scale to real-world datasets in restricted domains. Recently, it was shown that the reconstruction of pre-trained self-supervised features leads to object-centric representations on unconstrained real-world image datasets. Building on this approach, we propose a novel way to use such pre-trained features in the form of a temporal feature similarity loss. This loss encodes semantic and temporal correlations between image patches and is a natural way to introduce a motion bias for object discovery. We demonstrate that this loss leads to state-of-the-art performance on the challenging synthetic MOVi datasets. When used in combination with the feature reconstruction loss, our model is the first object-centric video model that scales to unconstrained video datasets such as YouTube-VIS.
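The abstract does not spell out the loss, so the following is only a minimal sketch of one way a temporal feature similarity objective could look, not the authors' implementation. It assumes that targets are built as a row-wise softmax over cosine similarities between patch features of two frames, and that the model's predictions are scored with cross-entropy. The function names, the `temperature` value, and the toy features are all hypothetical; a practical version would use an array library and features from a pre-trained self-supervised encoder.

```python
import math


def l2_normalize(v):
    """Scale a feature vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def similarity_targets(feats_t, feats_tk, temperature=0.1):
    """For each patch in frame t, a softmax over its cosine similarities
    to every patch in a later frame (assumed target construction)."""
    a = [l2_normalize(f) for f in feats_t]
    b = [l2_normalize(f) for f in feats_tk]
    targets = []
    for fa in a:
        sims = [sum(x * y for x, y in zip(fa, fb)) for fb in b]
        m = max(sims)  # subtract the max for numerical stability
        exps = [math.exp((s - m) / temperature) for s in sims]
        z = sum(exps)
        targets.append([e / z for e in exps])
    return targets


def cross_entropy(targets, predicted_logits):
    """Mean cross-entropy between target rows and softmax(predicted_logits)."""
    total = 0.0
    for t_row, logit_row in zip(targets, predicted_logits):
        m = max(logit_row)
        log_z = m + math.log(sum(math.exp(l - m) for l in logit_row))
        total -= sum(t * (l - log_z) for t, l in zip(t_row, logit_row))
    return total / len(targets)


# Toy example: two patches per frame, two feature dimensions (made-up values).
feats_t = [[1.0, 0.0], [0.0, 1.0]]
feats_tk = [[0.9, 0.1], [0.1, 0.9]]
targets = similarity_targets(feats_t, feats_tk)

# Hypothetical model outputs: one that matches the correspondence, one that does not.
good_logits = [[5.0, 0.0], [0.0, 5.0]]
bad_logits = [[0.0, 5.0], [5.0, 0.0]]
loss_good = cross_entropy(targets, good_logits)
loss_bad = cross_entropy(targets, bad_logits)
```

In this toy setup each patch is most similar to its counterpart in the later frame, so predictions that follow that correspondence receive the lower loss. This illustrates how such a target can carry both semantic similarity and motion information.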
