Object-Centric Learning for Real-World Videos by Predicting Temporal Feature Similarities (2306.04829v2)
Abstract: Unsupervised video-based object-centric learning is a promising avenue to learn structured representations from large, unlabeled video collections, but previous approaches have only managed to scale to real-world datasets in restricted domains. Recently, it was shown that the reconstruction of pre-trained self-supervised features leads to object-centric representations on unconstrained real-world image datasets. Building on this approach, we propose a novel way to use such pre-trained features in the form of a temporal feature similarity loss. This loss encodes semantic and temporal correlations between image patches and is a natural way to introduce a motion bias for object discovery. We demonstrate that this loss leads to state-of-the-art performance on the challenging synthetic MOVi datasets. When used in combination with the feature reconstruction loss, our model is the first object-centric video model that scales to unconstrained video datasets such as YouTube-VIS.
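The abstract does not give the exact form of the temporal feature similarity loss, so the following is only a minimal sketch of one plausible reading. It assumes the target for each patch at time t is a softmax over cosine similarities to the patches of a later frame, and that the model is trained with cross-entropy against that distribution. The function names, the temperature value, and the toy features are all hypothetical and are not taken from the paper.

```python
import math

def l2_normalize(v):
    """Scale a feature vector to unit length (epsilon avoids division by zero)."""
    norm = math.sqrt(sum(x * x for x in v)) + 1e-8
    return [x / norm for x in v]

def softmax(row, temperature):
    """Softmax with temperature, shifted by the max for numerical stability."""
    scaled = [x / temperature for x in row]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def similarity_targets(feats_t, feats_next, temperature=0.1):
    """For each patch at time t, build a distribution over patches of a later frame.

    feats_t, feats_next: lists of patch feature vectors (one vector per patch),
    standing in for features from a frozen pre-trained encoder (assumption).
    """
    a = [l2_normalize(f) for f in feats_t]
    b = [l2_normalize(f) for f in feats_next]
    targets = []
    for fa in a:
        cos = [sum(x * y for x, y in zip(fa, fb)) for fb in b]
        targets.append(softmax(cos, temperature))
    return targets

def similarity_loss(pred_logits, targets):
    """Mean cross-entropy between predicted logits and target patch distributions."""
    total = 0.0
    for logits, target in zip(pred_logits, targets):
        m = max(logits)
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        total += -sum(p * (x - log_z) for p, x in zip(target, logits))
    return total / len(targets)

# Toy example: two patches with 2-D features whose content swaps places between frames.
feats_t = [[1.0, 0.0], [0.0, 1.0]]
feats_next = [[0.0, 1.0], [1.0, 0.0]]
targets = similarity_targets(feats_t, feats_next)

good_logits = [[0.0, 10.0], [10.0, 0.0]]  # hypothetical prediction that follows the swap
bad_logits = [[10.0, 0.0], [0.0, 10.0]]   # hypothetical prediction that ignores the motion
loss_good = similarity_loss(good_logits, targets)
loss_bad = similarity_loss(bad_logits, targets)
```

In this toy setup the prediction that follows the patch content to its new location gets the lower loss, which illustrates how such a target could introduce a motion bias alongside the semantic one.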
Authors: Andrii Zadaianchuk, Maximilian Seitzer, Georg Martius