Panoptic Video Scene Graph Generation (2311.17058v1)
Abstract: Towards building comprehensive real-world visual perception systems, we propose and study a new problem called panoptic video scene graph generation (PVSG). PVSG relates to the existing video scene graph generation (VidSGG) problem, which focuses on temporal interactions between humans and objects grounded with bounding boxes in videos. However, the limitation of bounding boxes in capturing non-rigid objects and backgrounds often causes VidSGG to miss key details crucial for comprehensive video understanding. In contrast, PVSG requires nodes in scene graphs to be grounded by more precise, pixel-level segmentation masks, which facilitate holistic scene understanding. To advance research in this new area, we contribute the PVSG dataset, which consists of 400 videos (289 third-person + 111 egocentric videos) with a total of 150K frames labeled with panoptic segmentation masks as well as fine-grained temporal scene graphs. We also provide a variety of baseline methods and share useful design practices for future work.
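To make the PVSG formulation concrete, the sketch below models a scene graph whose nodes are mask-grounded object tracks and whose edges are time-stamped relation triplets. This is a minimal illustrative data structure, not the dataset's actual annotation schema: all class and field names (`ObjectTrack`, `Relation`, `PVSGAnnotation`, `relations_at`) are assumptions, and masks are represented as pixel sets for simplicity rather than real panoptic segmentation maps.

```python
from dataclasses import dataclass, field

@dataclass
class Relation:
    # A temporal relation triplet: subject/object are object-track IDs,
    # and [start_frame, end_frame] is the span over which the predicate holds.
    subject_id: int
    predicate: str
    object_id: int
    start_frame: int
    end_frame: int

@dataclass
class ObjectTrack:
    # One scene-graph node: a "thing" or "stuff" category plus per-frame masks.
    # Masks are stored sparsely as {frame_index: set of (row, col) pixels};
    # a real pipeline would use dense panoptic segmentation masks instead.
    track_id: int
    category: str
    masks: dict = field(default_factory=dict)

@dataclass
class PVSGAnnotation:
    # A video-level annotation: nodes (tracks) plus temporal relation edges.
    video_id: str
    objects: list = field(default_factory=list)
    relations: list = field(default_factory=list)

    def relations_at(self, frame: int) -> list:
        # Query the scene graph at a single frame: keep only relations
        # whose temporal span covers that frame.
        return [r for r in self.relations
                if r.start_frame <= frame <= r.end_frame]

# Usage: a two-node graph with one relation active over frames 0-10.
ann = PVSGAnnotation(video_id="demo")
ann.objects.append(ObjectTrack(0, "person", {0: {(1, 1), (1, 2)}}))
ann.objects.append(ObjectTrack(1, "cup", {0: {(5, 5)}}))
ann.relations.append(Relation(0, "holding", 1, 0, 10))
print(len(ann.relations_at(5)))   # relation is active at frame 5
print(len(ann.relations_at(20)))  # relation has ended by frame 20
```

The key difference from box-based VidSGG is only in the node grounding: each `ObjectTrack` carries pixel-level masks (including background "stuff" regions) rather than bounding boxes, while the temporal relation edges are structured the same way.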