SGTR+: End-to-end Scene Graph Generation with Transformer (2401.12835v1)
Abstract: Scene Graph Generation (SGG) remains a challenging visual understanding task due to its compositional nature. Most previous works adopt either a bottom-up two-stage approach or a point-based one-stage approach, which often suffers from high time complexity or suboptimal design choices. In this work, we propose a novel SGG method that addresses these issues by formulating the task as a bipartite graph construction problem. Specifically, we develop a transformer-based end-to-end framework that generates an entity proposal set and an entity-aware predicate proposal set, and infers directed edges between them to form relation triplets. Moreover, we design a graph assembling module that infers the connectivity of the bipartite scene graph from the entity-aware structure, enabling scene graph generation in a fully end-to-end manner. On top of this bipartite graph assembling paradigm, we further propose new technical designs that improve the efficacy of entity-aware modeling and the optimization stability of graph assembling. Equipped with the enhanced entity-aware design, our method achieves a favorable trade-off between accuracy and time complexity. Extensive experiments show that our design achieves state-of-the-art or comparable performance on three challenging benchmarks, surpassing most existing approaches while enjoying higher inference efficiency. Code is available at: https://github.com/Scarecrow0/SGTR
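To make the bipartite graph construction idea concrete, below is a minimal sketch of how entity-aware predicate proposals could be connected to entity proposals to form relation triplets. This is an illustration only, not the paper's actual implementation: the function names, the use of box-center distance as the correspondence score, and the top-k selection are all assumptions; SGTR/SGTR+ learn the correspondence from both spatial and semantic cues.

```python
# Illustrative sketch (assumed design, not the official SGTR+ code):
# each predicate proposal carries subject/object "indicator" boxes, and the
# graph assembling step matches those indicators against the entity proposal
# set to produce candidate (subject, predicate, object) links.
import torch


def box_center_distance(boxes_a: torch.Tensor, boxes_b: torch.Tensor) -> torch.Tensor:
    """Pairwise L2 distance between box centers; boxes are (x1, y1, x2, y2)."""
    centers_a = (boxes_a[:, :2] + boxes_a[:, 2:]) / 2          # (Np, 2)
    centers_b = (boxes_b[:, :2] + boxes_b[:, 2:]) / 2          # (Ne, 2)
    return torch.cdist(centers_a, centers_b)                   # (Np, Ne)


def assemble_bipartite_graph(entity_boxes, predicate_sub_boxes, predicate_obj_boxes, top_k=3):
    """
    Connect each entity-aware predicate proposal to its most likely
    subject/object entities (hypothetical scoring by center distance).

    entity_boxes:        (Ne, 4) entity proposal boxes
    predicate_sub_boxes: (Np, 4) subject-indicator boxes of predicate proposals
    predicate_obj_boxes: (Np, 4) object-indicator boxes of predicate proposals
    Returns (Np, top_k) entity indices for the subject and object slots.
    """
    # Smaller distance -> stronger correspondence, so negate to get a score.
    sub_scores = -box_center_distance(predicate_sub_boxes, entity_boxes)  # (Np, Ne)
    obj_scores = -box_center_distance(predicate_obj_boxes, entity_boxes)  # (Np, Ne)
    sub_idx = sub_scores.topk(top_k, dim=1).indices
    obj_idx = obj_scores.topk(top_k, dim=1).indices
    return sub_idx, obj_idx


if __name__ == "__main__":
    torch.manual_seed(0)
    # Random toy proposals: (x1, y1) plus a fixed 0.2 x 0.2 extent.
    entity_boxes = torch.rand(10, 2).repeat(1, 2) + torch.tensor([0.0, 0.0, 0.2, 0.2])
    sub_boxes = torch.rand(5, 2).repeat(1, 2) + torch.tensor([0.0, 0.0, 0.2, 0.2])
    obj_boxes = torch.rand(5, 2).repeat(1, 2) + torch.tensor([0.0, 0.0, 0.2, 0.2])
    sub_idx, obj_idx = assemble_bipartite_graph(entity_boxes, sub_boxes, obj_boxes)
    print(sub_idx.shape, obj_idx.shape)  # torch.Size([5, 3]) torch.Size([5, 3])
```

In the sketch, each predicate proposal keeps several candidate subject/object entities rather than a single hard assignment, which mirrors the paper's motivation of assembling the scene graph as a bipartite structure rather than enumerating all entity pairs up front.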