S^2Former-OR: Single-Stage Bi-Modal Transformer for Scene Graph Generation in OR (2402.14461v2)

Published 22 Feb 2024 in cs.CV

Abstract: Scene graph generation (SGG) of surgical procedures is crucial for enhancing holistic cognitive intelligence in the operating room (OR). However, previous works have primarily relied on multi-stage learning, where the generated semantic scene graphs depend on intermediate processes such as pose estimation and object detection. This pipeline may compromise the flexibility of learning multimodal representations, consequently constraining overall effectiveness. In this study, we introduce a novel single-stage bi-modal transformer framework for SGG in the OR, termed S2Former-OR, which complementarily leverages multi-view 2D scenes and 3D point clouds for SGG in an end-to-end manner. Concretely, our model employs a View-Sync Transfusion scheme to encourage interaction among multi-view visual information. Concurrently, a Geometry-Visual Cohesion operation is designed to integrate the synergistic 2D semantic features into the 3D point cloud features. Moreover, based on the augmented features, we propose a novel relation-sensitive transformer decoder that embeds dynamic entity-pair queries and relational trait priors, enabling the direct prediction of entity-pair relations for graph generation without intermediate steps. Extensive experiments validate the superior SGG performance and lower computational cost of S2Former-OR on the 4D-OR benchmark compared with current OR-SGG methods, e.g., a 3-percentage-point increase in Precision and a 24.2M reduction in model parameters. We further compared our method with generic single-stage SGG methods using broader metrics for a comprehensive evaluation, achieving consistently better performance.
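
The abstract outlines a three-part architecture: multi-view 2D feature interaction (View-Sync Transfusion), injection of the fused 2D semantics into 3D point-cloud features (Geometry-Visual Cohesion), and a relation-sensitive transformer decoder whose entity-pair queries predict relations directly. The following is a minimal PyTorch-style sketch of that data flow, not the authors' implementation: the module structure, the use of plain self/cross-attention as stand-ins for the fusion schemes, and the feature dimensions, query count, and class counts are all illustrative assumptions.

```python
# Minimal sketch (not the authors' code) of a single-stage bi-modal SGG pipeline:
# multi-view 2D tokens are fused, merged into 3D point-cloud tokens, and learnable
# entity-pair queries are decoded directly into relation predictions.
import torch
import torch.nn as nn

class BiModalSGGSketch(nn.Module):
    def __init__(self, d_model=256, num_queries=100,
                 num_entity_classes=12, num_relation_classes=14):
        super().__init__()
        # Stand-in for View-Sync Transfusion: self-attention over tokens from all 2D views.
        self.view_fusion = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=2)
        # Stand-in for Geometry-Visual Cohesion: each 3D point token cross-attends
        # to the fused 2D visual tokens.
        self.cohesion = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)
        # Relation-sensitive decoding: learnable entity-pair queries attend to the
        # augmented point features and map straight to (subject, object, predicate)
        # logits, with no intermediate detection stage.
        self.pair_queries = nn.Embedding(num_queries, d_model)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=4)
        self.subject_head = nn.Linear(d_model, num_entity_classes)
        self.object_head = nn.Linear(d_model, num_entity_classes)
        self.relation_head = nn.Linear(d_model, num_relation_classes)

    def forward(self, view_tokens, point_tokens):
        # view_tokens:  (B, num_views * T, d_model) tokens from 2D backbones
        # point_tokens: (B, P, d_model) tokens from a point-cloud backbone
        fused_2d = self.view_fusion(view_tokens)                          # multi-view interaction
        aug_points, _ = self.cohesion(point_tokens, fused_2d, fused_2d)   # inject 2D semantics into 3D
        queries = self.pair_queries.weight.unsqueeze(0).expand(point_tokens.size(0), -1, -1)
        decoded = self.decoder(queries, aug_points)                       # entity-pair queries -> relations
        return {
            "subject_logits": self.subject_head(decoded),
            "object_logits": self.object_head(decoded),
            "relation_logits": self.relation_head(decoded),
        }

# Tiny smoke test with random features (6 views of 32 tokens, 512 point tokens).
model = BiModalSGGSketch()
out = model(torch.randn(2, 6 * 32, 256), torch.randn(2, 512, 256))
print(out["relation_logits"].shape)  # torch.Size([2, 100, 14])
```

The smoke test yields 100 candidate entity-pair queries per sample, each scored over the (hypothetical) entity and relation label sets; the paper's training objectives and relational trait priors are omitted from this sketch.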

