Pixel-Wise Recognition for Holistic Surgical Scene Understanding (2401.11174v3)

Published 20 Jan 2024 in cs.CV, cs.AI, and cs.LG

Abstract: This paper presents the Holistic and Multi-Granular Surgical Scene Understanding of Prostatectomies (GraSP) dataset, a curated benchmark that models surgical scene understanding as a hierarchy of complementary tasks with varying levels of granularity. Our approach encompasses long-term tasks, such as surgical phase and step recognition, and short-term tasks, including surgical instrument segmentation and atomic visual action detection. To exploit our proposed benchmark, we introduce the Transformers for Actions, Phases, Steps, and Instrument Segmentation (TAPIS) model, a general architecture that combines a global video feature extractor with localized region proposals from an instrument segmentation model to tackle the multi-granularity of our benchmark. Through extensive experimentation on our benchmark and on alternative benchmarks, we demonstrate TAPIS's versatility and state-of-the-art performance across different tasks. This work represents a foundational step forward in Endoscopic Vision, offering a novel framework for future research towards holistic surgical scene understanding.
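
The abstract describes TAPIS as fusing a global video feature extractor with localized region proposals from an instrument segmentation model, with heads for long-term tasks (phases, steps) and short-term tasks (per-instrument atomic actions). The following is a minimal PyTorch sketch of that two-stream idea, not the authors' implementation: the module name, projection layers, cross-attention fusion, tensor dimensions, and class counts are all illustrative assumptions made here for clarity.

```python
# Hypothetical sketch of a TAPIS-style two-stream architecture.
# All shapes, class counts, and the fusion mechanism are assumptions.
import torch
import torch.nn as nn


class TAPISSketch(nn.Module):
    """Fuses global video features with instrument region embeddings,
    decoding long-term (phase/step) and short-term (action) outputs."""

    def __init__(self, video_dim=768, region_dim=256, hidden=512,
                 num_phases=11, num_steps=21, num_actions=14):
        super().__init__()
        # Stand-ins for the real backbones: a video transformer would
        # produce video_feats, a segmentation model region_feats.
        self.video_proj = nn.Linear(video_dim, hidden)
        self.region_proj = nn.Linear(region_dim, hidden)
        # Region queries attend to the global spatio-temporal tokens.
        self.cross_attn = nn.MultiheadAttention(hidden, num_heads=8,
                                                batch_first=True)
        # Frame/clip-level heads for long-term tasks.
        self.phase_head = nn.Linear(hidden, num_phases)
        self.step_head = nn.Linear(hidden, num_steps)
        # Region-level head for atomic action detection.
        self.action_head = nn.Linear(hidden, num_actions)

    def forward(self, video_feats, region_feats):
        # video_feats:  (B, T, video_dim) spatio-temporal clip tokens
        # region_feats: (B, R, region_dim) instrument region embeddings
        v = self.video_proj(video_feats)      # (B, T, hidden)
        r = self.region_proj(region_feats)    # (B, R, hidden)
        # Contextualize each region against the global video stream.
        r_ctx, _ = self.cross_attn(query=r, key=v, value=v)
        global_token = v.mean(dim=1)          # pooled clip descriptor
        return {
            "phase": self.phase_head(global_token),    # (B, num_phases)
            "step": self.step_head(global_token),      # (B, num_steps)
            "actions": self.action_head(r_ctx),        # (B, R, num_actions)
        }


if __name__ == "__main__":
    model = TAPISSketch()
    clip = torch.randn(2, 196, 768)    # 2 clips, 196 tokens each
    regions = torch.randn(2, 5, 256)   # up to 5 instrument proposals
    out = model(clip, regions)
    print({k: tuple(t.shape) for k, t in out.items()})
```

The split mirrors the multi-granularity the abstract describes: the pooled clip descriptor drives the long-term phase and step heads, while region tokens, contextualized against the video stream, drive the per-instrument action head.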
