
General surgery vision transformer: A video pre-trained foundation model for general surgery (2403.05949v3)

Published 9 Mar 2024 in cs.CV, cs.LG, and q-bio.TO

Abstract: The absence of openly accessible data and specialized foundation models is a major barrier for computational research in surgery. Toward this, (i) we open-source the largest dataset of general surgery videos to date, consisting of 680 hours of surgical videos, including data from robotic and laparoscopic techniques across 28 procedures; (ii) we propose a technique for video pre-training a general surgery vision transformer (GSViT) on surgical videos based on forward video prediction that can run in real time for surgical applications, and we open-source the code and weights of GSViT; (iii) we also release code and weights for procedure-specific fine-tuned versions of GSViT across 10 procedures; (iv) we demonstrate the performance of GSViT on the Cholec80 phase annotation task, displaying improved performance over state-of-the-art single frame predictors.
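The abstract states only that GSViT is pre-trained by forward video prediction. As a rough illustration of what such a self-supervised objective can look like, here is a minimal sketch assuming the simplest variant: predict a future frame from the current one and score the prediction with mean squared error. The frame gap, the MSE loss, the flat-list frame representation, and every function name here (`make_forward_pairs`, `mse`, `forward_prediction_loss`) are hypothetical choices for illustration, not the authors' implementation; in a real pipeline the predictor would be the vision transformer and frames would be image tensors.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical stand-in for an image: a flat list of pixel intensities.
Frame = List[float]


def make_forward_pairs(video: Sequence[Frame], gap: int = 1) -> List[Tuple[Frame, Frame]]:
    """Pair each frame with the frame `gap` steps later.

    The gap is an assumption; the abstract does not specify one.
    """
    if gap < 1:
        raise ValueError("gap must be >= 1")
    return [(video[t], video[t + gap]) for t in range(len(video) - gap)]


def mse(pred: Frame, target: Frame) -> float:
    """Mean squared error between a predicted frame and the true future frame."""
    if len(pred) != len(target):
        raise ValueError("frame sizes differ")
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(target)


def forward_prediction_loss(
    predictor: Callable[[Frame], Frame],
    video: Sequence[Frame],
    gap: int = 1,
) -> float:
    """Average future-frame prediction error over one clip.

    Self-supervised: the targets come from the video itself, so no
    phase or tool annotations are needed for pre-training.
    """
    pairs = make_forward_pairs(video, gap)
    if not pairs:
        raise ValueError("video too short for the requested gap")
    return sum(mse(predictor(x), y) for x, y in pairs) / len(pairs)


if __name__ == "__main__":
    # Toy clip whose pixels brighten by 1.0 per frame.
    clip = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
    copy_current = lambda f: list(f)               # naive baseline
    learned_trend = lambda f: [v + 1.0 for v in f] # "ideal" predictor
    print(forward_prediction_loss(copy_current, clip))   # 1.0
    print(forward_prediction_loss(learned_trend, clip))  # 0.0
```

On the toy clip, a predictor that merely copies the current frame pays a loss of 1.0, while one that has captured the temporal trend reaches 0.0. Minimizing this gap over many hours of footage is what pushes a model to encode scene dynamics, which is the intuition behind using video prediction as a pre-training signal.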
