General surgery vision transformer: A video pre-trained foundation model for general surgery (2403.05949v3)
Abstract: The absence of openly accessible data and specialized foundation models is a major barrier for computational research in surgery. To address this, (i) we open-source the largest dataset of general surgery videos to date, consisting of 680 hours of surgical video, including data from robotic and laparoscopic techniques across 28 procedures; (ii) we propose a technique for video pre-training a general surgery vision transformer (GSViT) on surgical videos, based on forward video prediction, that can run in real time for surgical applications, and we open-source the code and weights of GSViT; (iii) we release code and weights for procedure-specific fine-tuned versions of GSViT across 10 procedures; (iv) we demonstrate the performance of GSViT on the Cholec80 phase annotation task, where it outperforms state-of-the-art single-frame predictors.