General surgery vision transformer: A video pre-trained foundation model for general surgery (2403.05949v3)
Abstract: The absence of openly accessible data and specialized foundation models is a major barrier for computational research in surgery. To address this, (i) we open-source the largest dataset of general surgery videos to date, consisting of 680 hours of surgical videos, including data from robotic and laparoscopic techniques across 28 procedures; (ii) we propose a technique for pre-training a general surgery vision transformer (GSViT) on surgical videos, based on forward video prediction, that can run in real time for surgical applications, and we open-source the code and weights of GSViT; (iii) we also release code and weights for procedure-specific fine-tuned versions of GSViT across 10 procedures; (iv) we demonstrate the performance of GSViT on the Cholec80 phase annotation task, where it improves on state-of-the-art single-frame predictors.