Residual-based Language Models are Free Boosters for Biomedical Imaging (2403.17343v3)
Abstract: In this study, we uncover the unexpected efficacy of residual-based large language models (LLMs) as part of encoders for biomedical imaging tasks, a domain traditionally devoid of language or textual data. Our approach diverges from established methodologies by employing a frozen transformer block, extracted from a pre-trained LLM, as an encoder layer that directly processes visual tokens. This strategy departs from standard multi-modal vision-language frameworks, which typically hinge on language-driven prompts and inputs. We find that these frozen LLM blocks boost performance across a spectrum of biomedical imaging applications, including both 2D and 3D visual classification tasks, serving as plug-and-play boosters. More interestingly, as a byproduct, the proposed framework sets new state-of-the-art results on the extensive, standardized MedMNIST-2D and -3D benchmarks. Through this work, we aim to open new avenues for employing LLMs in biomedical imaging and to enrich the understanding of their potential in this specialized domain.
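The core idea above, inserting a frozen transformer block from a pre-trained LLM into a visual encoder and adding its output back to the visual tokens as a residual, can be sketched as follows. This is a minimal NumPy sketch under stated assumptions: the weights of `FrozenLLMBlock` are random placeholders standing in for real pre-trained LLM weights, the class and variable names (`FrozenLLMBlock`, `ResidualBooster`, `proj_in`, `proj_out`) are hypothetical, and the block uses a single attention head for brevity; it is not the paper's implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class FrozenLLMBlock:
    """Stand-in for a transformer block extracted from a pre-trained LLM.
    In the described framework these weights are frozen (never updated);
    here they are random placeholders, not real LLM weights."""
    def __init__(self, d_model, rng):
        s = 1.0 / np.sqrt(d_model)
        self.Wq = rng.standard_normal((d_model, d_model)) * s
        self.Wk = rng.standard_normal((d_model, d_model)) * s
        self.Wv = rng.standard_normal((d_model, d_model)) * s
        self.Wo = rng.standard_normal((d_model, d_model)) * s
        self.W1 = rng.standard_normal((d_model, 4 * d_model)) * s
        self.W2 = rng.standard_normal((4 * d_model, d_model)) / np.sqrt(4 * d_model)

    def __call__(self, x):
        # Single-head self-attention over the (projected) visual tokens.
        q, k, v = x @ self.Wq, x @ self.Wk, x @ self.Wv
        attn = softmax(q @ k.T / np.sqrt(x.shape[-1]))
        x = x + attn @ v @ self.Wo          # attention sub-layer, residual
        x = x + np.maximum(x @ self.W1, 0) @ self.W2  # ReLU MLP, residual
        return x

class ResidualBooster:
    """Linear adapters (trainable in practice) map visual tokens into the
    LLM block's width and back; the block's output is added to the original
    tokens, so the booster is plug-and-play: removing it recovers the
    unmodified visual encoder."""
    def __init__(self, d_vis, d_llm, rng):
        self.proj_in = rng.standard_normal((d_vis, d_llm)) * 0.02
        self.proj_out = rng.standard_normal((d_llm, d_vis)) * 0.02
        self.block = FrozenLLMBlock(d_llm, rng)

    def __call__(self, tokens):
        # Residual connection: output shape matches the input tokens.
        return tokens + self.block(tokens @ self.proj_in) @ self.proj_out

rng = np.random.default_rng(0)
tokens = rng.standard_normal((16, 64))   # 16 visual tokens, width 64
booster = ResidualBooster(d_vis=64, d_llm=128, rng=rng)
out = booster(tokens)
print(out.shape)  # (16, 64): same shape as the input tokens
```

Because the booster returns `tokens + residual` with the original token shape, it can be dropped between any two layers of an existing 2D or 3D vision backbone without changing the rest of the architecture, which is what makes the frozen block a "free" booster in the sense described above.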