
LLM-RadJudge: Achieving Radiologist-Level Evaluation for X-Ray Report Generation (2404.00998v1)

Published 1 Apr 2024 in cs.CL and cs.AI

Abstract: Evaluating generated radiology reports is crucial for the development of radiology AI, but existing metrics fail to reflect the task's clinical requirements. This study proposes a novel evaluation framework that uses LLMs to compare and assess radiology reports. We compare the performance of various LLMs and demonstrate that, when using GPT-4, our proposed metric achieves evaluation consistency close to that of radiologists. Furthermore, to reduce costs and improve accessibility, and thus make the method practical, we construct a dataset from LLM evaluation results and perform knowledge distillation to train a smaller model. The distilled model achieves evaluation capabilities comparable to GPT-4. Our framework and distilled model offer an accessible and efficient evaluation method for radiology report generation, facilitating the development of more clinically relevant models. The model will be open-sourced and made accessible.
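
The dataset-construction step mentioned in the abstract (collecting LLM evaluation results to distill into a smaller model) might look roughly like the sketch below. This is an illustration under stated assumptions, not the authors' implementation: the prompt wording, the record layout, the output schema, and the names PROMPT_TEMPLATE, build_distillation_records, and stub_judge are all hypothetical, and stub_judge merely stands in for a call to a large teacher model such as GPT-4.

```python
import json
from typing import Callable, Dict, List, Tuple

# Hypothetical prompt wording; the abstract does not give the actual prompt.
PROMPT_TEMPLATE = (
    "Compare the candidate radiology report with the reference report "
    "and list clinically significant discrepancies.\n"
    "Reference: {reference}\n"
    "Candidate: {candidate}\n"
)


def build_distillation_records(
    pairs: List[Tuple[str, str]],
    judge: Callable[[str], str],
) -> List[Dict[str, str]]:
    """Pair each comparison prompt with the teacher's evaluation text.

    The resulting prompt/completion records could serve as supervised
    fine-tuning data for a smaller student model.
    """
    records = []
    for reference, candidate in pairs:
        prompt = PROMPT_TEMPLATE.format(reference=reference, candidate=candidate)
        records.append({"prompt": prompt, "completion": judge(prompt)})
    return records


def stub_judge(prompt: str) -> str:
    """Placeholder teacher: returns a fixed evaluation in an assumed JSON schema."""
    return json.dumps({"significant_errors": 0})


if __name__ == "__main__":
    example_pairs = [("No acute findings.", "Lungs are clear.")]
    for record in build_distillation_records(example_pairs, stub_judge):
        print(json.dumps(record))
```

In practice the stub would be replaced by a real teacher-model call, and the records written to a file for fine-tuning; both choices are outside what the abstract specifies.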

