Performance Evaluation of Lightweight Open-source Large Language Models in Pediatric Consultations: A Comparative Analysis (2407.15862v1)

Published 16 Jul 2024 in cs.LG, cs.AI, cs.CL, and cs.CY

Abstract: LLMs have demonstrated potential applications in medicine, yet data privacy and computational burden limit their deployment in healthcare institutions. Open-source and lightweight versions of LLMs emerge as potential solutions, but their performance, particularly in pediatric settings, remains underexplored. In this cross-sectional study, 250 patient consultation questions were randomly selected from a public online medical forum, with 10 questions from each of 25 pediatric departments, spanning from December 1, 2022, to October 30, 2023. Two lightweight open-source LLMs, ChatGLM3-6B and Vicuna-7B, along with a larger-scale model, Vicuna-13B, and the widely-used proprietary ChatGPT-3.5, independently answered these questions in Chinese between November 1, 2023, and November 7, 2023. To assess reproducibility, each inquiry was replicated once. We found that ChatGLM3-6B demonstrated higher accuracy and completeness than Vicuna-13B and Vicuna-7B (P < .001), but all were outperformed by ChatGPT-3.5. ChatGPT-3.5 received the highest ratings in accuracy (65.2%) compared to ChatGLM3-6B (41.2%), Vicuna-13B (11.2%), and Vicuna-7B (4.4%). Similarly, in completeness, ChatGPT-3.5 led (78.4%), followed by ChatGLM3-6B (76.0%), Vicuna-13B (34.8%), and Vicuna-7B (22.0%) in highest ratings. ChatGLM3-6B matched ChatGPT-3.5 in readability, both outperforming Vicuna models (P < .001). In terms of empathy, ChatGPT-3.5 outperformed the lightweight LLMs (P < .001). In safety, all models performed comparably well (P > .05), with over 98.4% of responses being rated as safe. Repetition of inquiries confirmed these findings. In conclusion, lightweight LLMs demonstrate promising application in pediatric healthcare. However, the observed gap between lightweight and large-scale proprietary LLMs underscores the need for continued development efforts.

Summary

ChatGPT-3.5 achieves the highest accuracy (65.2% rated as good or very good), outperforming the lightweight ChatGLM3-6B (41.2%) and both Vicuna variants.
ChatGLM3-6B matches ChatGPT-3.5 in readability and comes close in completeness (76.0% versus 78.4%), while the Vicuna models lag behind on both.
The paper recommends further refinement through advanced training and cultural contextualization to improve empathy and overall response quality in pediatric consultations.

Evaluation of Lightweight Open-source LLMs in Pediatric Consultations

The paper, "Performance Evaluation of Lightweight Open-source Large Language Models in Pediatric Consultations: A Comparative Analysis," examines how lightweight LLMs handle pediatric consultation queries. It documents both the capabilities and the limitations of these models, which bears on whether and how LLMs can be integrated into healthcare settings.

Study Design and Methods

To assess the performance of lightweight LLMs in pediatric consultations, the study used a cross-sectional design with 250 consultation questions sourced from a public online medical forum, Haodf.com. The questions were randomly selected, 10 from each of 25 pediatric departments, and covered a broad spectrum of medical conditions. Four LLMs were evaluated: the lightweight ChatGLM3-6B and Vicuna-7B, the larger Vicuna-13B, and the proprietary ChatGPT-3.5. Each model independently answered the questions in Chinese. Three qualified pediatricians then rated the responses on five dimensions: accuracy, completeness, readability, empathy, and safety.
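The sampling scheme described above (a fixed number of questions drawn at random from each department) is a form of stratified random sampling. The following is a minimal sketch of that idea, not the authors' actual procedure: the function name `sample_questions`, the placeholder department and question labels, the pool size, and the seed are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def sample_questions(pool, per_department=10, seed=0):
    """Draw `per_department` questions at random from every department.

    `pool` is an iterable of (department, question) pairs.
    Returns a dict mapping each department to its sampled questions.
    """
    rng = random.Random(seed)  # fixed seed so the draw can be repeated

    # Group the candidate questions by department (the strata).
    by_department = defaultdict(list)
    for department, question in pool:
        by_department[department].append(question)

    sample = {}
    for department in sorted(by_department):
        questions = by_department[department]
        if len(questions) < per_department:
            raise ValueError(f"not enough questions in {department}")
        # Sample without replacement inside each stratum.
        sample[department] = rng.sample(questions, per_department)
    return sample


# Hypothetical placeholder pool: 25 departments with 40 candidate questions each.
pool = [
    (f"dept_{d:02d}", f"question_{d:02d}_{q:03d}")
    for d in range(25)
    for q in range(40)
]

sample = sample_questions(pool)
total = sum(len(questions) for questions in sample.values())
print(len(sample), total)  # prints "25 250"
```

Sampling within each department, rather than from the pooled forum posts, guarantees that every department contributes equally regardless of how many questions it receives online.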

Findings

The study reports the following comparative results for the four LLMs:

Accuracy: ChatGLM3-6B surpassed Vicuna-13B and Vicuna-7B (P < .001) but was outperformed by ChatGPT-3.5, which received the highest accuracy ratings at 65.2% “good” or “very good” evaluations. The corresponding shares were 41.2% for ChatGLM3-6B, 11.2% for Vicuna-13B, and 4.4% for Vicuna-7B.
Completeness: ChatGPT-3.5 led with 78.4% of responses rated as “complete” or “very complete,” with ChatGLM3-6B close behind at 76.0%. Vicuna-13B (34.8%) and Vicuna-7B (22.0%) lagged well behind.
Readability: ChatGLM3-6B matched ChatGPT-3.5 in readability, and both significantly outperformed the Vicuna models (P < .001).
Empathy: ChatGPT-3.5 showed greater empathy than the lightweight LLMs (P < .001), indicating a higher capacity for humanistic care in its responses.
Safety: All models performed comparably (P > .05), with over 98.4% of responses deemed safe.

Each question was asked a second time to assess reproducibility, and the repeated run confirmed these findings.
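To make the reported shares more concrete, the accuracy and completeness percentages above can be converted into approximate response counts. This is a rough sketch that assumes each percentage is taken over the 250 questions. That denominator is an assumption: the source does not state how the three pediatricians' ratings were aggregated, so the true counts may differ. The names `N_QUESTIONS`, `top_rating_share`, and `implied_count` are illustrative only.

```python
# Assumption: every percentage uses the 250 sampled questions as its denominator.
N_QUESTIONS = 250

# Shares of responses in the top rating categories, as reported in the study.
top_rating_share = {
    "accuracy": {
        "ChatGPT-3.5": 65.2,
        "ChatGLM3-6B": 41.2,
        "Vicuna-13B": 11.2,
        "Vicuna-7B": 4.4,
    },
    "completeness": {
        "ChatGPT-3.5": 78.4,
        "ChatGLM3-6B": 76.0,
        "Vicuna-13B": 34.8,
        "Vicuna-7B": 22.0,
    },
}


def implied_count(percent, n=N_QUESTIONS):
    """Convert a percentage into the nearest whole number of responses."""
    return round(percent * n / 100)


counts = {
    dimension: {model: implied_count(p) for model, p in shares.items()}
    for dimension, shares in top_rating_share.items()
}

# Gap between the proprietary model and the best lightweight model.
gaps = {
    dimension: by_model["ChatGPT-3.5"] - by_model["ChatGLM3-6B"]
    for dimension, by_model in counts.items()
}

for dimension, by_model in counts.items():
    print(dimension, by_model, "gap:", gaps[dimension])
```

Under this assumption, the gap between ChatGPT-3.5 and ChatGLM3-6B is 60 responses for accuracy but only 6 for completeness, which matches the qualitative description of ChatGLM3-6B as close on completeness and clearly behind on accuracy.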

Implications

The paper underscores the potential of lightweight LLMs in pediatric healthcare environments, particularly when these models are tailored to specific linguistic contexts, as evidenced by ChatGLM3-6B's strong performance in Chinese-language medical consultations. Despite these promising results, the performance gap between lightweight models and the proprietary ChatGPT-3.5 suggests ongoing refinement is necessary. The authors advocate for further development to enhance the capabilities of lightweight LLMs, especially in terms of accuracy, completeness, and empathy.

Future Directions

The research indicates several avenues for future exploration and improvement:

Language and Cultural Contextualization: Tailoring LLMs to specific linguistic and cultural contexts may substantially improve their performance, as suggested by ChatGLM3-6B's results in the Chinese medical context.
Advanced Training Techniques: Employing techniques such as knowledge distillation, domain-specific pre-training, and continuous learning could improve model performance while maintaining computational efficiency.
Integration of Human Feedback: Continuous adaptation based on real-world interactions can refine the models' responses, ensuring greater relevance and accuracy in clinical settings.

Limitations

The study's limitations include a sample that may not be representative of pediatric care globally and an exclusive focus on single-round structured dialogues rather than the multi-round conversations typical of real-world clinical interactions. In addition, the LLMs were not compared directly with human pediatricians, which limits what can be concluded about their efficacy in practical healthcare scenarios.

Conclusion

This research contributes a substantial evaluation of lightweight LLMs in pediatric consultations, highlighting both their promise and the areas requiring development. The findings advocate for ongoing refinement and adaptation of these models, emphasizing the necessity of context-specific training and enhanced capabilities. As LLM technology continues to evolve, its integration into pediatric healthcare presents an opportunity to address critical shortfalls in medical consultation accessibility and efficiency, particularly in resource-limited settings.
