"Which LLM should I use?": Evaluating LLMs for tasks performed by Undergraduate Computer Science Students (2402.01687v2)

Published 22 Jan 2024 in cs.CY, cs.HC, and cs.LG

Abstract: This study evaluates the effectiveness of various LLMs in performing tasks common among undergraduate computer science students. Although a number of research studies in the computing education community have explored the possibility of using LLMs for a variety of tasks, there is a lack of comprehensive research comparing different LLMs and evaluating which LLMs are most effective for different tasks. Our research systematically assesses some of the publicly available LLMs such as Google Bard, ChatGPT(3.5), GitHub Copilot Chat, and Microsoft Copilot across diverse tasks commonly encountered by undergraduate computer science students in India. These tasks include code explanation and documentation, solving class assignments, technical interview preparation, learning new concepts and frameworks, and email writing. Evaluation for these tasks was carried out by pre-final year and final year undergraduate computer science students and provides insights into the models' strengths and limitations. This study aims to guide students as well as instructors in selecting suitable LLMs for any specific task and offers valuable insights on how LLMs can be used constructively by students and instructors.


Summary

The paper compared four major LLMs using quantitative (1-10 scale) and qualitative methods to assess their performance on CS tasks.
The findings highlight varied strengths: Microsoft Copilot excelled in code documentation, GitHub Copilot Chat led in programming assignments, and ChatGPT performed best in email writing.
The study underscores the importance of selecting an LLM for the specific task at hand, helping students and educators make informed choices.

Evaluating LLMs for Undergraduate Computer Science Tasks

Introduction

The use of LLMs in educational contexts, particularly in undergraduate computer science programs, has gained substantial attention. This paper compares and evaluates the effectiveness of four publicly available LLMs (Google Bard, ChatGPT 3.5, GitHub Copilot Chat, and Microsoft Copilot) on tasks commonly performed by undergraduate computer science students in India. These tasks include code explanation and documentation, class assignments, technical interview preparation, learning new concepts and frameworks, and email writing. Given how quickly LLMs are proliferating, the research offers students and educators guidance in identifying the most suitable LLM for a specific educational task.

Methodology

The methodology combines quantitative and qualitative analysis of the four LLMs across tasks commonly encountered by computer science students. The evaluation was carried out by pre-final year and final year undergraduate computer science students and covered:

Code Explanation and Documentation
Class Assignments across Programming, Theoretical, and Humanities contexts
Technical Interview Preparation
Learning New Concepts and Frameworks
Writing Emails

The LLMs were assessed based on their ability to provide clear, accurate, and helpful responses across these tasks, with performance rated on a scale from 1 to 10.
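One way to turn such 1-to-10 ratings into a per-task comparison is sketched below. This is only an illustration: the function name `best_model_per_task`, the choice of the arithmetic mean as the aggregate, and the placeholder ratings for `model_a` and `model_b` are all assumptions made here, not the paper's procedure or data.

```python
from collections import defaultdict
from statistics import mean


def best_model_per_task(ratings):
    """Return {task: [top-rated model(s)]} from (task, model, score) triples.

    Assumption: scores are integers on the 1-10 scale and models are
    compared by their mean score per task; ties keep every tied model.
    """
    scores = defaultdict(lambda: defaultdict(list))
    for task, model, score in ratings:
        if not 1 <= score <= 10:
            raise ValueError(f"score {score} is outside the 1-10 scale")
        scores[task][model].append(score)

    result = {}
    for task, by_model in scores.items():
        averages = {model: mean(vals) for model, vals in by_model.items()}
        top = max(averages.values())
        result[task] = sorted(m for m, avg in averages.items() if avg == top)
    return result


# Placeholder ratings, invented purely for illustration (not from the paper).
sample_ratings = [
    ("email writing", "model_a", 8),
    ("email writing", "model_a", 6),
    ("email writing", "model_b", 9),
    ("email writing", "model_b", 7),
    ("code explanation", "model_a", 9),
    ("code explanation", "model_b", 5),
]

if __name__ == "__main__":
    for task, winners in best_model_per_task(sample_ratings).items():
        print(f"{task}: {', '.join(winners)}")
```

Keeping ties as a list, rather than picking one winner arbitrarily, matches the kind of outcome where two models share the top spot on a task.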

Key Findings

The paper found that no single LLM outperformed the others across all assessed tasks.

For code explanation and documentation, Microsoft Copilot excelled, indicating its robustness in dealing with a wide range of programming languages and presenting comprehensive code insights.
In class assignments, GitHub Copilot Chat led in programming assignments, leveraging its programming-centric design, whereas Microsoft Copilot was the frontrunner in both theoretical and humanities assignments, showcasing its versatility.
For technical interview preparation, both GitHub Copilot Chat and ChatGPT demonstrated high performance, suggesting that these models are particularly adept at solving algorithmic problems.
In aiding the learning of new concepts and frameworks, Google Bard emerged as the most effective, offering clear and insightful explanations that facilitate deeper understanding.
When it came to writing emails, ChatGPT was found to be superior, indicating its strength in generating contextually relevant and well-structured content.
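The findings above can be read as a simple lookup from task to best-performing LLM. The sketch below encodes only what this summary states; the names `RECOMMENDED` and `recommend` and the exact task labels used as keys are illustrative choices, and the mapping reflects the models as evaluated in the paper, which may not hold for later versions.

```python
# Task -> best-performing LLM(s), as reported in the key findings above.
# The key strings are illustrative labels, not identifiers from the paper.
RECOMMENDED = {
    "code explanation and documentation": ["Microsoft Copilot"],
    "programming assignments": ["GitHub Copilot Chat"],
    "theoretical assignments": ["Microsoft Copilot"],
    "humanities assignments": ["Microsoft Copilot"],
    "technical interview preparation": ["GitHub Copilot Chat", "ChatGPT"],
    "learning new concepts and frameworks": ["Google Bard"],
    "writing emails": ["ChatGPT"],
}


def recommend(task):
    """Return the LLM(s) reported as strongest for the given task label."""
    key = task.strip().lower()
    if key not in RECOMMENDED:
        known = ", ".join(sorted(RECOMMENDED))
        raise KeyError(f"unknown task {task!r}; known tasks: {known}")
    return RECOMMENDED[key]


if __name__ == "__main__":
    print(recommend("Writing Emails"))
    print(recommend("technical interview preparation"))
```

Returning a list for every task keeps the interface uniform for the one case, technical interview preparation, where two models are reported as performing comparably well.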

Implications

This research underscores the diverse capabilities of current LLMs, suggesting that students and educators could benefit from choosing specific LLMs tailored to the needs of their tasks. It also highlights the importance of understanding the limitations and strengths of each LLM, advocating for a more informed selection process to optimize their utility in educational settings.

The findings further hint at the potential of LLMs to redefine the educational landscape, offering personalized assistance in learning new concepts, preparing for interviews, and handling assignments. However, the paper also cautions against over-reliance on these models, given their varying reliability across different tasks.

Future Directions

The rapidly evolving field of LLMs promises the introduction of more advanced models. Future work could extend this research to include upcoming LLMs, offering a dynamic and updated guide for their application in education. It also opens the floor for developing domain-specific LLMs, fine-tuned to meet the nuanced requirements of educational contexts, particularly in computer science education.

Conclusion

This paper evaluates the performance of four major LLMs on tasks common among undergraduate computer science students. The varied performance across tasks underscores the need to select an LLM based on the specific task at hand. As LLM development continues to advance, this research provides a foundation for using these models in educational settings and guides both students and educators in their selection.
