What Makes a Good Commit Message? (2202.02974v1)

Published 7 Feb 2022 in cs.SE and cs.HC

Abstract: A key issue in collaborative software development is communication among developers. One modality of communication is a commit message, in which developers describe the changes they make in a repository. As such, commit messages serve as an "audit trail" by which developers can understand how the source code of a project has changed, and why. Hence, the quality of commit messages affects the effectiveness of communication among developers. Commit messages are often of poor quality as developers lack time and motivation to craft a good message. Several automatic approaches have been proposed to generate commit messages. However, these are based on uncurated datasets including considerable proportions of poorly phrased commit messages. In this multi-method study, we first define what constitutes a "good" commit message, and then establish what proportion of commit messages lack information using a sample of almost 1,600 messages from five highly active open source projects. We find that an average of circa 44% of messages could be improved, suggesting the use of uncurated datasets may be a major threat when commit message generators are trained with such data. We also observe that prior work has not considered semantics of commit messages, and there is surprisingly little guidance available for writing good commit messages. To that end, we develop a taxonomy based on recurring patterns in commit messages' expressions. Finally, we investigate whether "good" commit messages can be automatically identified; such automation could prompt developers to write better commit messages.

Authors (5)

Yingchen Tian
Yuxia Zhang
Klaas-Jan Stol
Lin Jiang
Hui Liu

Citations (52)

Summary

Understanding the Quality of Commit Messages in Software Development

The paper "What Makes a Good Commit Message?" examines the quality of commit messages, a key communication channel among developers in collaborative software development. The authors argue that how well a message conveys what changed and why determines whether it can serve as a coherent audit trail, which matters especially in open-source projects.

Core Analysis and Findings

The researchers define a "good" commit message as one that states both what was changed (denoted "What") and why the change was made (denoted "Why"). They analyzed a sample of almost 1,600 commit messages from five highly active open-source software (OSS) projects on GitHub, after filtering out messages generated automatically by bots. On average, about 44% of the messages could be improved. Missing "Why" information was more common than missing "What" information, which points to a gap in explaining the rationale behind changes.
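The What/Why definition above can be made concrete with a small sketch. It assumes each message has already been labeled by hand for the presence of "What" and "Why"; the `LabeledMessage` record, the `share_needing_improvement` helper, and the toy labels are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LabeledMessage:
    """A commit message with manual labels (hypothetical structure)."""

    text: str
    has_what: bool  # does the message say what changed?
    has_why: bool   # does the message say why it changed?

    @property
    def is_good(self) -> bool:
        # "Good" in the sense used above: both What and Why are present.
        return self.has_what and self.has_why


def share_needing_improvement(messages: list[LabeledMessage]) -> float:
    """Return the fraction of messages missing What, Why, or both."""
    if not messages:
        raise ValueError("need at least one labeled message")
    lacking = sum(1 for m in messages if not m.is_good)
    return lacking / len(messages)


# Toy labels, invented for illustration only (not data from the paper).
sample = [
    LabeledMessage("Fix NPE in parser when input is empty", True, True),
    LabeledMessage("Update config", True, False),
    LabeledMessage("Fix tests", True, False),
    LabeledMessage("Cache lookups to avoid repeated disk reads", True, True),
]
print(share_needing_improvement(sample))  # 0.5
```

Running the same computation per project and averaging the results would give a figure comparable in spirit to the reported average, though the labeling itself is the hard part and is done manually in the study.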

Taxonomy of Good Commit Messages

To understand how developers express this information in practice, the authors built a taxonomy from a thematic analysis of well-written messages. For the "Why" component, five expression categories were identified: Describe Issue, Illustrate Requirement, Describe Objective, Imply Necessity, and Missing Why (where the reason is left unstated because it can be inferred from common sense). Analogously, the "What" component was characterized by four expression categories: Summarize Code Object Change, Describe Implementation Principle, Illustrate Function, and Missing What.

The authors also examined how these expression categories correlated with various maintenance activities, including corrective, adaptive, and perfective changes. They found distinct patterns in how the "Why" and "What" information was expressed across different types of maintenance tasks, which could serve as a guide for developers in crafting effective commit messages.
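The taxonomy and the maintenance activities named above can be written down as enumerations, which makes it easy to tally how categories co-occur. This is a minimal sketch: the category names come from the text, but the `Annotation` record, the `why_by_activity` helper, and the toy annotations are assumptions made for illustration, and the counts say nothing about the patterns the authors actually found.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class WhyExpression(Enum):
    DESCRIBE_ISSUE = "Describe Issue"
    ILLUSTRATE_REQUIREMENT = "Illustrate Requirement"
    DESCRIBE_OBJECTIVE = "Describe Objective"
    IMPLY_NECESSITY = "Imply Necessity"
    MISSING_WHY = "Missing Why"


class WhatExpression(Enum):
    SUMMARIZE_CODE_OBJECT_CHANGE = "Summarize Code Object Change"
    DESCRIBE_IMPLEMENTATION_PRINCIPLE = "Describe Implementation Principle"
    ILLUSTRATE_FUNCTION = "Illustrate Function"
    MISSING_WHAT = "Missing What"


class MaintenanceActivity(Enum):
    CORRECTIVE = "corrective"
    ADAPTIVE = "adaptive"
    PERFECTIVE = "perfective"


@dataclass(frozen=True)
class Annotation:
    """One labeled commit message (hypothetical record layout)."""

    activity: MaintenanceActivity
    why: WhyExpression
    what: WhatExpression


def why_by_activity(annotations: list[Annotation]) -> Counter:
    """Count how often each Why category appears per maintenance activity."""
    return Counter((a.activity, a.why) for a in annotations)


# Toy annotations, invented for illustration only (not data from the paper).
annotations = [
    Annotation(MaintenanceActivity.CORRECTIVE, WhyExpression.DESCRIBE_ISSUE,
               WhatExpression.SUMMARIZE_CODE_OBJECT_CHANGE),
    Annotation(MaintenanceActivity.CORRECTIVE, WhyExpression.DESCRIBE_ISSUE,
               WhatExpression.DESCRIBE_IMPLEMENTATION_PRINCIPLE),
    Annotation(MaintenanceActivity.PERFECTIVE, WhyExpression.DESCRIBE_OBJECTIVE,
               WhatExpression.ILLUSTRATE_FUNCTION),
]
counts = why_by_activity(annotations)
print(counts[(MaintenanceActivity.CORRECTIVE, WhyExpression.DESCRIBE_ISSUE)])  # 2
```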

Automated Identification of Good Commit Messages

To identify high-quality commit messages at scale, the paper introduces classification models based on Bidirectional Long Short-Term Memory (Bi-LSTM) networks. These models reached an accuracy of 75.9% in detecting messages that include both "Why" and "What" information. Such models could be used to filter commit message datasets, yielding higher-quality training data for automated commit message generators.
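The curation step described above can be sketched independently of the model behind it. In the snippet below the classifier is any function from message text to a boolean; `keyword_stand_in` is a deliberately crude placeholder and is not the paper's Bi-LSTM. All names, the marker list, and the sample messages are assumptions made for illustration.

```python
from typing import Callable, Iterable, Sequence

# Any function that maps a commit message to "good" (True) or not (False).
Classifier = Callable[[str], bool]


def keyword_stand_in(message: str) -> bool:
    """Placeholder classifier, NOT the paper's Bi-LSTM model.

    It flags a message as good if it has at least four words and contains
    a phrase that often introduces a reason. A trained model would go here.
    """
    lowered = message.lower()
    reason_markers = ("because", "so that", "to avoid", "in order to")
    has_reason = any(marker in lowered for marker in reason_markers)
    return len(lowered.split()) >= 4 and has_reason


def curate(messages: Iterable[str], is_good: Classifier) -> list[str]:
    """Keep only the messages the classifier accepts."""
    return [m for m in messages if is_good(m)]


def accuracy(predictions: Sequence[bool], labels: Sequence[bool]) -> float:
    """Share of predictions that match the manual labels."""
    if not labels or len(predictions) != len(labels):
        raise ValueError("predictions and labels must be non-empty and equal in length")
    correct = sum(1 for p, y in zip(predictions, labels) if p == y)
    return correct / len(labels)


# Sample messages invented for illustration only.
raw = [
    "Cache lookups to avoid repeated disk reads",
    "Update config",
    "Rename helper because the old name shadowed a builtin",
]
print(curate(raw, keyword_stand_in))
```

Keeping the classifier behind a plain callable means the placeholder can be swapped for a trained model without touching the filtering or evaluation code.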

Implications for Practice and Research

The implications of this research are twofold. For practitioners, the taxonomy offers concrete patterns for writing better commit messages, and automatic identification could be used to prompt developers when a message lacks information. Both support clearer communication within development teams and across the OSS community. The findings also point to future work on tools that assist developers in writing quality commit messages.

For researchers, the results underscore the importance of curating benchmark datasets so that poor-quality messages are not used for training. The proposed models for automatic quality assessment offer one way to build such datasets and so support further work on automated commit message generation.

Conclusion

The paper contributes to the understanding of commit message quality in software development. By defining what constitutes a good commit message, cataloguing how such messages are expressed, and proposing models to identify them, it offers guidance and tools for improving developer communication, which is vital for the successful evolution of software projects. Future research could extend these findings by exploring commit messages in other programming languages and development contexts to further refine and validate the proposed models and taxonomy.
