Electronic Thesis and Dissertation Repository

Thesis Format

Integrated Article

Degree

Master of Science

Program

Computer Science

Supervisor

Kontogiannis, Kostas

Abstract

This thesis addresses software defect prediction within the broader area of continuous software engineering. The approach employs source code and process metrics obtained for each commit and examines whether specific patterns, as the system moves from one commit to the next, can predict an impending bug-inducing commit. The thesis uses the SonarQube Technical Debt open source data, which provides source code and process metrics for each commit in 22 medium- to large-scale open source Apache projects.

Central to this research is the novel use of commits to trace transitions to bug-inducing commits, which facilitates the construction of a predictive model. In this approach, each commit is represented by a vector of metric values that have been pre-processed so that they can be used efficiently. Each such vector defines the “state” of a commit. A significant portion of the methodology is devoted to meticulous data preparation and analysis, including the delineation of commit transitions, feature selection, and rigorous data cleansing. This process is aimed at improving the precision and accuracy of pattern recognition, particularly in identifying transitions leading to bug-inducing commits.

By combining correlation analysis, clustering techniques (K-Means and hierarchical clustering), and a suite of classification strategies such as KNN, Decision Trees, and an innovative percentile-based classification, the study aims to identify emerging state transition patterns in the metric vectors that may be indicative of potential software bugs.
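As a rough illustration of the idea of commit states and transitions described above, the following minimal sketch encodes each transition as the element-wise change between two consecutive commit state vectors and classifies a new transition with a simple KNN vote. The metric values, the `commits` data, the difference-based encoding, and the helper names `transitions` and `knn_predict` are all assumptions made for illustration; the exact encoding and models used in the thesis may differ.

```python
from collections import Counter
from math import dist

# Hypothetical, already-normalised metric vectors per commit
# (e.g. complexity, code smells, churn). Values are made up.
commits = [
    {"id": "c1", "state": (0.10, 0.20, 0.05), "bug_inducing": False},
    {"id": "c2", "state": (0.12, 0.22, 0.06), "bug_inducing": False},
    {"id": "c3", "state": (0.40, 0.60, 0.30), "bug_inducing": True},
    {"id": "c4", "state": (0.41, 0.58, 0.31), "bug_inducing": False},
    {"id": "c5", "state": (0.70, 0.90, 0.55), "bug_inducing": True},
    {"id": "c6", "state": (0.71, 0.91, 0.56), "bug_inducing": False},
    {"id": "c7", "state": (0.95, 1.20, 0.80), "bug_inducing": True},
]


def transitions(history):
    """Pair each commit with its successor.

    Feature: element-wise change in the state vector (an assumed encoding).
    Label: whether the successor commit is bug-inducing.
    """
    result = []
    for prev, nxt in zip(history, history[1:]):
        delta = tuple(b - a for a, b in zip(prev["state"], nxt["state"]))
        result.append((delta, nxt["bug_inducing"]))
    return result


def knn_predict(train, query, k=3):
    """Majority vote among the k training transitions closest to `query`."""
    nearest = sorted(train, key=lambda item: dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


train = transitions(commits)

# A large jump in all metrics resembles the transitions that preceded bugs.
print(knn_predict(train, (0.25, 0.30, 0.20)))
# A near-zero change resembles the benign transitions.
print(knn_predict(train, (0.00, 0.01, 0.00)))
```

In this toy data, large metric jumps precede the bug-inducing commits, so the vote separates the two queries cleanly; real commit histories would need the feature selection and data cleansing steps the abstract mentions before such a classifier is meaningful.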

The results indicate that the proposed technique is promising for recognizing patterns indicative of impending bug-inducing commits. They also shed light on the practical implications of using commit transitions in defect prediction strategies, offering insights into enhancing software development processes.

Summary for Lay Audience

This thesis focuses on predicting software defects as part of continuous software engineering. The study uses source code and process metrics from each commit to investigate whether specific patterns in the system's commit history can predict a bug-inducing commit. Data from 22 medium to large open source Apache projects is analyzed, using source code metrics and process metrics from SonarQube Technical Debt open source data.

The research uses commits to trace transitions to bug-inducing commits and builds a predictive model based on these transitions. Each commit is represented by a vector of metrics values that have been pre-processed for efficient use. This approach establishes the "state" of each commit, allowing for the identification of transitions leading to bug-inducing commits.

A key part of the methodology involves careful data preparation and analysis, including defining commit transitions, selecting features, and cleansing data. Advanced techniques such as correlation analysis, clustering (K-Means and Hierarchical), and classification strategies (KNN, Decision Trees, and percentile-based classification) help identify patterns in state transitions that may indicate potential software bugs.

The study's results suggest that the proposed approach is effective at recognizing patterns that point to potential impending bug-inducing commits. It highlights the practical benefits of using commit transitions for defect prediction, providing insights to enhance software development processes.
