Electronic Thesis and Dissertation Repository

Thesis Format



Master of Science


Computer Science


Kontogiannis, Kostas


Most software fault proneness prediction techniques utilize machine learning models which act as black boxes when performing predictions. Software developers cannot obtain any insights as to why such trained models reached their conclusions when applied to new data. This reduces confidence in the prediction results when the model is applied to complex systems. In this thesis, we propose two rule-based and programming language-agnostic fault proneness prediction techniques. The first technique utilizes fuzzy reasoning, while the second utilizes Markov Logic Networks. The rules operate on facts that are produced by harvesting and postprocessing raw data extracted from the GitHub records of the system that is being analyzed. Furthermore, files in each GitHub record are reconciled using bug resolution reports from corresponding Bugzilla repositories. The reconciliation process is used for tagging purposes so that the number of false positives in the raw data can be reduced. To better organize the extracted data, we group GitHub commits to form what we refer to as segments. Reasoning about the fault proneness of a file is then performed at the level of a segment (i.e., whether the file will exhibit a failure within the time frame of the next segment, e.g., the next 10 commits).
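
The segment idea described above can be illustrated with a minimal Python sketch. It is not the thesis implementation: the `Commit` record, the fixed segment size, and the assumption that a file "exhibits a failure" in a segment when a bug-fix-tagged commit touches it are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class Commit:
    """Hypothetical commit record; the field names are illustrative only."""
    sha: str
    files: List[str]
    is_bug_fix: bool  # assumed to come from the Bugzilla reconciliation step


def make_segments(commits: List[Commit], size: int = 10) -> List[List[Commit]]:
    """Split a chronologically ordered commit list into fixed-size segments."""
    return [commits[i:i + size] for i in range(0, len(commits), size)]


def faulty_files(segment: List[Commit]) -> Set[str]:
    """Files touched by at least one bug-fix commit in the segment (assumption)."""
    files: Set[str] = set()
    for commit in segment:
        if commit.is_bug_fix:
            files.update(commit.files)
    return files


def label_next_segment(segments: List[List[Commit]]) -> List[Set[str]]:
    """For each segment except the last, the files that fail in the next one."""
    return [faulty_files(nxt) for nxt in segments[1:]]
```

Under these assumptions, the prediction target for segment k is simply the set of files returned for segment k + 1, which is what a rule set would be asked to anticipate.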

In this thesis we have identified twenty generic rules, and we propose two processes to customize these rules for each system. The first process aims to select the subset of these twenty rules that performs best (i.e., maximizes prediction recall and prediction accuracy) for a given project. The selection is performed on a subset of the project’s historical data that serves as a training set, and the selected rules are then applied to the rest of the system (current data). The second process aims to identify new rules by examining areas of opportunity to maximize prediction recall and accuracy. We have evaluated the proposed approach by applying six different strategies to answer four research questions: which technique is best, whether there is a common set of rules that performs equally well in all projects, whether a rule set performing well in one project can be used in another, and whether customized rules perform better than generic ones.
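
As a rough illustration of the first customization process, the sketch below selects a rule subset on a training set. It rests on several assumptions that do not come from the thesis: rules are crisp Boolean predicates rather than fuzzy or Markov Logic rules, a file is predicted fault-prone when any selected rule fires, the objective is the mean of recall and accuracy, and the search is a greedy forward selection. All names are hypothetical.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Facts = Dict[str, float]            # hypothetical per-file facts for one segment
Rule = Callable[[Facts], bool]      # crisp stand-in for a fuzzy or MLN rule
Example = Tuple[Facts, bool]        # (facts, file was faulty in next segment)


def predict(rules: Sequence[Rule], facts: Facts) -> bool:
    """Predict fault-prone if any selected rule fires (assumed combination)."""
    return any(rule(facts) for rule in rules)


def score(rules: Sequence[Rule], data: Sequence[Example]) -> float:
    """Mean of recall and accuracy on the data (assumed objective)."""
    tp = tn = fn = 0
    for facts, faulty in data:
        predicted = predict(rules, facts)
        if predicted and faulty:
            tp += 1
        elif not predicted and faulty:
            fn += 1
        elif not predicted and not faulty:
            tn += 1
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / len(data) if data else 0.0
    return (recall + accuracy) / 2


def select_rules(rules: Dict[str, Rule], train: Sequence[Example]) -> List[str]:
    """Greedy forward selection: add the rule that most improves the score."""
    chosen: List[str] = []
    best = score([], train)
    while True:
        scored = [
            (score([rules[c] for c in chosen + [name]], train), name)
            for name in rules if name not in chosen
        ]
        if not scored:
            break
        top_score, top_name = max(scored)
        if top_score <= best:   # stop when no remaining rule helps
            break
        chosen.append(top_name)
        best = top_score
    return chosen
```

The names returned by `select_rules` would then be applied unchanged to the remaining (current) data. A greedy search is only one option; with twenty rules an exhaustive search over all subsets is also conceivable.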

We conclude the thesis by providing pointers for future research and by discussing how rule-based systems can be used in the field of fault prediction.

Summary for Lay Audience

In this thesis, we use two rule-based techniques to identify fault-prone files in large software systems. The vast majority of the literature in this field uses Machine Learning for prediction. However, the drawback of Machine Learning approaches is that they do not explain to users why and how a prediction result is reached. In the proposed approach, we allow expert knowledge to be encoded in the form of If-Then rules. We utilize and compare two reasoning approaches, one based on Fuzzy Reasoning and the other on Markov Logic Networks. We introduce twenty generic rules that can be optimized for each project in order to yield maximal recall. The results indicate that the rule-based approaches provide results comparable to those of Machine Learning approaches, with the added benefits of explanations and a high degree of customizability.

Creative Commons License

This work is licensed under a Creative Commons Attribution-Noncommercial-No Derivative Works 4.0 License.

Available for download on Wednesday, August 09, 2023