
Fuzzy and Probabilistic Rule-Based Approaches to Identify Fault Prone Files
Abstract
Most software fault proneness prediction techniques rely on machine learning models that act as black boxes when making predictions. Software developers gain no insight into why a trained model reached its conclusions when applied to new data, which reduces their confidence in the prediction results when the model is applied to complex systems. In this thesis, we propose two rule-based, programming-language-agnostic fault proneness prediction techniques. The first technique uses fuzzy reasoning, while the second uses Markov Logic Networks. The rules operate on facts produced by harvesting and postprocessing raw data extracted from the GitHub records of the system being analyzed. Furthermore, the files in each GitHub record are reconciled against bug resolution reports from the corresponding Bugzilla repositories. This reconciliation is used for tagging purposes, so that the number of false positives in the raw data can be reduced. To better organize the extracted data, we group GitHub commits into what we refer to as segments. The fault proneness of a file is then reasoned about at the level of a segment (i.e., whether the file will exhibit a failure within the time frame of the next segment, e.g., the next 10 commits).
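To make the notion of a segment concrete, the following is a minimal sketch of how commits might be grouped and how a file could be labeled by whether it is faulty in the next segment. It is an illustration under stated assumptions, not the thesis implementation: the `Commit` class, the `is_bug_fix` flag (assumed to come from the Bugzilla reconciliation), and the fixed segment size are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass
class Commit:
    """Hypothetical commit record; field names are illustrative only."""
    sha: str
    files: List[str]
    is_bug_fix: bool = False  # assumed to be set by reconciling with Bugzilla


def make_segments(commits: List[Commit], size: int = 10) -> List[List[Commit]]:
    """Split a chronologically ordered commit list into consecutive segments."""
    return [commits[i:i + size] for i in range(0, len(commits), size)]


def faulty_files(segment: List[Commit]) -> Set[str]:
    """Files touched by at least one bug-fixing commit in the segment."""
    return {f for c in segment if c.is_bug_fix for f in c.files}


def label_segments(segments: List[List[Commit]]) -> List[Dict[str, bool]]:
    """For every segment except the last, map each file it touches to
    whether that file is faulty in the following segment."""
    labels = []
    for current, following in zip(segments, segments[1:]):
        faulty_next = faulty_files(following)
        touched = {f for c in current for f in c.files}
        labels.append({f: f in faulty_next for f in sorted(touched)})
    return labels
```

With a segment size of 10, each label answers the question posed above: does this file exhibit a fault within the next 10 commits?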
In this thesis, we identify twenty generic rules and propose two processes to customize these rules for each system. The first process selects the subset of these twenty rules that performs best (i.e., maximizes prediction recall and prediction accuracy) for a given project. The selection is performed on a subset of the project’s historical data that serves as a training set, and the selected rules are then applied to the rest of the system (current data). The second process identifies new rules by examining areas of opportunity to maximize prediction recall and accuracy. We have evaluated the proposed approach by applying six different strategies to answer four research questions: which technique performs best, whether there is a common set of rules that performs equally well in all projects, whether a rule set that performs well in one project can be used in another, and whether customized rules perform better than generic ones.
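The first customization process can be pictured with a small sketch. It assumes that each rule is a predicate over per-file facts and that rules are ranked individually by recall, then accuracy, on the training set. The abstract does not specify the selection procedure, so this ranking scheme, the `Facts` representation, and the rule names in the usage note are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Facts = Dict[str, float]        # hypothetical per-file facts for one segment
Rule = Callable[[Facts], bool]  # True means the rule predicts "fault prone"
Sample = Tuple[Facts, bool]     # (facts, file was actually faulty)


def score(rule: Rule, samples: List[Sample]) -> Tuple[float, float]:
    """Return (recall, accuracy) of a single rule over labeled samples."""
    tp = fp = tn = fn = 0
    for facts, faulty in samples:
        predicted = rule(facts)
        if predicted and faulty:
            tp += 1
        elif predicted:
            fp += 1
        elif faulty:
            fn += 1
        else:
            tn += 1
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    accuracy = (tp + tn) / len(samples) if samples else 0.0
    return recall, accuracy


def select_rules(rules: Dict[str, Rule], training: List[Sample], k: int) -> List[str]:
    """Keep the names of the k rules with the best (recall, accuracy)
    on the training set; ranking rules one by one is an assumption."""
    ranked = sorted(rules, key=lambda name: score(rules[name], training), reverse=True)
    return ranked[:k]
```

For example, given two illustrative rules such as `many_recent_fixes` and `large_churn`, `select_rules(rules, training, k=1)` returns whichever one scores higher on the training segments, and only that rule would then be applied to the current data.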
We conclude the thesis by providing pointers for future research and discussing how rule-based systems can be used in the field of fault prediction.