Electronic Thesis and Dissertation Repository

Thesis Format

Monograph

Degree

Master of Science

Program

Computer Science

Supervisor

Madhavji, Nazim H.

Abstract

Context: With an increasing number of applications running on microservices-based cloud systems (such as AWS, GCP, and IBM Cloud), it is challenging for cloud providers to offer uninterrupted services with guaranteed Quality of Service (QoS). Problem Statement: Existing monitoring frameworks often fail to detect critical defects among the large volume of issues generated, which affects recovery response times and the use of maintenance personnel. Also, manually tracing the root causes of issues takes a significant amount of time. Objective: The objective of this work is to: (i) detect performance anomalies in real time by monitoring Key Performance Indicators (KPIs) using distributed tracing events, and (ii) identify their root causes. Proposed Solution: This thesis proposes an automated, prediction-based anomaly detection and localization system that detects performance anomalies of a microservice using machine learning techniques and determines their root causes using a localization process. Novelty: The originality of this work lies in the detection process, which uses a novel ensemble of a time-series forecasting model and three different unsupervised learning techniques. These avoid defining static error thresholds to detect an anomaly and instead follow a dynamic approach. Experimental Results: Different variants of the proposed detection ensembles were evaluated on a real-world production dataset. Two of them outperformed the existing static rule-based approach, with average F1-scores of 86% and 84%, average precision scores of 82% and 77%, and average recall scores of 91% and 93%, respectively, across 6 experiments. The proposed detection ensembles were also evaluated on the Numenta Anomaly Benchmark (NAB) datasets, and the results show that the proposed method scores higher than Numenta's standard HTM model.
Research Methodology: We adopted an agile methodology to conduct our research in an incremental and iterative fashion. Conclusion: The two proposed ensembles for anomaly detection perform better than the existing static rule-based approach.
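
The abstract does not name the forecasting model or the three unsupervised techniques, so the following is only a minimal sketch of the general idea of prediction-based detection with a dynamic threshold, not the thesis's method. It assumes exponential smoothing as a stand-in forecaster and a rolling median plus median absolute deviation (MAD) of recent forecast errors as a stand-in for the unsupervised step. The function names and the parameters `alpha`, `window`, and `k` are hypothetical.

```python
from statistics import median


def ewma_forecast(series, alpha=0.3):
    """One-step-ahead forecasts: forecasts[t] uses only points before t.

    Exponential smoothing is a stand-in forecaster (assumption).
    """
    level = series[0]
    forecasts = [level]  # no history exists for the first point
    for value in series[:-1]:
        level = alpha * value + (1 - alpha) * level
        forecasts.append(level)
    return forecasts


def detect_anomalies(series, window=20, k=6.0, alpha=0.3):
    """Return indices whose forecast error exceeds a threshold that is
    recomputed from the previous `window` errors (no fixed cut-off)."""
    forecasts = ewma_forecast(series, alpha)
    errors = [abs(actual - predicted)
              for actual, predicted in zip(series, forecasts)]
    flagged = []
    for t in range(window, len(series)):
        recent = errors[t - window:t]
        med = median(recent)
        mad = median([abs(e - med) for e in recent])
        # Dynamic threshold: adapts to how noisy the KPI has been lately.
        threshold = med + k * max(mad, 1e-9)
        if errors[t] > threshold:
            flagged.append(t)
    return flagged


if __name__ == "__main__":
    # Synthetic latency KPI (milliseconds) with one injected spike.
    latency = [100 + (i % 5) for i in range(60)]
    latency[45] = 400
    print(detect_anomalies(latency))
```

In this sketch, points immediately after a spike can also be flagged, because the spike contaminates the next few forecasts. A production system would need to handle that, for example by excluding flagged points from the forecaster's update.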

Summary for Lay Audience

The stability of the cloud ecosystem is at stake with the continual growth and expansion of cloud adoption. For example, sluggish access to data, applications, and web pages frustrates users and employees alike, and some performance problems can even cause application crashes and data loss. Existing monitoring frameworks often do not detect serious issues among the large volume of issues generated during cloud system usage. This affects recovery response times and the effective use of maintenance personnel. We have developed an automated system that detects serious performance issues and their root causes, with the aim of helping the maintenance team fix them.

The proposed detection system uses a novel combination of two algorithms to detect anomalies, and it was evaluated on a real-world dataset from a production environment. The results show that two novel detection ensembles perform better than the existing approach, which relies on static error thresholds. We also evaluated the proposed detection methods on independent datasets, with favourable outcomes.

Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.
