Thesis Format
Integrated Article
Degree
Master of Science
Program
Computer Science
Supervisor
Narayan, Apurva
Abstract
Anomaly detection is a critical aspect of data-driven decision-making, particularly in high-stakes areas such as fraud detection and the identification of manufacturing defects. However, the proprietary nature and specialized use cases of such data often yield datasets that are both high-dimensional and limited in sample size. These challenges arise because the data typically describes complex systems with numerous variables, and acquiring sufficient labeled examples is often cost-prohibitive or time-consuming. As a result, the data becomes sparse, and its high dimensionality complicates the training of accurate models. This thesis addresses these issues by proposing SparseDetect, a novel approach designed to detect anomalies in high-dimensional, low-sample settings. SparseDetect combines semi-supervised anomaly detection algorithms with advanced statistical techniques, overcoming the limitations of traditional methods. By strategically selecting and grouping relevant data, the approach reduces the impact of high dimensionality while ensuring robust model performance. The results demonstrate that SparseDetect achieves recall scores exceeding 92%, outperforming conventional anomaly detection methods, especially in scenarios with limited samples. This research offers valuable insights into anomaly detection for complex datasets, filling a critical gap in the literature and laying the foundation for future advancements in the field.
Summary for Lay Audience
Anomaly detection is essential for identifying hidden issues in critical areas like fraud prevention and manufacturing defect detection. For instance, consider a pacemaker, an essential medical device. If a fault occurs during its manufacturing and goes undetected, it could have life-threatening consequences. Identifying such defects is necessary but also challenging because these devices have complex operating characteristics, making it difficult to pinpoint abnormalities within the data. Adding to the difficulty, this data is often high-dimensional (involving numerous variables) and has limited samples due to the specialized and proprietary nature of the devices. This research addresses these challenges through SparseDetect, a novel method designed to detect anomalies in such intricate and data-scarce environments. By strategically grouping related features and focusing on the most relevant patterns, SparseDetect simplifies the complexity of high-dimensional data. It employs advanced techniques tailored to work even when anomalies are not pre-identified, ensuring a robust and adaptable approach. SparseDetect has achieved remarkable results, with a recall rate exceeding 92%, outperforming traditional anomaly detection methods. This approach bridges a critical gap in the field, providing a robust solution for identifying hidden issues in complex datasets. By offering an effective and scalable tool for anomaly detection, SparseDetect equips industries to detect and address problems early, safeguarding systems and ensuring reliability in high-stakes applications.
Recommended Citation
Shah, Ayush, "Beyond Limits: Detecting Anomalies in Sparse, High-dimensional Data" (2024). Electronic Thesis and Dissertation Repository. 10570.
https://ir.lib.uwo.ca/etd/10570
Creative Commons License
This work is licensed under a Creative Commons Attribution-Noncommercial 4.0 License.