
Beyond Limits: Detecting Anomalies in Sparse, High-dimensional Data
Abstract
Anomaly detection is a critical aspect of data-driven decision-making, particularly in high-stakes areas such as fraud detection and the identification of manufacturing defects. However, the proprietary nature and specialized use cases of such data often yield datasets that are both high-dimensional and small in sample size. These challenges arise because the data typically describes complex systems with numerous variables, and acquiring sufficient labeled examples is often cost-prohibitive or time-consuming. As a result, the data is sparse, and its high dimensionality complicates the training of accurate models. This thesis addresses these issues by proposing a novel approach, SparseDetect, designed to detect anomalies in high-dimensional, low-sample-size settings. SparseDetect combines semi-supervised anomaly detection algorithms with advanced statistical techniques to overcome the limitations of traditional methods. By strategically selecting and grouping relevant data, the approach reduces the impact of high dimensionality while ensuring robust model performance. The results demonstrate that SparseDetect achieves recall scores exceeding 92%, outperforming conventional anomaly detection methods, especially in scenarios with limited samples. This research offers valuable insights into anomaly detection for complex datasets, filling a critical gap in the literature and laying the foundation for future advancements in the field.