Electronic Thesis and Dissertation Repository

Anonymization & Generation of Network Packet Datasets Using Deep Learning

Spencer K. Vecile, The University of Western Ontario

Abstract

Corporate networks are constantly bombarded by malicious actors trying to gain access. The current state of the art in network protection is the deep learning-based intrusion detection system (IDS). However, for an IDS to be effective, it needs to be trained on a good dataset. The best datasets for training an IDS are real data captured from large corporate networks. Unfortunately, companies cannot release their network data due to privacy concerns, which creates a lack of public cybersecurity data. In this thesis, I take a novel approach to network dataset anonymization: character-level LSTM models learn the characteristics of a dataset and then generate a new, anonymized, synthetic dataset with characteristics similar to the original. This method shows excellent performance when tested for characteristic preservation and anonymization on three datasets: one that includes malicious and benign URLs, one with DNS packets, and one with malicious and benign TCP packets. Using this method, I take the first step toward solving the lack of published private network datasets.