Electronic Thesis and Dissertation Repository

Thesis Format

Monograph

Degree

Master of Engineering Science

Program

Electrical and Computer Engineering

Collaborative Specialization

Artificial Intelligence

Supervisor

Samarabandu, Jagath

Abstract

Corporate networks are constantly bombarded by malicious actors trying to gain access. The current state of the art in network protection is the deep learning-based intrusion detection system (IDS). However, for an IDS to be effective, it needs to be trained on a good dataset. The best datasets for training an IDS are real data captured from large corporate networks. Unfortunately, companies cannot release their network data due to privacy concerns, which creates a lack of public cybersecurity data. In this thesis, I take a novel approach to network dataset anonymization: character-level LSTM models learn the characteristics of a dataset and then generate a new, anonymized, synthetic dataset with characteristics similar to the original. This method shows excellent performance when tested for characteristic preservation and anonymization on three datasets: one that includes malicious and benign URLs, one with DNS packets, and one with malicious and benign TCP packets. With this method, I take the first step toward solving the lack of published private network datasets.

Summary for Lay Audience

Corporate networks are constantly bombarded by hackers trying to gain access. The current state of the art in network protection is the intrusion detection system (IDS) driven by artificial intelligence (AI). However, for an IDS to be effective, it needs to learn what a hacker's network activity looks like from a good dataset. The best datasets for training an IDS are real data from large corporate networks. Unfortunately, companies cannot release their network data due to privacy concerns, which creates a lack of public cybersecurity data. In this thesis, I take a novel approach to network dataset anonymization: AI learns the characteristics of a dataset and then generates a new, anonymized, synthetic dataset with characteristics similar to the original. This method is tested for characteristic preservation and anonymization on three datasets: one that includes malicious and benign website addresses, one with DNS packets, and one with malicious and benign TCP packets. The results show that the AI was able to learn the structure and composition of these datasets and then generate its own synthetic, anonymized versions of them. With this AI-driven approach, I take the first step toward solving the lack of publicly available private network datasets for training IDSs.

Creative Commons License

This work is licensed under a Creative Commons Attribution-Noncommercial 4.0 License
