Synthetic and privacy-preserving traffic trace generation using generative AI models for training Network Intrusion Detection Systems / Aceto, G.; Giampaolo, F.; Guida, C.; Izzo, S.; Pescape, A.; Piccialli, F.; Prezioso, E.. - In: JOURNAL OF NETWORK AND COMPUTER APPLICATIONS. - ISSN 1084-8045. - 229:(2024). [10.1016/j.jnca.2024.103926]
Synthetic and privacy-preserving traffic trace generation using generative AI models for training Network Intrusion Detection Systems
Aceto G.;Giampaolo F.;Izzo S.;Piccialli F.;Prezioso E.
2024
Abstract
Network Intrusion Detection Systems (NIDS) are crucial tools for protecting networked devices from cyberattacks. Recent developments in Artificial Intelligence (AI) have brought substantial advantages to the implementation of NIDSs able to monitor network traffic and block cyberattacks in real time. It is widely recognized in the literature that effectively training a NIDS requires a large quantity of labeled traffic that is representative of attacks. Nonetheless, public and abundant datasets remain scarce, owing to the cost of gathering and labeling real traffic traces and to the privacy concerns raised by sharing them. To tackle these challenges, in this paper we present a generative AI model capable of synthesizing anonymized traffic traces from real ones, thus addressing privacy, abundance, and representativeness. The proposal is based on a Conditional Variational Autoencoder (CVAE) and a preprocessing procedure specifically designed for the generation of new traffic traces. To validate our solution, we conduct an extensive empirical study leveraging three recent, publicly available datasets containing benign and malicious traffic. The validation considers both the classification performance of a robust NIDS and the quality of the synthetic data, in each case compared with the use of real data. We compare our CVAE with two state-of-the-art AI-based traffic data generators and show that a NIDS trained with traces emitted by our generative model suffers only a limited F1-score loss compared to training on real data; competing models instead struggle or fail to generate traces that are as effective for NIDS training and as statistically similar to the original.
We make the synthetic datasets available in both PCAP and tabular formats, to facilitate the reproducibility of our findings and encourage further exploration in the field of generative AI for networking.