Improving the Generation of Labeled Network Traffic Datasets Through Machine Learning Techniques

Catania, Carlos; Guerra, Jorge

Authors: Carlos Catania, Jorge Guerra
Publication date: 1 October 2017

Abstract

Detecting malicious behavior in network traffic has become an extremely difficult challenge for the security community. Consequently, several intelligence-based tools have been proposed that build models capable of interpreting the information traveling through the network and that help identify suspicious connections as early as possible. However, the lack of high-quality datasets has been one of the main obstacles to developing reliable intelligence-based tools. A well-labeled dataset is fundamental not only for automatically learning models but also for testing their performance. Recently, RiskID emerged with the goal of providing the network security community with a collaborative tool to assist the labeling process. Using visual and statistical techniques, RiskID helps the user generate labeled datasets from real connections. In this article, we present a machine learning extension for RiskID that helps the user in the malware identification process. A preliminary study shows that, as the amount of labeled data increases, machine learning models can be a valuable aid in labeling future traffic connections.

VI Workshop de Seguridad Informática (WSI). Red de Universidades con Carreras en Informática (RedUNCI).