2 research outputs found

    A Plea for Utilising Synthetic Data when Performing Machine Learning Based Cyber-Security Experiments

    No full text

    Quantitative Analysis of Evaluation Criteria for Generative Models

    Machine Learning (ML) is rapidly being integrated into critical aspects of cybersecurity, particularly network intrusion/anomaly detection. However, ML techniques require large volumes of data to be effective: data availability is a critical aspect of the ML process for training, classification, and testing. One solution is to generate realistic synthetic data, and a promising approach is to use ML itself to perform the data generation. With the ability to generate synthetic data comes the need to evaluate the “realness” of the generated data. This research focuses specifically on the problem of evaluating the evaluation criteria themselves, so that future work can draw on quantitative evidence for the criteria it adopts. The goal of this research is to provide a framework that can be used to inform and improve the process of generating synthetic semi-structured sequential data. A series of experiments evaluates a chosen set of metrics for discriminative ability and efficiency. The results show that the choice of feature space in which distances are calculated is critical: the ability to discriminate between real and generated data hinges on that space, and within a suitable feature space the choice of metric significantly affects the sample distance distributions. This work makes three main contributions. First, it provides the first known framework for evaluating metrics for semi-structured sequential synthetic data generation. Second, it provides a “black box” evaluation framework that is generator agnostic. Third, it provides the first known evaluation of metrics for semi-structured sequential data.
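The discriminative-ability idea in the abstract can be sketched in a few lines: map each sequence into a feature space, compute distance distributions between real/real and real/generated sample pairs, and check whether the two distributions separate. This is a minimal illustration, not the paper's actual framework; the `feature_space` map (per-sequence summary statistics) and the Gaussian toy data are assumptions introduced for the example.

```python
import numpy as np

def feature_space(x):
    # Hypothetical feature map: per-sequence summary statistics.
    # The abstract's key finding is that this choice is critical.
    return np.array([x.mean(), x.std(), np.abs(np.diff(x)).mean()])

def distance_distribution(samples_a, samples_b, metric):
    # Distances between every (a, b) pair, computed in the feature
    # space rather than on the raw sequences. Self-pairs are included
    # when samples_a is samples_b, which only tightens the comparison.
    feats_a = [feature_space(s) for s in samples_a]
    feats_b = [feature_space(s) for s in samples_b]
    return np.array([metric(fa, fb) for fa in feats_a for fb in feats_b])

def euclidean(u, v):
    return np.linalg.norm(u - v)

rng = np.random.default_rng(0)
# Toy stand-ins: "real" sequences vs. "generated" sequences from a
# deliberately mismatched distribution (an assumption for illustration).
real = [rng.normal(0.0, 1.0, 50) for _ in range(20)]
fake = [rng.normal(0.5, 2.0, 50) for _ in range(20)]

within = distance_distribution(real, real, euclidean)
across = distance_distribution(real, fake, euclidean)

# A metric/feature-space pair discriminates when real-vs-generated
# distances are systematically larger than real-vs-real distances.
print(across.mean(), within.mean())
```

Because the evaluation only needs samples, not access to the generator itself, this style of comparison is generator agnostic in the sense the abstract describes.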