
    Time-series Generation by Contrastive Imitation

    Consider learning a generative model for time-series data. The sequential setting poses a unique challenge: Not only should the generator capture the conditional dynamics of (stepwise) transitions, but its open-loop rollouts should also preserve the joint distribution of (multi-step) trajectories. On one hand, autoregressive models trained by MLE allow learning and computing explicit transition distributions, but suffer from compounding error during rollouts. On the other hand, adversarial models based on GAN training alleviate such exposure bias, but transitions are implicit and hard to assess. In this work, we study a generative framework that seeks to combine the strengths of both: Motivated by a moment-matching objective to mitigate compounding error, we optimize a local (but forward-looking) transition policy, where the reinforcement signal is provided by a global (but stepwise-decomposable) energy model trained by contrastive estimation. At training, the two components are learned cooperatively, avoiding the instabilities typical of adversarial objectives. At inference, the learned policy serves as the generator for iterative sampling, and the learned energy serves as a trajectory-level measure for evaluating sample quality. By expressly training a policy to imitate the sequential behavior of time-series features in a dataset, this approach embodies "generation by imitation". Theoretically, we illustrate the correctness of this formulation and the consistency of the algorithm. Empirically, we evaluate its ability to generate predictively useful samples from real-world datasets, verifying that it performs at the standard of existing benchmarks.
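The division of labor described in the abstract — a transition policy for iterative sampling and a stepwise-decomposable energy for trajectory-level quality scoring — can be illustrated with a toy sketch. Everything below (the linear dynamics, the fixed "true" coefficient, the function names) is an illustrative assumption, not the paper's actual method:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy transition policy: x_{t+1} = a * x_t + noise. In the paper this would
# be a learned, forward-looking policy; here it is a stand-in scalar model.
def rollout(a, x0, T, noise=0.1):
    xs = [x0]
    for _ in range(T):
        xs.append(a * xs[-1] + noise * rng.standard_normal())
    return np.array(xs)

# Toy stepwise-decomposable energy: a sum of per-transition energies,
# here squared residuals against an assumed "true" dynamic a_true.
# Lower total energy = more realistic-looking trajectory.
def energy(traj, a_true=0.9):
    residuals = traj[1:] - a_true * traj[:-1]
    return float(np.sum(residuals ** 2))

real = rollout(0.9, 1.0, 50)   # rollout matching the true dynamics
fake = rollout(0.2, 1.0, 50)   # rollout from a mismatched policy

# At inference, the energy serves as a trajectory-level quality measure:
assert energy(real) < energy(fake)
```

The point of the sketch is only the interface: iterative sampling through a local policy, and evaluation through a global energy that decomposes over transitions.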

    Using GANs for Sharing Networked Time Series Data: Challenges, Initial Promise, and Open Questions

    Limited data access is a longstanding barrier to data-driven research and development in the networked systems community. In this work, we explore if and how generative adversarial networks (GANs) can be used to incentivize data sharing by enabling a generic framework for sharing synthetic datasets with minimal expert knowledge. As a specific target, our focus in this paper is on time series datasets with metadata (e.g., packet loss rate measurements with corresponding ISPs). We identify key challenges of existing GAN approaches for such workloads with respect to fidelity (e.g., long-term dependencies, complex multidimensional relationships, mode collapse) and privacy (i.e., existing guarantees are poorly understood and can sacrifice fidelity). To improve fidelity, we design a custom workflow called DoppelGANger (DG) and demonstrate that across diverse real-world datasets (e.g., bandwidth measurements, cluster requests, web sessions) and use cases (e.g., structural characterization, predictive modeling, algorithm comparison), DG achieves up to 43% better fidelity than baseline models. Although we do not resolve the privacy problem in this work, we identify fundamental challenges with both classical notions of privacy and recent advances to improve the privacy properties of GANs, and suggest a potential roadmap for addressing these challenges. By shedding light on the promise and challenges, we hope our work can rekindle the conversation on workflows for data sharing. Comment: Published in IMC 2020. 20 pages, 26 figures.
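The fidelity challenges named above (e.g., long-term dependencies) are typically checked by comparing structural statistics between real and synthetic series. The sketch below is not DoppelGANger itself, only a toy fidelity check of the kind used for structural characterization; the "real" random walk and "synthetic" white-noise baseline are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

def autocorr(x, lag):
    """Lag-k autocorrelation: a simple statistic for long-term structure."""
    x = x - x.mean()
    return float(np.dot(x[:-lag], x[lag:]) / np.dot(x, x))

# Toy "real" measurements: a random walk, which is strongly autocorrelated.
real = np.cumsum(rng.standard_normal(1000))
# Toy "synthetic" baseline: white noise, which destroys that structure.
synthetic = rng.standard_normal(1000)

# A high-fidelity generator should preserve such statistics; the gap
# quantifies how badly the white-noise baseline fails to.
gap = abs(autocorr(real, 10) - autocorr(synthetic, 10))
assert gap > 0.5
```

Checks like this (along with distributional and downstream-task comparisons) are what "up to 43% better fidelity" claims are measured against; the exact metrics used by DG are described in the paper, not here.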

    Traffic microstructures and network anomaly detection

    Much hope has been put in the modelling of network traffic with machine learning methods to detect previously unseen attacks. Many methods rely on features on a microscopic level such as packet sizes or interarrival times to identify reoccurring patterns and detect deviations from them. However, the success of these methods depends both on the quality of the corresponding training and evaluation data and on an understanding of the structures that the methods learn. Currently, the academic community is lacking both, with widely used synthetic datasets facing serious problems and the disconnect between methods and data being named the "semantic gap". This thesis provides extensive examinations of the necessary requirements on traffic generation and microscopic traffic structures to enable the effective training and improvement of anomaly detection models. We first present and examine DetGen, a container-based traffic generation paradigm that enables precise control over, and ground-truth information about, the factors that shape traffic microstructures. The goal of DetGen is to provide researchers with extensive ground-truth information and enable the generation of customisable datasets with realistic structural diversity. DetGen was designed according to four specific traffic requirements that dataset generation needs to fulfil to enable machine-learning models to learn accurate and generalisable traffic representations. Current network intrusion datasets fail to meet these requirements, which we believe is one of the reasons for the limited success of anomaly-based detection methods. We demonstrate the significance of these requirements experimentally by examining how model performance decreases when they are not met. We then focus on the control and information over traffic microstructures that DetGen provides, and the corresponding benefits when examining and improving model failures during overall model development.
We use three metrics to demonstrate that DetGen is able to provide more control and isolation over the generated traffic. The ground-truth information DetGen provides enables us to probe two state-of-the-art traffic classifiers for failures on certain traffic structures, and the corresponding fixes in the model design almost halve the number of misclassifications. Drawing on these results, we propose CBAM, an anomaly detection model that detects network access attacks through deviations from reoccurring flow sequence patterns. CBAM is inspired by the design of self-supervised language models, and improves the AUC of the current state-of-the-art by up to 140%. By understanding why several flow sequence structures present difficulties to our model, we make targeted design decisions that address these difficulties and ultimately boost the model's performance. Lastly, we examine how the control and adversarial perturbation of traffic microstructures can be used by an attacker to evade detection. We show that in a stepping-stone attack, an attacker can evade every current detection model by mimicking the patterns observed in streaming services.
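The core idea behind CBAM — scoring flow sequences by how much they deviate from reoccurring patterns, in the style of a self-supervised language model — can be sketched with a toy analogue. CBAM itself is a neural sequence model; the bigram model, token names, and sequences below are purely illustrative assumptions:

```python
import math
from collections import Counter

# Toy "language model" over flow tokens (e.g., coarse flow-type labels):
# count transitions between consecutive flows in benign training sequences.
def train_bigrams(sequences):
    counts, totals = Counter(), Counter()
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            counts[(a, b)] += 1
            totals[a] += 1
    return counts, totals

def anomaly_score(seq, counts, totals, eps=1e-3):
    """Average negative log-probability per transition; higher = more anomalous."""
    nll = 0.0
    for a, b in zip(seq, seq[1:]):
        p = (counts[(a, b)] + eps) / (totals[a] + eps * 10)
        nll += -math.log(p)
    return nll / (len(seq) - 1)

# Benign traffic exhibits a reoccurring flow-sequence pattern.
benign = [["syn", "http", "http", "fin"]] * 20
model = train_bigrams(benign)

normal = anomaly_score(["syn", "http", "fin"], *model)
attack = anomaly_score(["syn", "ssh", "ssh", "fin"], *model)
# The attack sequence deviates from the learned patterns, so it scores higher.
assert attack > normal
```

This also hints at the evasion result in the final paragraph: an attacker who shapes their traffic to mimic a common benign pattern (e.g., streaming-like flow sequences) drives this score back down toward the normal range.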