Time-series Generation by Contrastive Imitation
Consider learning a generative model for time-series data. The sequential
setting poses a unique challenge: Not only should the generator capture the
conditional dynamics of (stepwise) transitions, but its open-loop rollouts
should also preserve the joint distribution of (multi-step) trajectories. On
one hand, autoregressive models trained by MLE allow learning and computing
explicit transition distributions, but suffer from compounding error during
rollouts. On the other hand, adversarial models based on GAN training alleviate
such exposure bias, but transitions are implicit and hard to assess. In this
work, we study a generative framework that seeks to combine the strengths of
both: Motivated by a moment-matching objective to mitigate compounding error,
we optimize a local (but forward-looking) transition policy, where the
reinforcement signal is provided by a global (but stepwise-decomposable) energy
model trained by contrastive estimation. At training, the two components are
learned cooperatively, avoiding the instabilities typical of adversarial
objectives. At inference, the learned policy serves as the generator for
iterative sampling, and the learned energy serves as a trajectory-level measure
for evaluating sample quality. By expressly training a policy to imitate
sequential behavior of time-series features in a dataset, this approach
embodies "generation by imitation". Theoretically, we illustrate the
correctness of this formulation and the consistency of the algorithm.
Empirically, we evaluate its ability to generate predictively useful samples
from real-world datasets, verifying that it performs at the standard of
existing benchmarks.
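The cooperative recipe described above — a transition policy shaped by an energy model that is itself fit by contrastive estimation — can be illustrated in miniature. The sketch below is not the paper's method: it uses an invented one-parameter energy E_w(x, x') = w·(x' − x)² and toy "real" vs. "noise" transitions, purely to show how logistic contrastive estimation drives the energy of realistic transitions down.

```python
import math
import random

random.seed(0)

def stable_sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

# Toy energy model (assumption, not the paper's): penalise large steps.
def energy(w, x, x_next):
    return w * (x_next - x) ** 2

def contrastive_grad(w, real, noise):
    """Gradient of the logistic loss classifying real vs. noise by -E."""
    g = 0.0
    for x, xn in real:
        p = stable_sigmoid(-energy(w, x, xn))      # P("real" | transition)
        g += (p - 1.0) * -((xn - x) ** 2)          # d(-log p)/dw
    for x, xn in noise:
        p = stable_sigmoid(-energy(w, x, xn))
        g += p * -((xn - x) ** 2)                  # d(-log(1-p))/dw
    return g / (len(real) + len(noise))

w = 0.0
for _ in range(200):
    # Real transitions take small steps; noise transitions take large jumps.
    real = [(x, x + random.gauss(0, 0.1))
            for x in (random.gauss(0, 1) for _ in range(32))]
    noise = [(x, x + random.gauss(0, 2.0))
             for x in (random.gauss(0, 1) for _ in range(32))]
    w -= 0.5 * contrastive_grad(w, real, noise)

# After training, the energy ranks a small (plausible) step below a big jump,
# so it can serve as a trajectory-level quality measure at inference time.
plausible = energy(w, 0.0, 0.1)
implausible = energy(w, 0.0, 2.0)
```

In the full framework the energy is stepwise-decomposable over a trajectory and the policy is optimised against it; this toy version only shows the contrastive-estimation half of that loop.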
Using GANs for Sharing Networked Time Series Data: Challenges, Initial Promise, and Open Questions
Limited data access is a longstanding barrier to data-driven research and
development in the networked systems community. In this work, we explore if and
how generative adversarial networks (GANs) can be used to incentivize data
sharing by enabling a generic framework for sharing synthetic datasets with
minimal expert knowledge. As a specific target, our focus in this paper is on
time series datasets with metadata (e.g., packet loss rate measurements with
corresponding ISPs). We identify key challenges of existing GAN approaches for
such workloads with respect to fidelity (e.g., long-term dependencies, complex
multidimensional relationships, mode collapse) and privacy (i.e., existing
guarantees are poorly understood and can sacrifice fidelity). To improve
fidelity, we design a custom workflow called DoppelGANger (DG) and demonstrate
that across diverse real-world datasets (e.g., bandwidth measurements, cluster
requests, web sessions) and use cases (e.g., structural characterization,
predictive modeling, algorithm comparison), DG achieves up to 43% better
fidelity than baseline models. Although we do not resolve the privacy problem
in this work, we identify fundamental challenges with both classical notions of
privacy and recent advances to improve the privacy properties of GANs, and
suggest a potential roadmap for addressing these challenges. By shedding light
on the promise and challenges, we hope our work can rekindle the conversation
on workflows for data sharing. (Published in IMC 2020.)
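The target workload — time series paired with metadata — can be made concrete with a toy example. Everything below is invented for illustration (the ISP labels and AR(1) dynamics are assumptions, not taken from the paper): the point is that each record couples categorical metadata to a series whose dynamics depend on that metadata, which is exactly the joint structure a generative model must reproduce with high fidelity.

```python
import random

random.seed(1)

# Hypothetical per-ISP dynamics: (persistence, noise scale) of an AR(1)
# process standing in for, e.g., packet loss rate measurements.
ISP_PARAMS = {
    "isp_a": (0.9, 0.05),  # smooth, slowly varying series
    "isp_b": (0.3, 0.30),  # bursty, noisy series
}

def sample_record(length=24):
    """Draw one (metadata, time series) record from the toy distribution."""
    isp = random.choice(sorted(ISP_PARAMS))
    phi, sigma = ISP_PARAMS[isp]
    x, series = 0.0, []
    for _ in range(length):
        x = phi * x + random.gauss(0, sigma)  # metadata-conditioned AR(1) step
        series.append(x)
    return {"metadata": {"isp": isp}, "series": series}

records = [sample_record() for _ in range(5)]
```

A generator for such data has to match both the metadata distribution and the metadata-conditioned dynamics, including long-range dependencies — the fidelity challenges the paper identifies.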
Traffic microstructures and network anomaly detection
Much hope has been put in the modelling of network traffic with machine learning methods to detect previously unseen attacks. Many methods rely on microscopic-level features such as packet sizes or interarrival times to identify recurring patterns and detect deviations from them. However, the success of these methods depends both on the quality of the corresponding training and evaluation data and on an understanding of the structures the methods learn. Currently, the academic community lacks both, with widely used synthetic datasets facing serious problems and the disconnect between methods and data being termed the "semantic gap".
This thesis provides extensive examinations of the requirements on traffic generation and microscopic traffic structures that must be met to enable the effective training and improvement of anomaly detection models. We first present and examine DetGen, a container-based traffic generation paradigm that enables precise control over, and ground-truth information about, the factors that shape traffic microstructures. The goal of DetGen is to provide researchers with extensive ground-truth information and enable the generation of customisable datasets with realistic structural diversity.
DetGen was designed according to four specific traffic requirements that dataset generation needs to fulfil to enable machine-learning models to learn accurate and generalisable traffic representations. Current network intrusion datasets fail to meet these requirements, which we believe is one of the reasons for the limited success of anomaly-based detection methods. We demonstrate the significance of these requirements experimentally by examining how model performance decreases when they are not met.
We then focus on the control and information over traffic microstructures that DetGen provides, and the corresponding benefits when examining and improving model failures during overall model development. We use three metrics to demonstrate that DetGen provides more control and isolation over the generated traffic. The ground-truth information DetGen provides enables us to probe two state-of-the-art traffic classifiers for failures on certain traffic structures, and the corresponding fixes in the model design almost halve the number of misclassifications.
Drawing on these results, we propose CBAM, an anomaly detection model that detects network access attacks through deviations from recurring flow sequence patterns. CBAM is inspired by the design of self-supervised language models, and improves the AUC of the current state of the art by up to 140%. By understanding why several flow sequence structures present difficulties to our model, we make targeted design decisions that address these difficulties and ultimately boost the model's performance.
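CBAM itself is a self-supervised, language-model-style network, but the underlying idea — score a flow sequence by how well it matches recurring patterns and flag low-likelihood sequences — can be sketched with a heavily simplified stand-in. The smoothed bigram model and the flow "tokens" below are hypothetical and are not CBAM's architecture:

```python
import math
from collections import Counter

class BigramFlowModel:
    """Toy likelihood model over discretised flow tokens (illustration only)."""

    def __init__(self, vocab):
        self.vocab = set(vocab)
        self.bigrams = Counter()
        self.unigrams = Counter()

    def fit(self, sequences):
        for seq in sequences:
            for a, b in zip(seq, seq[1:]):
                self.bigrams[(a, b)] += 1
                self.unigrams[a] += 1

    def log_likelihood(self, seq):
        V = len(self.vocab)
        ll = 0.0
        for a, b in zip(seq, seq[1:]):
            # Laplace smoothing: unseen transitions keep non-zero probability
            p = (self.bigrams[(a, b)] + 1) / (self.unigrams[a] + V)
            ll += math.log(p)
        return ll / max(len(seq) - 1, 1)  # per-transition average

# Hypothetical benign sessions: a DNS lookup followed by an HTTPS flow.
benign = [["dns", "https", "dns", "https"]] * 50
model = BigramFlowModel({"dns", "https", "ssh", "smb"})
model.fit(benign)

normal_score = model.log_likelihood(["dns", "https", "dns", "https"])
attack_score = model.log_likelihood(["ssh", "smb", "smb", "ssh"])
# A threshold on the per-transition score separates recurring benign
# patterns from an unfamiliar access-attack sequence.
is_anomalous = attack_score < normal_score
```

A neural sequence model such as CBAM replaces the bigram table with learned contextual representations, which is what lets it capture longer-range structure in flow sequences.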
Lastly, we examine how the control and adversarial perturbation of traffic microstructures can be used by an attacker to evade detection. We show that in a stepping-stone attack, an attacker can evade every current detection model by mimicking the patterns observed in streaming services.