16 research outputs found

    Outlier Identification in Spatio-Temporal Processes

    Full text link
    This dissertation addresses statistical challenges arising in spatio-temporal data from Internet traffic, electricity grids, and climate models. It begins with methodological contributions to the problem of anomaly detection in communication networks. Using electricity consumption patterns from the University of Michigan campus, the well-known spatial prediction method kriging is adapted to identify false data injections into the system. Events such as Distributed Denial of Service (DDoS) attacks, botnet/malware attacks, and port scanning call for methods that can identify unusual activity in Internet traffic patterns. Storing information on the entire network, though feasible, cannot be done at the time scale at which data arrives. In this work, hashing techniques that produce summary statistics for the network are used. The hashed data so obtained preserves the heavy-tailed nature of traffic payloads, thereby providing a platform for applying extreme value theory (EVT) to identify heavy hitters in volumetric attacks. These EVT-based methods require estimating the tail index of a heavy-tailed distribution. The traditional estimators (Hill, 1975) of the tail index tend to be biased in the presence of outliers. To circumvent this issue, a trimmed version of the classic Hill estimator is proposed and studied from a theoretical perspective. For the Pareto domain of attraction, the optimality and asymptotic normality of the estimator are established. Additionally, a data-driven strategy to detect the number of extreme outliers in heavy-tailed data is presented. The dissertation concludes with the statistical formulation of m-year return levels of extreme climatic events (heat/cold waves). The Generalized Pareto distribution (GPD) serves as a good fit for modeling peaks over the threshold of a distribution.
Allowing the parameters of the GPD to vary as a function of covariates such as time of year, El Niño, and location in the US, extremes of the areal impact of heat waves are modeled and inferred.
    PhD, Statistics, University of Michigan, Horace H. Rackham School of Graduate Studies
    https://deepblue.lib.umich.edu/bitstream/2027.42/145789/1/shrijita_1.pd
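The trimming idea in the abstract above can be sketched in a few lines: discard the k0 largest order statistics (the suspected outliers) before averaging log-spacings in the Hill estimator. A minimal NumPy illustration under stated assumptions; the function name is mine, and the dissertation's actual estimator includes corrections this sketch omits:

```python
import numpy as np

def trimmed_hill(x, k, k0=0):
    """Sketch of a trimmed Hill estimator of the tail index alpha.

    Averages log-spacings of the descending order statistics X_(k0+1)..X_(k)
    relative to X_(k+1), discarding the k0 largest observations (suspected
    outliers). Simplified illustration, not the dissertation's estimator.
    """
    xs = np.sort(x)[::-1]                      # descending order statistics
    logs = np.log(xs[k0:k]) - np.log(xs[k])    # log(X_(i) / X_(k+1))
    return (k - k0) / logs.sum()               # reciprocal mean log-spacing

# Pareto(alpha=2) sample contaminated with 10 enormous outliers: the
# untrimmed estimate is dragged down, while trimming recovers alpha.
```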

    Machine learning and privacy preserving algorithms for spatial and temporal sensing

    Get PDF
    Sensors for physical and social environments are ubiquitous in modern mobile phones, IoT devices, and infrastructure-based settings. Information embedded in such data, especially its time and location attributes, has unprecedented potential to characterize individual and crowd behaviour and natural and technological processes. However, it is challenging to extract abstract knowledge from the data due to its massive size, sequential structure, asynchronous operation, noisy characteristics, privacy concerns, and real-time analysis requirements. Therefore, the primary goal of this thesis is to propose theoretically grounded and practically useful algorithms to learn from the location and time stamps in sensor data. The proposed methods are inspired by tools from geometry, topology, and statistics. They leverage structure in the temporal and spatial data by probabilistically modeling noise, exploring the embedded topological structures, and utilizing statistical structure to protect personal information while simultaneously learning aggregate information. The proposed algorithms are geared towards streaming and distributed operation for efficiency. The usefulness of the methods is argued using mathematical analysis and empirical experiments on real and artificial datasets.
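One classical way to protect personal information while still learning an aggregate, as the abstract describes, is randomized response: each user perturbs their own bit locally, and the analyst unbiases the observed frequency. A hedged sketch; this particular mechanism is illustrative and not necessarily the one used in the thesis:

```python
import random

def randomized_response(bit, p=0.75, rng=random):
    """Report the true bit with probability p, otherwise flip it.
    No individual report reveals the true bit with certainty."""
    return bit if rng.random() < p else 1 - bit

def estimate_rate(reports, p=0.75):
    """Unbias the observed frequency: E[report] = p*q + (1-p)*(1-q),
    so q = (observed - (1-p)) / (2p - 1)."""
    observed = sum(reports) / len(reports)
    return (observed - (1 - p)) / (2 * p - 1)
```

With enough users, the aggregate rate is recovered accurately even though every individual report is deniable.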

    A framework for the dynamic management of Peer-to-Peer overlays

    Get PDF
    Peer-to-Peer (P2P) applications have been associated with inefficient operation, interference with other network services, and large operational costs for network providers. This thesis presents a framework which can help ISPs address these issues by means of intelligent management of peer behaviour. The proposed approach involves limited control of P2P overlays without interfering with the fundamental characteristics of peer autonomy and decentralised operation. At the core of the management framework lies the Active Virtual Peer (AVP). Essentially intelligent peers operated by the network providers, the AVPs interact with the overlay from within, minimising redundant or inefficient traffic, enhancing overlay stability and facilitating the efficient and balanced use of available peer and network resources. They offer an "insider's" view of the overlay and permit the management of P2P functions in a compatible and non-intrusive manner. AVPs can support multiple P2P protocols and coordinate to perform functions collectively. To account for the multi-faceted nature of P2P applications and allow the incorporation of modern techniques and protocols as they appear, the framework is based on a modular architecture. Core modules for overlay control and transit traffic minimisation are presented. Towards the latter, a number of suitable P2P content caching strategies are proposed. Using a purpose-built P2P network simulator and small-scale experiments, it is demonstrated that the introduction of AVPs inside the network can significantly reduce inter-AS traffic, minimise costly multi-hop flows, increase overlay stability and load-balancing, and offer improved peer transfer performance.
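A content cache at an AVP-like vantage point could reduce inter-AS transit along these lines: serve popular chunks from inside the AS so repeated requests do not cross the transit link. A toy sketch assuming a simple LRU policy; the thesis proposes and compares several caching strategies, not necessarily this one:

```python
from collections import OrderedDict

class TransitCache:
    """Hypothetical AVP-side chunk cache with LRU eviction.
    Counts how many requests had to be fetched over the transit link."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.store = OrderedDict()
        self.transit_fetches = 0

    def request(self, chunk_id):
        if chunk_id in self.store:
            self.store.move_to_end(chunk_id)   # hit: served locally
            return "local"
        self.transit_fetches += 1              # miss: fetch over transit
        self.store[chunk_id] = True
        if len(self.store) > self.capacity:
            self.store.popitem(last=False)     # evict least recently used
        return "transit"
```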

    Building and evaluating privacy-preserving data processing systems

    Get PDF
    Large-scale data processing prompts a number of important challenges, including guaranteeing that collected or published data is not misused, preventing disclosure of sensitive information, and deploying privacy protection frameworks that support usable and scalable services. In this dissertation, we study and build systems geared towards privacy-friendly data processing, enabling computational scenarios and applications where potentially sensitive data can be used to extract useful knowledge, and which would be impossible without such strong privacy guarantees. For instance, we show how to privately and efficiently aggregate data from many sources and large streams, and how to use the aggregates to extract useful statistics and train simple machine learning models. We also present a novel technique for privately releasing generative machine learning models and entire high-dimensional datasets produced by these models. Finally, we demonstrate that the data used by participants in training generative and collaborative learning models may be vulnerable to inference attacks, and discuss possible mitigation strategies.
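Private aggregation of the kind described can be illustrated with the Laplace mechanism: clip each contribution to bound its sensitivity, then add noise calibrated to that bound before releasing the sum. A sketch only; the dissertation's multi-source aggregation protocols are more elaborate than this single-aggregator toy:

```python
import numpy as np

def private_sum(values, epsilon, clip, rng):
    """Differentially private sum (illustrative sketch).

    Clipping each value to [0, clip] bounds any one user's influence
    (sensitivity) by `clip`; adding Laplace(clip/epsilon) noise then
    gives epsilon-differential privacy for the released total.
    """
    clipped = np.clip(values, 0.0, clip)
    return clipped.sum() + rng.laplace(scale=clip / epsilon)
```

The released total is useful for statistics or simple model training while masking any single contribution.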

    Sixth Biennial Report: August 2001 – May 2003

    No full text

    Network-Wide Monitoring And Debugging

    Get PDF
    Modern networks can encompass over 100,000 servers. Managing such an extensive network with a diverse set of network policies has become more complicated with the introduction of programmable hardware and distributed network functions. Furthermore, service level agreements (SLAs) require operators to maintain high performance and availability with low latencies. Therefore, it is crucial for operators to resolve any issues in networks quickly. Problems can occur at any layer of the stack: network (load imbalance), data-plane (incorrect packet processing), control-plane (bugs in configuration), and the coordination among them. Unfortunately, existing debugging tools are not sufficient to monitor, analyze, or debug modern networks; either they lack visibility into the network, require manual analysis, or cannot check for some properties. These limitations arise from an outdated view of networks, i.e., that we can look at a single component in isolation. In this thesis, we describe a new approach that measures, understands, and debugs the network across devices and time. We also target modern stateful packet processing devices (programmable data-planes and distributed network functions), as these are becoming an increasingly common part of the network. Our key insight is to leverage both in-network packet processing (to collect precise measurements) and out-of-network processing (to coordinate measurements and scale analytics). The resulting systems we design based on this approach can support testing and monitoring at data center scale, and can handle stateful data in the network. We automate the collection and analysis of measurement data to save operator time and take a step towards self-driving networks.
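The split between in-network collection and out-of-network analysis can be pictured as per-switch flow counters merged by a central analyzer that flags anomalies such as load imbalance. A deliberately simplified sketch; the function names and the 2x-mean threshold are mine, not the thesis's:

```python
from collections import Counter

def merge_reports(reports):
    """Combine per-switch flow counters (in-network collection)
    into one network-wide view (out-of-network analysis)."""
    total = Counter()
    for counts in reports:
        total.update(counts)
    return total

def imbalanced_links(total, links):
    """Flag links carrying more than twice the mean load
    (illustrative threshold, not from the thesis)."""
    mean = sum(total[l] for l in links) / len(links)
    return [l for l in links if total[l] > 2 * mean]
```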

    LIPIcs, Volume 251, ITCS 2023, Complete Volume

    Get PDF
    LIPIcs, Volume 251, ITCS 2023, Complete Volume

    Online learning on the programmable dataplane

    Get PDF
    This thesis makes the case for managing computer networks with data-driven methods (automated statistical inference and control based on measurement data and runtime observations), and argues for their tight integration with programmable dataplane hardware to make management decisions faster and from more precise data. Optimisation, defence, and measurement of networked infrastructure are each challenging tasks in their own right, and are currently dominated by hand-crafted heuristic methods. These become harder to reason about and deploy as networks scale in rates and number of forwarding elements, while their design requires expert knowledge and care around unexpected protocol interactions. This makes tailored, per-deployment or per-workload solutions infeasible to develop. Recent advances in machine learning offer capable function approximation and closed-loop control which suit many of these tasks. New, programmable dataplane hardware enables more agility in the network: runtime reprogrammability, precise traffic measurement, and low-latency on-path processing. The synthesis of these two developments allows complex decisions to be made on previously unusable state, and made quicker by offloading inference to the network. To justify this argument, I advance the state of the art in data-driven defence of networks, novel dataplane-friendly online reinforcement learning algorithms, and in-network data reduction to allow classification of switch-scale data. Each requires co-design aware of the network, and of the failure modes of systems and carried traffic. To make online learning possible in the dataplane, I use fixed-point arithmetic and modify classical (non-neural) approaches to take advantage of the SmartNIC compute model and make use of rich device-local state. I show that data-driven solutions still require great care to design correctly, but with the right domain expertise they can improve on pathological cases in DDoS defence, such as protecting legitimate UDP traffic. In-network aggregation into histograms is shown to enable accurate classification from fine temporal effects, and allows hosts to scale such classification to far larger flow counts and traffic volumes. Moving reinforcement learning to the dataplane is shown to offer substantial benefits in state-action latency and online learning throughput versus host machines, allowing policies to react faster to fine-grained network events. The dataplane environment is key in making reactive online learning feasible; to port further algorithms and learnt functions, I collate and analyse the strengths of current and future hardware designs, as well as individual algorithms.
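Fixed-point online learning of the kind mentioned can be illustrated with a single Q-learning update done entirely in integer arithmetic, as a dataplane target without floating point might implement it (multiplies and shifts in place of fractional coefficients). A hypothetical sketch, not the thesis's algorithm:

```python
SHIFT = 8              # Q8.8 fixed point: stored value = real value * 256
ONE = 1 << SHIFT

def to_fixed(x):
    """Convert a float to Q8.8 fixed-point representation."""
    return int(round(x * ONE))

def q_update_fixed(q_sa, reward, q_next_max,
                   alpha_num=1, alpha_shift=3,    # alpha = 1/8
                   gamma_num=7, gamma_shift=3):   # gamma = 7/8
    """One Q-learning update, Q(s,a) += alpha*(r + gamma*max Q(s',.) - Q(s,a)),
    using only integer multiplies and right-shifts. Illustrative only."""
    target = reward + ((gamma_num * q_next_max) >> gamma_shift)
    td_error = target - q_sa
    return q_sa + ((alpha_num * td_error) >> alpha_shift)
```

With Q(s,a)=1.0, r=0.5, and max Q(s',.)=1.0 in Q8.8, the integer result matches the float computation 1 + (1.375 - 1)/8 = 1.046875 exactly (268/256), since these coefficients are exact powers-of-two fractions.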