
    A one-pass clustering based sketch method for network monitoring

    Network monitoring solutions need to cope with increasing network traffic volumes; as a result, sketch-based monitoring methods have been extensively studied to trade accuracy for memory scalability and storage reduction. However, sketches are sensitive to skewness in network flow distributions due to hash collisions, and they need complicated performance optimization to adapt to line-rate packet streams. We present Jellyfish, an efficient sketch method that performs one-pass clustering over the network stream. One-pass clustering is realized by adapting the monitoring granularity from the whole network flow to fragments called subflows, which not only reduces the ingestion rate but also provides an efficient intermediate representation for the input to the sketch. Jellyfish provides a network-flow-level query interface by reconstructing network-flow-level counters, merging subflow records that belong to the same network flow. We also provide a probabilistic analysis of the expected accuracy of both existing sketch methods and Jellyfish. Real-world trace-driven experiments show that Jellyfish reduces the average estimation error by up to six orders of magnitude for per-flow queries, by six orders of magnitude for entropy queries, and by up to ten times for heavy-hitter queries. This work was supported in part by the National Natural Science Foundation of China (NSFC) under Grant 61972409; in part by the Hong Kong Research Grants Council (RGC) under Grant TRS T41-603/20-R, Grant GRF-16213621, and Grant ITF ACCESS; in part by the Spanish I+D+i project TRAINER-A, funded by MCIN/AEI/10.13039/501100011033, under Grant PID2020-118011GB-C21; and in part by the Catalan Institution for Research and Advanced Studies (ICREA Academia).
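    To make the subflow idea concrete, here is a minimal, hypothetical sketch in Python: a plain count-min structure ingests counts at subflow granularity, and flow-level counters are reconstructed by merging the subflow estimates. The class and record names are illustrative assumptions, not Jellyfish's actual design.

```python
import hashlib
from collections import defaultdict

# A minimal count-min sketch; Jellyfish's real structure differs, this
# only illustrates the merge-by-flow-key idea described in the abstract.
class CountMin:
    def __init__(self, depth=4, width=1024):
        self.depth, self.width = depth, width
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, row, key):
        h = hashlib.blake2b(f"{row}:{key}".encode(), digest_size=8)
        return int.from_bytes(h.digest(), "big") % self.width

    def add(self, key, count=1):
        for r in range(self.depth):
            self.table[r][self._index(r, key)] += count

    def query(self, key):
        return min(self.table[r][self._index(r, key)] for r in range(self.depth))

# Hypothetical subflow records: (flow_id, subflow_id, packet_count).
subflows = [("f1", 0, 3), ("f1", 1, 5), ("f2", 0, 2)]

sketch = CountMin()
for flow, sub, pkts in subflows:
    sketch.add((flow, sub), pkts)          # ingest at subflow granularity

# Reconstruct flow-level counters by merging subflow estimates.
per_flow = defaultdict(int)
for flow, sub, _ in subflows:
    per_flow[flow] += sketch.query((flow, sub))
print(per_flow["f1"])  # 8 in the collision-free case
```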

    DUET: A Generic Framework for Finding Special Quadratic Elements in Data Streams

    Finding special items, such as heavy hitters, top-k, and persistent items, has long been a central problem in data stream processing for web analysis. Although data streams nowadays are usually high-dimensional, most prior works find special items along a single primary dimension and yield little insight into the correlations between dimensions. We therefore propose to find special quadratic elements, which reveal close correlations between dimensions. Building on the item types above, we extend the problem to three applications related to heavy hitters, top-k, and persistent items, and design a generic framework, DUET, to process them. We also analyze the error bound of our algorithm and conduct extensive experiments on four data sets. The results show that DUET achieves 3.5 times higher throughput and three orders of magnitude lower average relative error than cutting-edge algorithms.
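    As a rough illustration of the problem setting (not of DUET's data structure), the toy below treats each record of a two-dimensional stream as a (source, destination) pair, i.e., a quadratic element, and reports the pairs whose joint frequency exceeds a threshold. It uses exact counting where DUET would use a compact probabilistic summary.

```python
from collections import Counter

# A hypothetical two-dimensional stream: each record carries a source
# and a destination field. A "quadratic element" pairs values across
# the two dimensions; a heavy quadratic element has high joint frequency.
stream = [("a", "x"), ("a", "x"), ("b", "y"), ("a", "x"), ("b", "z")]

pair_counts = Counter(stream)  # exact counting, for illustration only
threshold = 0.4 * len(stream)
heavy_pairs = [p for p, c in pair_counts.items() if c >= threshold]
print(heavy_pairs)  # [('a', 'x')]: the only pair above 40% of the stream
```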

    Towards Scalable Network Traffic Measurement With Sketches

    Driven by the ever-increasing data volume through the Internet, the per-port speed of network devices has reached 400 Gbps, and high-end switches are capable of processing 25.6 Tbps of network traffic. To improve the efficiency and security of the network, network traffic measurement has become more important than ever. For fast and accurate traffic measurement, managing an accurate working set of active flows (WSAF) at line rates is a key challenge. The WSAF is usually located in high-speed but expensive memories, such as TCAM or SRAM, so its capacity is quite limited. To scale up per-flow measurement, we pursue three thrusts. In the first thrust, we propose to keep the WSAF in DRAM and place a compact data structure (i.e., a sketch) called FlowRegulator before the WSAF to compensate for DRAM's slow access time. Our results show that FlowRegulator substantially reduces massive influxes to the WSAF without compromising measurement accuracy. In the second thrust, we integrate our sketch into a network system and propose an SDN-based WLAN monitoring and management framework called RFlow+, which overcomes the limitations of existing traffic measurement solutions (e.g., OpenFlow and sFlow), such as a limited view, incomplete flow statistics, and a poor trade-off between measurement accuracy and CPU/network overheads. In the third thrust, we introduce a novel sampling scheme to address the poor trade-off provided by standard simple random sampling (SRS). Even though SRS is widely used in practice because of its simplicity, it yields non-uniform sampling rates for different flows because it samples packets over an aggregated data flow. Starting with the simple observation that independent per-flow packet sampling provides the most accurate estimation of each flow, we introduce the concept of per-flow systematic sampling, which aims to provide the same sampling rate across all flows. In addition, we provide a concrete sampling method called SketchFlow, which approximates per-flow systematic sampling using a sketch saturation event.
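    The contrast between SRS and per-flow systematic sampling can be sketched in a few lines of Python. In this hypothetical illustration, each flow keeps its own packet counter and every k-th packet of the flow is kept, so all flows are sampled at the same rate 1/k; SketchFlow's sketch-saturation approximation of this idea is not reproduced here.

```python
from collections import defaultdict
import random

K = 4  # target sampling period: each flow sampled at rate 1/K

def srs(packets, k=K):
    # Simple random sampling over the aggregate stream: small flows may
    # receive zero samples while large flows dominate.
    return [p for p in packets if random.random() < 1.0 / k]

def per_flow_systematic(packets, k=K):
    # Per-flow systematic sampling: every k-th packet *of each flow*,
    # giving the same 1/k rate to every flow regardless of its size.
    counters = defaultdict(int)
    out = []
    for flow_id, payload in packets:
        counters[flow_id] += 1
        if counters[flow_id] % k == 0:
            out.append((flow_id, payload))
    return out

packets = [("f1", i) for i in range(8)] + [("f2", i) for i in range(4)]
print(len(srs(packets)), "samples under SRS (varies run to run)")
print(per_flow_systematic(packets))  # exactly 2 samples of f1, 1 of f2
```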

    Accurate and Resource-Efficient Monitoring for Future Networks

    Monitoring functionality is a key component of any network management system. It is essential for profiling network resource usage, detecting attacks, and capturing the performance of the multitude of services using the network. Traditional monitoring solutions operate on long timescales, producing periodic reports that are mostly used for manual and infrequent network management tasks. These practices have recently been questioned by the advent of Software-Defined Networking (SDN). By empowering management applications with the right tools to perform automatic, frequent, and fine-grained network reconfigurations, SDN has made these applications more dependent than before on the accuracy and timeliness of monitoring reports. As a result, monitoring systems are required to collect considerable amounts of heterogeneous measurement data, process them in real time, and expose the resulting knowledge on short timescales to network decision-making processes. Satisfying these requirements is extremely challenging given today's larger network scales, massive and dynamic traffic volumes, and stringent constraints on time availability and hardware resources. This PhD thesis tackles this challenge by investigating how an accurate and resource-efficient monitoring function can be realised in the context of future, software-defined networks. It contributes novel monitoring methodologies, designs, and frameworks that scale with increasing network sizes and automatically adjust to changes in operating conditions, achieving efficient measurement collection and reporting, lightweight measurement-data processing, and timely delivery of monitoring knowledge.

    Sketching as a Tool for Efficient Networked Systems

    Today, computer systems need to cope with the explosive growth of data in the world. For instance, in data-center networks, monitoring systems are used to measure traffic statistics at high speed, and in financial technology companies, distributed processing systems are deployed to support graph analytics. To handle such large datasets, we usually build efficient networked systems in a distributed manner. Ideally, we expect these systems to meet service-level objectives (SLOs) using the least amount of resources. However, existing systems built with conventional in-memory algorithms face the following challenges: (1) excessive resource requirements (e.g., CPU, ASIC, and memory) at high cost; (2) infeasibility at larger scales; and (3) processing data too slowly to meet the objectives. To address these challenges, we propose sketching techniques as a tool for building more efficient networked systems. Sketching algorithms process the data in one or several passes in an online, streaming fashion (e.g., over a stream of network packets) and compute highly accurate results. With sketching, we maintain only a compact summary of the entire data and provide theoretical guarantees on error bounds. This dissertation argues for a sketching-based design for large-scale networked systems and demonstrates its benefits in three application contexts: (i) network monitoring: we build generic monitoring frameworks that support a range of applications on both software and hardware with universal sketches; (ii) graph pattern mining: we develop a swift, approximate graph pattern miner that scales to very large graphs by leveraging graph sketching techniques; (iii) halo finding in N-body simulations: we design scalable halo finders on CPU and GPU by leveraging sketch-based heavy-hitter algorithms.
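    The heavy-hitter primitive that recurs across these applications can be illustrated with the classic Misra-Gries summary, shown below as a textbook stand-in (it is neither the universal sketch nor the halo finder from the dissertation): one streaming pass, k-1 counters, and every item occurring more than len(stream)/k times is guaranteed to survive.

```python
def misra_gries(stream, k):
    """Classic Misra-Gries summary: finds every item occurring more than
    len(stream)/k times using only k-1 counters and one streaming pass."""
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # Decrement all counters; drop those that reach zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters  # candidate heavy hitters (counts are underestimates)

# Hypothetical halo-finding flavor: particles hashed to grid cells, so
# dense cells (halo candidates) are heavy hitters of the cell-id stream.
cells = ["c7"] * 6 + ["c2"] * 3 + ["c9", "c4", "c1"]
print(misra_gries(cells, k=3))  # 'c7' survives as the dominant cell
```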

    Efficient Algorithms to Compute Hierarchical Summaries from Big Data Streams

    Many data stream applications have hierarchical data containing time, geographic locations, product information, clickstreams, server logs, or IP addresses. A hierarchical summary of such voluminous data offers multiple advantages, including compactness, quick understanding, and abstraction. The goal of this thesis is to design algorithmic approaches for summarizing hierarchical data streams. First, the thesis provides a theoretical analysis of the benchmark hierarchical heavy hitter algorithms and uncovers their shortcomings, such as high worst-case memory requirements, expensive updates, and the coverage problem. To address these shortcomings, the thesis proposes efficient algorithms that offer deterministic estimation accuracy using O(η/ε) worst-case memory and O(η) worst-case time per item, where ε ∈ [0,1] is a user-defined parameter and η is a small constant derived from the data. The proposed hierarchical heavy hitter algorithms are shown to improve significantly over existing algorithms, both theoretically and empirically. Next, the thesis introduces a new concept called hierarchically correlated heavy hitters, which differs from existing hierarchical summarization techniques. The thesis gives a formal definition of the proposed concept and compares it with existing hierarchical summarization approaches, both at the definition level and empirically. It also proposes an efficient hierarchy-aware algorithm for computing hierarchically correlated heavy hitters. The proposed algorithm offers deterministic estimation accuracy using O(η/(ε_p ε_s)) worst-case memory and O(η) worst-case time per item, where η is as defined previously and ε_p ∈ [0,1], ε_s ∈ [0,1] are further user-defined parameters. Finally, the thesis proposes a special hierarchical data structure and an accompanying algorithm to summarize spatiotemporal data, which can be used to extract interesting and useful patterns from high-speed spatiotemporal data streams at multiple spatial and temporal granularities. Theoretical and empirical analyses show that the proposed data structure is very efficient with respect to data storage and query response: it updates a single item in O(1) time and answers a point query in O(1) time. Importantly, its memory requirement is independent of the size of the data and depends only on the user-supplied parameter vectors ψ and φ. In summary, this thesis provides a general framework, consisting of a set of algorithms and data structures, to compute hierarchical summaries of big data streams. All of the proposed algorithms exploit a lattice structure built from the hierarchical attributes of the data to compute different hierarchical summaries, which can address various data analytic issues in many emerging applications.
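    To make the hierarchy idea concrete, the toy below rolls an IPv4-address stream up its prefix hierarchy and reports prefixes whose aggregated count crosses a φ-fraction threshold. It uses exact counting and does not discount descendants' counts, so it only hints at the approximate, lattice-based algorithms the thesis actually proposes.

```python
from collections import Counter

def prefixes(ip):
    # Generalize an IPv4 address along its byte hierarchy:
    # 10.1.2.3 -> 10.1.2.3, 10.1.2.*, 10.1.*, 10.*
    parts = ip.split(".")
    return [".".join(parts[:i]) + (".*" if i < 4 else "")
            for i in range(4, 0, -1)]

def hierarchical_heavy_hitters(stream, phi):
    # Exact counting at every level of the hierarchy; a real HHH
    # algorithm does this approximately in small, bounded memory.
    counts = Counter()
    for ip in stream:
        counts.update(prefixes(ip))
    threshold = phi * len(stream)
    return {p: c for p, c in counts.items() if c >= threshold}

stream = ["10.1.2.3", "10.1.2.4", "10.1.9.9", "10.7.0.1"]
print(hierarchical_heavy_hitters(stream, phi=0.6))
# {'10.1.*': 3, '10.*': 4}: ancestors aggregate descendants' counts
```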