525 research outputs found

    Graph Sample and Hold: A Framework for Big-Graph Analytics

    Full text link
    Sampling is a standard approach in big-graph analytics; the goal is to efficiently estimate graph properties by consulting a sample of the whole population. A perfect sample is assumed to mirror every property of the whole population. Unfortunately, such a perfect sample is hard to collect in complex populations such as graphs (e.g., web graphs, social networks), where an underlying network connects the units of the population. Therefore, a good sample will be representative in the sense that graph properties of interest can be estimated with a known degree of accuracy. While previous work focused particularly on sampling schemes used to estimate certain graph properties (e.g., triangle counts), much less is known about the case where we need to estimate various graph properties with the same sampling scheme. In this paper, we propose a generic stream sampling framework for big-graph analytics, called Graph Sample and Hold (gSH). The proposed framework samples from massive graphs sequentially in a single pass, one edge at a time, while maintaining a small state. We then show how to produce unbiased estimators for various graph properties from the sample. Given that the graph analysis algorithms run on a sample instead of the whole population, the runtime complexity of these algorithms is kept under control. Moreover, given that the estimators of graph properties are unbiased, the approximation error is kept under control. Finally, we show the performance of the proposed framework (gSH) on various types of graphs, such as social graphs, among others.
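    A minimal sketch of the sample-and-hold idea described above, assuming a two-probability variant: an edge with no previously sampled endpoint is kept with a low probability p, any other edge with a higher probability q, and edge totals are then estimated Horvitz-Thompson style from the recorded inclusion probabilities. The function names and parameter values are illustrative, not the authors' gSH implementation.

        import random

        def gsh_sample(edge_stream, p=0.05, q=0.8, seed=42):
            """Simplified sketch of graph sample-and-hold over an edge stream.

            An arriving edge is kept with probability p if neither endpoint is
            attached to an already-sampled edge, and with probability q otherwise.
            Each kept edge is stored with the probability it was sampled with,
            so Horvitz-Thompson style unbiased estimates can be formed afterwards.
            Assumes each edge appears once in the stream.
            """
            rng = random.Random(seed)
            sampled = {}      # edge -> probability it was sampled with
            touched = set()   # endpoints of already-sampled edges
            for u, v in edge_stream:
                prob = q if (u in touched or v in touched) else p
                if rng.random() < prob:
                    sampled[(u, v)] = prob
                    touched.update((u, v))
            return sampled

        def estimate_edge_count(sampled):
            # Horvitz-Thompson: each sampled edge contributes 1 / Pr[it was sampled].
            return sum(1.0 / prob for prob in sampled.values())

        stream = [(1, 2), (2, 3), (3, 4), (1, 4), (2, 4), (4, 5)]
        print(estimate_edge_count(gsh_sample(stream)))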

    Get the Most out of Your Sample: Optimal Unbiased Estimators using Partial Information

    Full text link
    Random sampling is an essential tool in the processing and transmission of data. It is used to summarize data too large to store or manipulate and to meet resource constraints on bandwidth or battery power. Estimators that are applied to the sample facilitate fast approximate processing of queries posed over the original data, and the value of the sample hinges on the quality of these estimators. Our work targets data sets such as request and traffic logs and sensor measurements, where data is repeatedly collected over multiple instances: time periods, locations, or snapshots. We are interested in queries that span multiple instances, such as distinct counts and distance measures over selected records. These queries are used for applications ranging from planning to anomaly and change detection. Unbiased low-variance estimators are particularly effective, as the relative error decreases with the number of selected record keys. The Horvitz-Thompson estimator, known to minimize variance for sampling with "all or nothing" outcomes (which reveal either the exact value or no information about the estimated quantity), is not optimal for multi-instance operations, for which an outcome may provide partial information. We present a general, principled methodology for the derivation of (Pareto) optimal unbiased estimators over sampled instances and aim to understand its potential. We demonstrate significant improvement in estimate accuracy of fundamental queries for common sampling schemes. Comment: This is a full version of a PODS 2011 paper.
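    For context, the Horvitz-Thompson baseline mentioned above weights each sampled record by the inverse of its inclusion probability, which makes the estimate of a population sum unbiased. A toy illustration, with values and probabilities invented for the example rather than taken from the paper:

        # Horvitz-Thompson estimate of a population sum from a probability sample.
        # The values and inclusion probabilities below are made up for illustration.
        population = {"a": 10.0, "b": 3.0, "c": 7.0, "d": 1.0}

        # Suppose keys "a" and "c" were sampled, with known inclusion probabilities.
        sample = {"a": 0.5, "c": 0.25}   # key -> Pr[key is included in the sample]

        ht_estimate = sum(population[key] / pi for key, pi in sample.items())
        print(ht_estimate)   # 10/0.5 + 7/0.25 = 48.0; unbiased over repeated sampling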

    Data-driven design of intelligent wireless networks: an overview and tutorial

    Get PDF
    Data science or "data-driven research" is a research approach that uses real-life data to gain insight into the behavior of systems. It enables the analysis of small and simple as well as large and more complex systems in order to assess whether they function according to the intended design and as seen in simulation. Data science approaches have been successfully applied to analyze networked interactions in several research areas, such as large-scale social networks and advanced business and healthcare processes. Wireless networks can exhibit unpredictable interactions between algorithms from multiple protocol layers, interactions between multiple devices, and hardware-specific influences. These interactions can lead to a difference between real-world functioning and design-time functioning. Data science methods can help to detect the actual behavior and possibly help to correct it. Data science is increasingly used in wireless research. To support data-driven research in wireless networks, this paper illustrates the step-by-step methodology that has to be applied to extract knowledge from raw data traces. To this end, the paper (i) clarifies when, why and how to use data science in wireless network research; (ii) provides a generic framework for applying data science in wireless networks; (iii) gives an overview of existing research papers that utilized data science approaches in wireless networks; (iv) illustrates the overall knowledge discovery process through an extensive example in which device types are identified based on their traffic patterns; and (v) provides the reader with the necessary datasets and scripts to go through the tutorial steps themselves.
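    The device-identification example of step (iv) can be loosely approximated by training a standard classifier on per-device traffic features; the features, labels and model below are invented for illustration and are not the paper's exact pipeline.

        # Hypothetical sketch: classify device types from per-device traffic features.
        # The feature names, labels and model choice are illustrative assumptions,
        # not the exact pipeline used in the tutorial.
        from sklearn.ensemble import RandomForestClassifier

        # Each row: [mean packet size (B), packets/s, mean inter-arrival (ms), active fraction]
        X = [
            [120,  0.5, 2000, 0.05],   # sensor-like traffic
            [110,  0.4, 2500, 0.04],
            [900, 80.0,   12, 0.70],   # video/streaming-like traffic
            [850, 75.0,   15, 0.65],
            [400, 10.0,  100, 0.30],   # interactive-like traffic
            [420, 12.0,   90, 0.35],
        ]
        y = ["sensor", "sensor", "streaming", "streaming", "interactive", "interactive"]

        clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
        print(clf.predict([[880, 70.0, 14, 0.68]]))   # expected: ['streaming']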

    Stream Sampling for Frequency Cap Statistics

    Full text link
    Unaggregated data, in streamed or distributed form, is prevalent and comes from diverse application domains, including interactions of users with web services and IP traffic. Data elements have keys (cookies, users, queries) and elements with different keys interleave. Analytics on such data typically utilizes statistics stated in terms of the frequencies of keys. The two most common statistics are distinct, which is the number of active keys in a specified segment, and sum, which is the sum of the frequencies of keys in the segment. Both are special cases of cap statistics, defined as the sum of frequencies capped by a parameter T, which are popular in online advertising platforms. Aggregation by key, however, is costly, requiring state proportional to the number of distinct keys, and therefore we are interested in estimating these statistics, or more generally sampling the data, without aggregation. We present a sampling framework for unaggregated data that uses a single pass (for streams) or two passes (for distributed data) and state proportional to the desired sample size. Our design provides the first effective solution for general frequency cap statistics. Our ℓ-capped samples provide estimates with tight statistical guarantees for cap statistics with T = Θ(ℓ) and nonnegative unbiased estimates of any monotone non-decreasing frequency statistic. An added benefit of our unified design is facilitating multi-objective samples, which provide estimates with statistical guarantees for a specified set of different statistics, using a single, smaller sample. Comment: 21 pages, 4 figures; a preliminary version will appear in KDD 201
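    To make the notion concrete: for a cap parameter T, the cap statistic is the sum over keys of min(frequency, T), so T = 1 recovers the distinct count and a very large T recovers the plain sum. The toy snippet below computes it exactly by aggregating per key, which is precisely the costly step the paper's ℓ-capped sampling avoids; the sampling scheme itself is not reproduced here.

        from collections import Counter

        def cap_statistic(keys, T):
            """Exact cap statistic: sum over keys of min(frequency, T).

            T = 1 gives the distinct count; a very large T gives the plain
            sum of frequencies.  Computing it exactly requires aggregating
            by key, which is the costly step the paper's sampling avoids.
            """
            freq = Counter(keys)
            return sum(min(count, T) for count in freq.values())

        stream = ["u1", "u2", "u1", "u3", "u1", "u2"]
        print(cap_statistic(stream, T=1))    # 3  (distinct keys)
        print(cap_statistic(stream, T=2))    # 5  (= min(3,2) + min(2,2) + min(1,2))
        print(cap_statistic(stream, T=10))   # 6  (total frequency)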

    Privacy Violation and Detection Using Pattern Mining Techniques

    Get PDF
    Privacy, its violations, and techniques to prevent such violations have grabbed centre stage in both academia and industry in recent months. Corporations worldwide have become conscious of the implications of privacy violations and their impact on them and on other stakeholders. Moreover, nations across the world are coming out with privacy-protecting legislation to prevent data privacy violations. Such legislation, however, exposes organizations to the issues of intentional or unintentional violation of private data. A violation by either malicious external hackers or internal employees can expose organizations to costly litigation. In this paper, we propose PRIVDAM, a data-mining-based intelligent architecture for a Privacy Violation Detection and Monitoring system whose purpose is to detect possible privacy violations and to prevent them in the future. Experimental evaluations show that our approach is scalable and robust, and that it can detect privacy violations or chances of violations quite accurately. Please contact the author for the full text at [email protected]

    Improving video streaming experience through network measurements and analysis

    Get PDF
    Multimedia traffic dominates today's Internet. In particular, the most prevalent traffic carried over wired and wireless networks is video. The most popular streaming providers (e.g. Netflix, YouTube) utilise HTTP adaptive streaming (HAS) for video content delivery to end-users. The power of HAS lies in the ability to change video quality in real time depending on the current state of the network (i.e. available network resources). The main goal of HAS algorithms is to maximise video quality while minimising re-buffering events and switching between different qualities. However, these requirements conflict, so striking a perfect blend is challenging, as there is no single widely accepted metric that captures user experience based on the aforementioned requirements. In recent years, researchers have put a lot of effort into designing subjectively validated metrics that map the quality, re-buffering and switching behaviour of HAS players to the overall user experience (i.e. video QoE). This thesis demonstrates how data analysis can contribute to improving video QoE.

    One of the main characteristics of mobile networks is frequent throughput fluctuation. Various underlying factors contribute to this behaviour, including rapid changes in radio channel conditions, system load, and the interaction between feedback loops at different time scales. These fluctuations highlight the challenge of achieving a high video user experience. In this thesis, we tackle this issue by exploring the possibility of throughput prediction in cellular networks. The need for better throughput prediction comes from data-based evidence that standard throughput estimation techniques (e.g. the exponential moving average) exhibit low prediction accuracy. Cellular networks deploy opportunistic scheduling algorithms (e.g. proportional fair) for resource allocation among mobile users/devices. These algorithms take into account a user's physical-layer information together with throughput demand. While the algorithm itself is proprietary to the manufacturer, physical-layer and throughput information are exchanged between devices and base stations. The availability of this information allows for a data-driven approach to throughput prediction. This thesis utilises a machine-learning approach to predict the available throughput based on measurements in the near past. As a result, a prediction accuracy with an error of less than 15% in 90% of samples is achieved. Adding information from other devices served by the same base station (network-based information) further improves accuracy while lessening the need for a long history (i.e. how far to look into the past). Finally, the throughput prediction technique is incorporated into state-of-the-art HAS algorithms. The approach is validated in a commercial cellular network and on a stock mobile device. As a result, better throughput prediction helps improve user experience by up to 33%, while reducing re-buffering events by up to 85%.

    In contrast to wireless networks, the channel characteristics of the wired medium are more stable, resulting in less prominent throughput variations. However, all traffic traverses network queues (i.e. at a router or switch), unlike in cellular networks where each user gets a dedicated queue at the base station. Furthermore, network operators usually deploy a simple first-in-first-out queuing discipline at these queues. As a result, traffic can experience excessive delays due to the large queue sizes usually deployed in order to minimise packet loss and maximise throughput. This effect, also known as bufferbloat, negatively impacts delay-sensitive applications such as web browsing and voice. While there exist guidelines for modelling queue size, there is no work analysing its impact on video streaming traffic generated by multiple users. To answer this question, the performance of multiple video clients sharing a bottleneck link is analysed. Moreover, the analysis is extended to a realistic case that includes heterogeneous round-trip times (RTTs) and traffic (i.e. web browsing). Based on the experimental results, a simple two-queue discipline that takes application characteristics into account is proposed for scheduling heterogeneous traffic. Compared to the state-of-the-art Active Queue Management (AQM) discipline Controlled Delay (CoDel), the proposed discipline decreases the median Page Loading Time (PLT) of web traffic by up to 80%, with no significant negative impact on video QoE.
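    As a loose illustration of the throughput-prediction step, the sketch below trains a standard regressor to predict the next throughput sample from a sliding window of recent measurements on a synthetic trace. The trace, window length and regressor are assumptions made for the example; the thesis additionally exploits physical-layer and network-side information.

        # Hypothetical sketch: predict the next throughput sample from a sliding
        # window of recent measurements.  The synthetic trace, window length and
        # regressor are illustrative assumptions; the thesis additionally exploits
        # physical-layer and network-side information.
        import numpy as np
        from sklearn.ensemble import RandomForestRegressor

        rng = np.random.default_rng(0)
        t = np.arange(600)
        throughput = 20 + 10 * np.sin(t / 30) + rng.normal(0, 2, t.size)   # Mbit/s, synthetic

        W = 8   # number of past samples used as features
        X = np.array([throughput[i - W:i] for i in range(W, t.size)])
        y = throughput[W:]

        split = int(0.8 * len(y))   # train on the first 80% of the trace
        model = RandomForestRegressor(n_estimators=100, random_state=0)
        model.fit(X[:split], y[:split])

        rel_err = np.abs(model.predict(X[split:]) - y[split:]) / np.abs(y[split:])
        print(f"90th-percentile relative error: {np.percentile(rel_err, 90):.2%}")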