359 research outputs found

    Outlier Detection In Big Data

    Get PDF
    The dissertation focuses on scaling outlier detection to work both on huge static as well as on dynamic streaming datasets. Outliers are patterns in the data that do not conform to the expected behavior. Outlier detection techniques are broadly applied in applications ranging from credit fraud prevention, network intrusion detection to stock investment tactical planning. For such mission critical applications, a timely response often is of paramount importance. Yet the processing of outlier detection requests is of high algorithmic complexity and resource consuming. In this dissertation we investigate the challenges of detecting outliers in big data -- in particular caused by the high velocity of streaming data, the big volume of static data and the large cardinality of the input parameter space for tuning outlier mining algorithms. Effective optimization techniques are proposed to assure the responsiveness of outlier detection in big data. In this dissertation we first propose a novel optimization framework called LEAP to continuously detect outliers over data streams. The continuous discovery of outliers is critical for a large range of online applications that monitor high volume continuously evolving streaming data. LEAP encompasses two general optimization principles that utilize the rarity of the outliers and the temporal priority relationships among stream data points. Leveraging these two principles LEAP not only is able to continuously deliver outliers with respect to a set of popular outlier models, but also provides near real-time support for processing powerful outlier analytics workloads composed of large numbers of outlier mining requests with various parameter settings. Second, we develop a distributed approach to efficiently detect outliers over massive-scale static data sets. In this big data era, as the volume of the data advances to new levels, the power of distributed compute clusters must be employed to detect outliers in a short turnaround time. In this research, our approach optimizes key factors determining the efficiency of distributed data analytics, namely, communication costs and load balancing. In particular we prove the traditional frequency-based load balancing assumption is not effective. We thus design a novel cost-driven data partitioning strategy that achieves load balancing. Furthermore, we abandon the traditional one detection algorithm for all compute nodes approach and instead propose a novel multi-tactic methodology which adaptively selects the most appropriate algorithm for each node based on the characteristics of the data partition assigned to it. Third, traditional outlier detection systems process each individual outlier detection request instantiated with a particular parameter setting one at a time. This is not only prohibitively time-consuming for large datasets, but also tedious for analysts as they explore the data to hone in on the most appropriate parameter setting or on the desired results. We thus design an interactive outlier exploration paradigm that is not only able to answer traditional outlier detection requests in near real-time, but also offers innovative outlier analytics tools to assist analysts to quickly extract, interpret and understand the outliers of interest. Our experimental studies including performance evaluation and user studies conducted on real world datasets including stock, sensor, moving object, and Geolocation datasets confirm both the effectiveness and efficiency of the proposed approaches

    Sampling Algorithms for Evolving Datasets

    Get PDF
    Perhaps the most flexible synopsis of a database is a uniform random sample of the data; such samples are widely used to speed up the processing of analytic queries and data-mining tasks, to enhance query optimization, and to facilitate information integration. Most of the existing work on database sampling focuses on how to create or exploit a random sample of a static database, that is, a database that does not change over time. The assumption of a static database, however, severely limits the applicability of these techniques in practice, where data is often not static but continuously evolving. In order to maintain the statistical validity of the sample, any changes to the database have to be appropriately reflected in the sample. In this thesis, we study efficient methods for incrementally maintaining a uniform random sample of the items in a dataset in the presence of an arbitrary sequence of insertions, updates, and deletions. We consider instances of the maintenance problem that arise when sampling from an evolving set, from an evolving multiset, from the distinct items in an evolving multiset, or from a sliding window over a data stream. Our algorithms completely avoid any accesses to the base data and can be several orders of magnitude faster than algorithms that do rely on such expensive accesses. The improved efficiency of our algorithms comes at virtually no cost: the resulting samples are provably uniform and only a small amount of auxiliary information is associated with the sample. We show that the auxiliary information not only facilitates efficient maintenance, but it can also be exploited to derive unbiased, low-variance estimators for counts, sums, averages, and the number of distinct items in the underlying dataset. In addition to sample maintenance, we discuss methods that greatly improve the flexibility of random sampling from a system's point of view. More specifically, we initiate the study of algorithms that resize a random sample upwards or downwards. Our resizing algorithms can be exploited to dynamically control the size of the sample when the dataset grows or shrinks; they facilitate resource management and help to avoid under- or oversized samples. Furthermore, in large-scale databases with data being distributed across several remote locations, it is usually infeasible to reconstruct the entire dataset for the purpose of sampling. To address this problem, we provide efficient algorithms that directly combine the local samples maintained at each location into a sample of the global dataset. We also consider a more general problem, where the global dataset is defined as an arbitrary set or multiset expression involving the local datasets, and provide efficient solutions based on hashing

    Probabilistic data types

    Get PDF
    Dissertação de mestrado integrado em Engenharia InformáticaConflict-Free Replicated Data Types (CRDTs) provide deterministic outcomes from concurrent executions. The conflict resolution mechanism uses information on the ordering of the last operations performed, which indicates if a given operation is known by a replica, typically using some variant of version vectors. This thesis will explore the construction of CRDTs that use a novel stochastic mechanism that can track with high accuracy knowledge of the occurrence of recently performed operations and with less accuracy for older operations. The aim is to obtain better scaling properties and avoid the use of metadata that is linear on the number of replicas.Conflict-Free Replicated Data Types (CRDTs) oferecem resultados determinísticos de execuções concorrentes. O mecanismo de resolução de conflitos usa informação sobre a ordenação das últimas operações realizadas, que indica se uma dada operação é conhecida por uma réplica, geralmente usando alguma variante de version vectors. Esta tese explorara a construção de CRDTs que utilizam um novo mecanismo estocástico que pode identificar com alta precisão o conhecimento sobre a ocorrência de operações realizadas recentemente e com menor precisão para operações mais antigas. O objetivo é a obtenção de melhores propriedades de escalabilidade e evitar o uso de metadados em quantidade linear em relação ao número de réplicas

    Streaming Algorithms for High Throughput Massive Datasets

    Get PDF
    The field of streaming algorithms has enjoyed a deal of focus from the theoretical computer science community over the last 20 years. Many great algorithms and mathematical results have been developed in this time, allowing for a broad class of functions to be computed and problems to be solved in the streaming model. In the same amount of time, the amount of data being generated by practical computer systems is simply staggering. In this thesis, we focus on solving problems in the streaming model that have a unified goal of being relevant to practical problems outside of the theory community. In terms of a common technical thread throughout this work, the theme here is an attention to runtime and the ability to handle large datasets that not only challenge in terms of memory available, but also in the throughput of the data and the speed at which the data must be processed. We provide these solutions in the form of both theoretical algorithm and practical systems, and demonstrate that using practice to drive theory, and vice versa, can generate powerful new approaches for difficult problems in the streaming model

    Improving Group Integrity of Tags in RFID Systems

    Get PDF
    Checking the integrity of groups containing radio frequency identification (RFID) tagged objects or recovering the tag identifiers of missing objects is important in many activities. Several autonomous checking methods have been proposed for increasing the capability of recovering missing tag identifiers without external systems. This has been achieved by treating a group of tag identifiers (IDs) as packet symbols encoded and decoded in a way similar to that in binary erasure channels (BECs). Redundant data are required to be written into the limited memory space of RFID tags in order to enable the decoding process. In this thesis, the group integrity of passive tags in RFID systems is specifically targeted, with novel mechanisms being proposed to improve upon the current state of the art. Due to the sparseness property of low density parity check (LDPC) codes and the mitigation of the progressive edge-growth (PEG) method for short cycles, the research is begun with the use of the PEG method in RFID systems to construct the parity check matrix of LDPC codes in order to increase the recovery capabilities with reduced memory consumption. It is shown that the PEG-based method achieves significant recovery enhancements compared to other methods with the same or less memory overheads. The decoding complexity of the PEG-based LDPC codes is optimised using an improved hybrid iterative/Gaussian decoding algorithm which includes an early stopping criterion. The relative complexities of the improved algorithm are extensively analysed and evaluated, both in terms of decoding time and the number of operations required. It is demonstrated that the improved algorithm considerably reduces the operational complexity and thus the time of the full Gaussian decoding algorithm for small to medium amounts of missing tags. The joint use of the two decoding components is also adapted in order to avoid the iterative decoding when the missing amount is larger than a threshold. The optimum value of the threshold value is investigated through empirical analysis. It is shown that the adaptive algorithm is very efficient in decreasing the average decoding time of the improved algorithm for large amounts of missing tags where the iterative decoding fails to recover any missing tag. The recovery performances of various short-length irregular PEG-based LDPC codes constructed with different variable degree sequences are analysed and evaluated. It is demonstrated that the irregular codes exhibit significant recovery enhancements compared to the regular ones in the region where the iterative decoding is successful. However, their performances are degraded in the region where the iterative decoding can recover some missing tags. Finally, a novel protocol called the Redundant Information Collection (RIC) protocol is designed to filter and collect redundant tag information. It is based on a Bloom filter (BF) that efficiently filters the redundant tag information at the tag’s side, thereby considerably decreasing the communication cost and consequently, the collection time. It is shown that the novel protocol outperforms existing possible solutions by saving from 37% to 84% of the collection time, which is nearly four times the lower bound. This characteristic makes the RIC protocol a promising candidate for collecting redundant tag information in the group integrity of tags in RFID systems and other similar ones

    Clustering information entities based on statistical methods

    Get PDF
    [no abstract
    corecore