
    BIG DATA ANALYTICS METHODS USING GPU : A COMPREHENSIVE SURVEY

    Big data analytics is the discovery of knowledge from large data sets, ultimately leading to business benefits. Its biggest challenge is the ability to provide information within a reasonable time. Traditional analytics methods may fail to produce results efficiently when the data handled is large. To enhance performance, researchers have incorporated the Graphics Processing Unit (GPU) into big data processing. The GPU delivers high performance through its many-core parallel architecture. This paper investigates methods of integrating the GPU into big data analytics that deliver high performance compared to conventional schemes.
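As an illustration of the kind of workload such surveys target, data-parallel array operations map directly onto the GPU's many cores. A minimal sketch, using NumPy as a CPU stand-in for a GPU array library (CuPy mirrors the NumPy API, so swapping the import is often all that is needed to target the device; this is illustrative only, not code from any surveyed paper):

```python
import numpy as xp  # with a CUDA GPU and CuPy installed: import cupy as xp

# An element-wise transform followed by a reduction: every element is
# processed independently, which is the pattern GPU cores accelerate.
data = xp.arange(1_000_000, dtype=xp.float64)
result = float(xp.sqrt(data).sum())
print(result)
```

Because the two libraries share an interface, the same analytics code can be benchmarked on CPU and GPU without rewriting the algorithm.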

    Efficient Approximate Big Data Clustering: Distributed and Parallel Algorithms in the Spectrum of IoT Architectures

    Clustering, the task of grouping together similar items, is a frequently used method for processing data, with numerous applications. Clustering the data generated by sensors in the Internet of Things (IoT), for instance, can be useful for monitoring and making control decisions. For example, a cyber-physical environment can be monitored by one or more 3D laser-based sensors to detect the objects in that environment and avoid critical situations, e.g. collisions. With the advancements in IoT-based systems, the volume of data produced by typically high-rate sensors has become immense. For example, a 3D laser-based sensor with a spinning head can produce hundreds of thousands of points per second. Clustering such a large volume of data with conventional clustering methods takes too long, violating the time-sensitivity requirements of applications that rely on the clustering outcome. For example, collisions in a cyber-physical environment must be prevented as quickly as possible. The thesis contributes efficient clustering methods for distributed and parallel computing architectures, representative of the processing environments in IoT-based systems. To that end, it proposes MAD-C (abbreviating Multi-stage Approximate Distributed Cluster-Combining) and PARMA-CC (abbreviating Parallel Multiphase Approximate Cluster Combining). MAD-C is a method for distributed approximate data clustering. It employs an approximation-based data synopsis that drastically lowers the required communication bandwidth among the distributed nodes and achieves multiplicative savings in computation time compared to a baseline that gathers and clusters the data centrally. PARMA-CC is a method for parallel approximate data clustering on multi-cores. Also employing an approximation-based data synopsis, PARMA-CC achieves scalability on multi-cores by increasing the synergy between the work-sharing procedure and the data structures, facilitating highly parallel execution of threads. The thesis provides analytical and empirical evaluation of both MAD-C and PARMA-CC.
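To make the synopsis idea concrete, the sketch below is a hypothetical, much-simplified illustration (not the actual MAD-C algorithm): each node summarizes its points as the occupied cells of a coarse grid, only the cell coordinates are communicated, and a coordinator clusters the combined cells by grid adjacency.

```python
from collections import deque

def synopsis(points, cell):
    """Per-node step: reduce raw points to the set of occupied grid
    cells -- only these cell coordinates cross the network."""
    return {(int(x // cell), int(y // cell)) for x, y in points}

def combine(synopses):
    """Coordinator step: union the synopses and cluster occupied cells
    by 8-neighbour grid adjacency (connected components via BFS)."""
    cells = set().union(*synopses)
    clusters, seen = [], set()
    for start in cells:
        if start in seen:
            continue
        comp, queue = [], deque([start])
        seen.add(start)
        while queue:
            cx, cy = queue.popleft()
            comp.append((cx, cy))
            for dx in (-1, 0, 1):
                for dy in (-1, 0, 1):
                    nb = (cx + dx, cy + dy)
                    if nb in cells and nb not in seen:
                        seen.add(nb)
                        queue.append(nb)
        clusters.append(comp)
    return clusters

# Two nodes each observe parts of two well-separated point blobs.
blob = lambda cx, cy: [(cx + i * 0.01, cy + j * 0.01)
                       for i in range(10) for j in range(10)]
node1 = blob(0.0, 0.0) + blob(50.0, 50.0)
node2 = blob(0.5, 0.5) + blob(50.5, 50.5)
clusters = combine([synopsis(node1, 2.0), synopsis(node2, 2.0)])
print(len(clusters))  # → 2: the two blobs are recovered from the synopses
```

The approximation is visible here: each node transmits a handful of cell coordinates instead of hundreds of raw points, at the cost of cluster boundaries being resolved only to grid precision.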

    A taxonomy framework for unsupervised outlier detection techniques for multi-type data sets

    The term "outlier" can generally be defined as an observation that is significantly different from the other values in a data set. Outliers may be instances of error or may indicate events. The task of outlier detection aims at identifying such outliers in order to improve the analysis of data and further discover interesting and useful knowledge about unusual events within numerous application domains. In this paper, we report on contemporary unsupervised outlier detection techniques for multiple types of data sets and provide a comprehensive taxonomy framework and two decision trees for selecting the most suitable technique based on the characteristics of the data set. Furthermore, we highlight the advantages, disadvantages, and performance issues of each class of outlier detection techniques under this taxonomy framework.
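As a concrete example of one leaf such a taxonomy might contain — a statistics-based technique for univariate numeric data — the following sketch flags values by their modified z-score, using the median and the median absolute deviation (MAD) so that the estimates are not inflated by the very outliers being sought. This is a generic illustration, not a technique from the surveyed paper:

```python
import statistics

def mad_outliers(values, threshold=3.5):
    """Flag values whose modified z-score exceeds the threshold.
    Median and MAD replace mean and standard deviation, making the
    score robust to the outliers themselves."""
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values)
    return [v for v in values if 0.6745 * abs(v - med) / mad > threshold]

data = [10.1, 9.8, 10.3, 10.0, 9.9, 10.2, 10.1, 9.7, 10.0, 55.0]
print(mad_outliers(data))  # → [55.0]
```

A plain z-score on the same data would be swamped by the outlier's effect on the mean and standard deviation, which is exactly the kind of trade-off a taxonomy's decision trees help navigate.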

    Finding Anomalous Periodic Time Series: An Application to Catalogs of Periodic Variable Stars

    Full text link
    Catalogs of periodic variable stars contain large numbers of periodic light-curves (photometric time series data from the astrophysics domain). Separating anomalous objects from well-known classes is an important step towards the discovery of new classes of astronomical objects. Most anomaly detection methods for time series data assume either a single continuous time series or a set of time series whose periods are aligned. Light-curve data precludes the use of these methods, as the periods of any given pair of light-curves may be out of sync. One may use an existing anomaly detection method if, prior to similarity calculation, one performs the costly act of aligning two light-curves, an operation that scales poorly to massive data sets. This paper presents PCAD, an unsupervised anomaly detection method for large sets of unsynchronized periodic time-series data that outputs a ranked list of both global and local anomalies. It calculates an anomaly score for each light-curve in relation to a set of centroids produced by a modified k-means clustering algorithm. The method scales to large data sets through the use of sampling. We validate our method on both light-curve data and other time series data sets, demonstrate its effectiveness at finding known anomalies, and discuss the effect of sample size and number of centroids on our results. We compare our method to naive solutions and to existing time series anomaly detection methods for unphased data, and show that PCAD's reported anomalies are comparable to or better than those of all other methods. Finally, astrophysicists on our team have verified that PCAD finds true anomalies that might be indicative of novel astrophysical phenomena.
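The centroid-based scoring step can be sketched as follows. This is a hypothetical illustration using plain k-means (with deterministic farthest-point seeding) in place of PCAD's modified, phase-adjusting variant; the alignment the paper contributes is omitted:

```python
import math

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(series, k, iters=10):
    """Plain k-means with farthest-point seeding; PCAD instead uses a
    modified k-means that phase-aligns light-curves, omitted here."""
    cents = [list(series[0])]
    while len(cents) < k:
        cents.append(list(max(series,
                              key=lambda s: min(dist(s, c) for c in cents))))
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for s in series:
            groups[min(range(k), key=lambda i: dist(s, cents[i]))].append(s)
        cents = [[sum(col) / len(g) for col in zip(*g)] if g else c
                 for g, c in zip(groups, cents)]
    return cents

def anomaly_ranking(series, k):
    """Anomaly score = distance to the nearest centroid; indices are
    returned ranked from most to least anomalous."""
    cents = kmeans(series, k)
    scores = [min(dist(s, c) for c in cents) for s in series]
    return sorted(range(len(series)), key=lambda i: -scores[i])

# Two "classes" of periodic curves (sine-like, cosine-like) plus one
# flat anomaly appended at index 10.
T = range(20)
sines = [[math.sin(2 * math.pi * t / 20) + 0.01 * i for t in T] for i in range(5)]
coses = [[math.cos(2 * math.pi * t / 20) + 0.01 * i for t in T] for i in range(5)]
flat = [[0.0] * 20]
ranking = anomaly_ranking(sines + coses + flat, k=2)
print(ranking[0])  # → 10: the flat curve tops the anomaly ranking
```

Scoring against centroids rather than against all pairs is what lets this style of method, combined with sampling, scale to large catalogs.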

    Efficient biased sampling for approximate clustering and outlier detection in large data sets


    Model based fault diagnosis and prognosis of nonlinear systems

    Rapid technological advances have led to increasingly complex industrial systems with a significantly higher risk of failure. Therefore, in this dissertation, a model-based fault diagnosis and prognosis framework has been developed for fast and reliable detection of faults and prediction of failures in nonlinear systems. In the first paper, a unified model-based fault diagnosis scheme is designed that is capable of detecting both additive system faults and multiplicative actuator faults, as well as approximating the fault dynamics and performing fault-type and time-to-failure determination. Stability of the observer and the online approximator is guaranteed via an adaptive update law. Since outliers can degrade the performance of fault diagnostics, the second paper introduces an online neural network (NN) based outlier identification and removal scheme, which is then combined with a fault detection scheme to enhance its performance. Outliers are detected based on the estimation error, and a novel tuning law prevents the NN weights from being affected by outliers. In the third paper, in contrast to the first two, fault diagnosis of large-scale interconnected systems is investigated. A decentralized fault prognosis scheme is developed for such systems using a network of local fault detectors (LFDs), where each LFD requires only local measurements. The online approximators in each LFD learn the unknown interconnection functions and the fault dynamics. Derivations of robust detection thresholds and detectability conditions are also included. The fourth paper extends the decentralized fault detection of the third and develops an accommodation scheme for nonlinear continuous-time systems. By using both detection and accommodation online approximators, the control inputs are adjusted in order to minimize the fault effects. Finally, in the fifth paper, model-based fault diagnosis of distributed parameter systems (DPS) with a parabolic PDE representation in continuous time is discussed, where a PDE-based observer is designed to perform fault detection as well as estimate the unavailable system states. An adaptive online approximator is incorporated in the observer to identify unknown fault parameters. An adaptive update law guarantees the convergence of the estimates and allows determination of the remaining useful life --Abstract, page iv
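The common core of such model-based schemes — run a model alongside the plant and monitor the output residual — can be sketched in a few lines. This is a deliberately simplified stand-in (hypothetical scalar plant, fixed threshold, no adaptive observer or online approximator), not the dissertation's method:

```python
def detect_fault(steps=100, dt=0.1, a=1.0, threshold=0.05, fault_at=50):
    """Euler-simulate a scalar plant x' = -a*x + u alongside its
    fault-free model; declare a fault once the output residual
    |x - x_hat| exceeds a fixed threshold. Returns the detection step."""
    x = x_hat = 0.0
    for k in range(steps):
        u = 1.0
        bias = 0.5 if k >= fault_at else 0.0  # additive actuator fault
        x += dt * (-a * x + u + bias)         # the (faulty) plant
        x_hat += dt * (-a * x_hat + u)        # the fault-free model
        if abs(x - x_hat) > threshold:
            return k
    return None

print(detect_fault())  # detection a step or two after the fault at k=50
```

The dissertation's contribution sits in what this sketch omits: adaptive update laws that keep the observer stable, online approximators that learn the fault dynamics, and thresholds derived to be robust rather than fixed by hand.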

    Calibration and Evaluation of Outlier Detection with Generated Data

    Outlier detection is an essential part of data science --- an area with increasing relevance in a plethora of domains. While numerous approaches for the detection of outliers already exist, some significant challenges remain. Two prominent ones are that outliers are rare and not precisely defined. Both have serious consequences, especially for the calibration and evaluation of detection methods. This thesis is concerned with a possible way of dealing with these challenges: the generation of outliers. It discusses existing techniques for generating outliers and, specifically, their use in tackling the aforementioned challenges. In the literature, the topic of outlier generation has had little general structure so far --- despite the many techniques that have already been proposed. Thus, the first contribution of this thesis is a unified and crisp description of the state of the art in outlier generation and its usages. Given the variety of characteristics of generated outliers and the variety of methods designed for the detection of real outliers, it becomes apparent that comparisons of detection performance should be more distinctive than current state-of-the-art comparisons are. Such a distinctive comparison is the second central contribution of this thesis: a general process for the distinctive evaluation of outlier detection methods with generated data. The process developed in this thesis uses entirely artificial data in which the inliers are realistic representations of some real-world data and the outliers are deviations from these inliers with specific characteristics. The realness of the inliers allows performance evaluations to generalize to many other data domains. The carefully designed generation techniques for outliers allow insights into the effect of outlier characteristics.
So-called hidden outliers represent a special type of outlier: they depend on a set of selections of data attributes, i.e., a set of subspaces. Hidden outliers are detectable only in a particular set of subspaces; in the subspaces they are hidden from, they are not detectable. For outlier detection methods that make use of subspaces, hidden outliers are a blind spot: they can hide from the subspaces searched for outliers. Thus, hidden outliers are exciting to study, in particular for the evaluation of detection methods that use subspaces. The third central contribution of this thesis is a technique for the generation of hidden outliers, together with an analysis of the characteristics of such instances. First, the concept of hidden outliers is broached theoretically for this analysis. The developed technique is then used to validate the theoretical findings in more realistic contexts --- for example, to show that hidden outliers can appear in many real-world data sets. All in all, this dissertation gives the field of outlier generation much-needed structure and shows its usefulness in tackling prominent challenges of the outlier detection problem.
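A small, hypothetical demonstration of the phenomenon (illustrative only, not the thesis's generation technique): a point near neither 1-D marginal's fringe can still violate the 2-D correlation structure, so a simple k-NN-distance detector misses it in both single-attribute subspaces and catches it only in the full space.

```python
import math

def knn_scores(points, k=5):
    """k-th nearest-neighbour distance for every point: a simple
    unsupervised outlier score (larger = more outlying)."""
    scores = []
    for i, p in enumerate(points):
        ds = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(ds[k - 1])
    return scores

def flagged(data, index, subspace, factor=3.0):
    """Is point `index` outlying when only `subspace` attributes are
    used?  Outlying = score above `factor` times the median score."""
    proj = [tuple(p[i] for i in subspace) for p in data]
    s = knn_scores(proj)
    return s[index] > factor * sorted(s)[len(s) // 2]

# 200 inliers on the diagonal y = x, plus one hidden outlier at (0.1, 0.9):
# its x and y values are each unremarkable, but their combination is not.
data = [(i / 200, i / 200) for i in range(200)] + [(0.1, 0.9)]
for sub in [(0,), (1,), (0, 1)]:
    print(sub, flagged(data, 200, sub))
# → (0,) False / (1,) False / (0, 1) True: hidden in both 1-D subspaces
```

A subspace-based detector that only ever inspected the single-attribute projections would therefore never report this point, which is precisely why hidden outliers are valuable test instances for such methods.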