
    Atypicity Detection in Data Streams: a Self-Adjusting Approach

    Outlyingness is a subjective concept that depends on the isolation level of a (set of) record(s). Clustering-based outlier detection aims to cluster data and to flag outliers according to their characteristics (e.g. small, tight and/or dense clusters may be considered outliers). Existing methods require a parameter standing for the "level of outlyingness", such as the maximum size or a percentage of small clusters, in order to build the set of outliers. Unfortunately, manually setting this parameter is not feasible in a streaming environment, given the fast response time usually required. In this paper we propose WOD, a method that separates outliers from clusters thanks to a natural and effective principle. The main advantages of WOD are its ability to automatically adjust to any clustering result and its lack of parameters.
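    The WOD principle itself is not specified in the abstract. As a minimal sketch of the class of methods it improves on, the snippet below clusters data and flags members of small clusters as outliers; the hand-set min_cluster_size threshold is exactly the kind of "level of outlyingness" parameter WOD is designed to eliminate, and this is not the WOD algorithm itself.

        from collections import Counter

        import numpy as np
        from sklearn.cluster import DBSCAN

        def small_cluster_outliers(X, min_cluster_size=5, eps=0.5):
            # Cluster first, then treat undersized clusters as outliers.
            labels = DBSCAN(eps=eps, min_samples=3).fit_predict(X)
            sizes = Counter(labels)
            # DBSCAN's own noise label (-1) plus members of undersized clusters.
            return np.array([lbl == -1 or sizes[lbl] < min_cluster_size
                             for lbl in labels])

        rng = np.random.default_rng(0)
        X = np.vstack([rng.normal(0, 0.3, (100, 2)),   # one large, dense cluster
                       rng.normal(5, 0.1, (3, 2))])    # a tiny, isolated cluster
        print(small_cluster_outliers(X).nonzero()[0])  # likely flags the 3 isolated records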

    Anomaly Detections for Manufacturing Systems Based on Sensor Data—Insights into Two Challenging Real-World Production Settings

    To build, run, and maintain reliable manufacturing machines, the condition of their components has to be continuously monitored. When following a fine-grained monitoring of these machines, challenges emerge pertaining to (1) the feeding procedure of large amounts of sensor data to downstream processing components and (2) the meaningful analysis of the produced data. Regarding the latter aspect, manifold purposes are addressed by practitioners and researchers. Two analyses of real-world datasets generated in production settings are discussed in this paper. More specifically, the analyses had the goals (1) to detect sensor data anomalies for further analyses of a pharma packaging scenario and (2) to predict unfavorable temperature values of a 3D printing machine environment. Based on the results of the analyses, it is shown that proper management of machines and their components in industrial manufacturing environments can be efficiently supported by anomaly detection. The latter should, in turn, better support the technical evangelists of the production companies.
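    The abstract does not name the models used in the two case studies. As a generic, hedged illustration of streaming sensor anomaly detection of the kind described (e.g. flagging unfavorable temperature readings), a rolling z-score flags values that deviate sharply from a recent window while using bounded memory; the window size and threshold below are illustrative assumptions.

        from collections import deque
        import math

        def rolling_zscore_anomalies(stream, window=50, threshold=3.0):
            buf = deque(maxlen=window)  # sliding window of recent readings
            for t, x in enumerate(stream):
                if len(buf) == window:
                    mean = sum(buf) / window
                    std = math.sqrt(sum((v - mean) ** 2 for v in buf) / window)
                    if std > 0 and abs(x - mean) / std > threshold:
                        yield t, x  # reading deviates sharply from recent history
                buf.append(x)

        readings = [21.0] * 60 + [21.2, 48.5, 21.1]  # one synthetic spike
        print(list(rolling_zscore_anomalies(readings)))  # -> [(61, 48.5)]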

    A taxonomy framework for unsupervised outlier detection techniques for multi-type data sets

    The term "outlier" can generally be defined as an observation that is significantly different from the other values in a data set. The outliers may be instances of error or indicate events. The task of outlier detection aims at identifying such outliers in order to improve the analysis of data and further discover interesting and useful knowledge about unusual events within numerous applications domains. In this paper, we report on contemporary unsupervised outlier detection techniques for multiple types of data sets and provide a comprehensive taxonomy framework and two decision trees to select the most suitable technique based on data set. Furthermore, we highlight the advantages, disadvantages and performance issues of each class of outlier detection techniques under this taxonomy framework

    Automatically Selecting Parameters for Graph-Based Clustering

    Data streams present a number of challenges caused by changes in stream concepts over time. In this thesis we present a novel method for detecting concept drift within data streams by analysing geometric features of the clustering algorithm RepStream. Further, we present novel methods for automatically adjusting critical input parameters over time and for generating self-organising nearest-neighbour graphs, improving robustness and decreasing the need for domain-specific knowledge in the face of stream evolution.
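    The thesis derives drift signals from geometric features of RepStream's clustering, which is not reproduced here. A much simpler stand-in for window-based drift detection compares a reference window against a recent window with a two-sample Kolmogorov-Smirnov test; the window size and significance level are illustrative parameters.

        from scipy.stats import ks_2samp

        def detect_drift(stream, window=200, alpha=0.01):
            reference, recent = [], []
            for t, x in enumerate(stream):
                if len(reference) < window:
                    reference.append(x)      # fill the reference window first
                else:
                    recent.append(x)
                    if len(recent) == window:
                        if ks_2samp(reference, recent).pvalue < alpha:
                            yield t                         # drift point detected
                            reference, recent = recent, []  # re-anchor on the new concept
                        else:
                            recent.pop(0)                   # slide the recent window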

    No-Reference Quality Assessment of digital images

    Doctor of Philosophy (Ph.D.) thesis.

    Data stream mining: methods and challenges for handling concept drift.

    Mining and analysing streaming data is crucial for many applications, and this area of research has gained extensive attention over the past decade. However, there are several inherent problems that continue to challenge the hardware and the state-of-the-art algorithmic solutions. Examples of such problems include the unbounded size, varying speed and unknown data characteristics of arriving instances from a data stream. The aim of this research is to portray key challenges faced by algorithmic solutions for stream mining, particularly focusing on the prevalent issue of concept drift. A comprehensive discussion of concept drift and its inherent data challenges in the context of stream mining is presented, as is a critical, in-depth review of relevant literature. Current issues with the evaluative procedure for concept drift detectors are also explored, highlighting problems such as a lack of established base datasets and the impact of temporal dependence on concept drift detection. By exposing gaps in the current literature, this study suggests recommendations for future research which should aid in the progression of stream mining and concept drift detection algorithms.
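    One concrete instance of the temporal-dependence problem raised above: on streams where consecutive labels are correlated, a trivial "no-change" baseline that always predicts the previous label can look deceptively accurate, so raw accuracy is a weak yardstick for drift-adaptive learners. The sketch below computes that baseline for comparison; it is an evaluation aid, not a method from the paper.

        def no_change_accuracy(labels):
            """Accuracy of always predicting the previous label (prequential style)."""
            hits = sum(prev == cur for prev, cur in zip(labels, labels[1:]))
            return hits / max(len(labels) - 1, 1)

        # 5 of 7 transitions repeat the previous label: ~0.71 accuracy with no model at all.
        print(no_change_accuracy([0, 0, 0, 1, 1, 1, 0, 0]))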

    Online Analysis of Dynamic Streaming Data

    This dissertation, "Online Analysis of Dynamic Streaming Data", addresses distance measurement over dynamic, semi-structured data in continuous data streams, in order to enable analyses on these data structures at runtime. To this end, a formalization of distance computation for static and dynamic trees is introduced and extended with an explicit treatment of the dynamics of the attributes of individual tree nodes. The real-time analysis based on this distance measurement is complemented by a density-based clustering, demonstrating applications of the clustering to classification as well as to anomaly detection. The results of this work rest on a theoretical analysis of the introduced formalization of distance measurement for dynamic trees. These analyses are supported by empirical measurements on monitoring data of batch jobs from the batch system of the GridKa data and computing centre. The evaluation of the proposed formalization and of the real-time analysis methods built on it demonstrates the efficiency and scalability of the approach. It is further shown that considering attributes and attribute statistics is of particular importance for the quality of analysis results on dynamic, semi-structured data. The evaluation also shows that result quality can be further improved by independently combining several distances. In particular, the results of this work enable the analysis of data that changes over time.
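    The dissertation's formalization is not reproduced here. As a far simpler stand-in for a distance between attributed trees, the sketch below flattens each tree into a multiset of root-to-node label paths and takes the normalized symmetric difference; real tree edit distances, and the attribute statistics the thesis emphasizes, are considerably more refined.

        from collections import Counter

        def paths(tree, prefix=()):
            """tree = (label, [children]); yield every root-to-node label path."""
            label, children = tree
            here = prefix + (label,)
            yield here
            for child in children:
                yield from paths(child, here)

        def tree_distance(t1, t2):
            # Normalized symmetric difference of the two path multisets, in [0, 1].
            c1, c2 = Counter(paths(t1)), Counter(paths(t2))
            diff = sum((c1 - c2).values()) + sum((c2 - c1).values())
            return diff / (sum(c1.values()) + sum(c2.values()))

        job_a = ("job", [("cpu", []), ("io", [("read", [])])])
        job_b = ("job", [("cpu", []), ("io", [("write", [])])])
        print(tree_distance(job_a, job_b))  # 0.25: one differing leaf out of 8 paths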

    Applied Randomized Algorithms for Efficient Genomic Analysis

    The scope and scale of biological data continues to grow at an exponential clip, driven by advances in genetic sequencing, annotation and widespread adoption of surveillance efforts. For instance, the Sequence Read Archive (SRA) now contains more than 25 petabases of public data, while RefSeq, a collection of reference genomes, recently surpassed 100,000 complete genomes. In the process, it has outgrown the practical reach of many traditional algorithmic approaches in both time and space. Motivated by this extreme scale, this thesis details efficient methods for clustering and summarizing large collections of sequence data. While our primary area of interest is biological sequences, these approaches largely apply to sequence collections of any type, including natural language, software source code, and graph-structured data. We applied recent advances in randomized algorithms to practical problems. We used MinHash and HyperLogLog, both examples of Locality-Sensitive Hashing, as well as coresets, which are approximate representations for finite sum problems, to build methods capable of scaling to billions of items. Ultimately, these are all derived from variations on sampling. We combined these advances with hardware-based optimizations and incorporated them into free and open-source software libraries (sketch, frp, libsimdsampling) and practical software tools built on these libraries (Dashing, Minicore, Dashing 2), empowering users to interact practically with colossal datasets on commodity hardware.
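    A compact illustration of one of the named primitives: MinHash estimates the Jaccard similarity of two sets from fixed-size signatures, which is what lets tools in this family compare genomes without materializing full k-mer sets. Python's built-in hash() is used for brevity (signatures are only comparable within one process); production sketches use stronger hash functions.

        def minhash_signature(items, num_hashes=128):
            # One minimum per seeded hash function yields a fixed-size signature.
            return [min(hash((seed, x)) for x in items) for seed in range(num_hashes)]

        def estimate_jaccard(sig_a, sig_b):
            # The fraction of matching signature slots estimates Jaccard similarity.
            return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

        a = set("the quick brown fox jumps over a lazy dog".split())
        b = set("the quick brown cat jumps over a lazy dog".split())
        # True Jaccard is 8/10; the estimate should land close to 0.8.
        print(estimate_jaccard(minhash_signature(a), minhash_signature(b)))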

    Significant Feature Clustering

    In this thesis, we present a new clustering algorithm we call Significance Feature Clustering, which is designed to cluster text documents. Its central premise is the mapping of raw frequency count vectors to discrete-valued significance vectors whose entries are -1, 0, or 1, representing whether a word is significantly negative, neutral, or significantly positive, respectively. Initially, standard tf-idf vectors are computed from raw frequency vectors; these tf-idf vectors are then transformed into significance vectors using a parameter alpha, which controls whether each vector entry maps to -1, 0, or 1. SFC clusters agglomeratively, with each document's significance vector initially forming a cluster of size one containing just that document, and iteratively merges the two clusters whose average vectors are most similar under cosine similarity. We show that with a good alpha value, the significance vectors produced by SFC accurately indicate which words are significant to which documents, as well as the type of significance, and therefore yield a good clustering in terms of a well-known definition of clustering quality. We further demonstrate that a user need not manually select an alpha, as we develop a new definition of clustering quality that is highly correlated with text clustering quality. Our metric extends the family of metrics known as internal similarity so that it can be applied to a tree of clusters rather than a set, and it also factors in an aspect of recall that was absent from previous internal similarity metrics. Using this new definition of internal similarity, which we call maximum tree internal similarity, we show that a close-to-optimal text clustering may be picked from any number of clusterings created by different alpha values. The automatically selected clusterings have qualities close to those of a well-known and powerful hierarchical clustering algorithm.
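    The abstract does not spell out the exact alpha mapping, so the sketch below shows one plausible reading: tf-idf entries more than alpha standard deviations above or below the vector's mean map to +1 or -1, and everything else to 0. The thresholding rule here is an assumption, not SFC's published definition.

        import numpy as np

        def significance_vector(tfidf_row, alpha=1.0):
            # Map each tf-idf weight to -1, 0, or +1 relative to the row's statistics.
            mean, std = tfidf_row.mean(), tfidf_row.std()
            sig = np.zeros_like(tfidf_row, dtype=int)
            sig[tfidf_row > mean + alpha * std] = 1   # significantly positive words
            sig[tfidf_row < mean - alpha * std] = -1  # significantly negative words
            return sig

        row = np.array([0.02, 0.03, 0.91, 0.01, 0.00])
        print(significance_vector(row))  # -> [0 0 1 0 0]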