6 research outputs found

    Optimizing Data Stream Representation: An Extensive Survey on Stream Clustering Algorithms

    Get PDF
    Abstract Analyzing data streams has received considerable attention over the past decades due to the widespread usage of sensors, social media and other streaming data sources. A core research area in this field is stream clustering which aims to recognize patterns in an unordered, infinite and evolving stream of observations. Clustering can be a crucial support in decision making, since it aims for an optimized aggregated representation of a continuous data stream over time and allows to identify patterns in large and high-dimensional data. A multitude of algorithms and approaches has been developed that are able to find and maintain clusters over time in the challenging streaming scenario. This survey explores, summarizes and categorizes a total of 51 stream clustering algorithms and identifies core research threads over the past decades. In particular, it identifies categories of algorithms based on distance thresholds, density grids and statistical models as well as algorithms for high dimensional data. Furthermore, it discusses applications scenarios, available software and how to configure stream clustering algorithms. This survey is considerably more extensive than comparable studies, more up-to-date and highlights how concepts are interrelated and have been developed over time

    Building environmentally-aware classifiers on streaming data

    Get PDF
    The three biggest challenges currently faced in machine learning, in our estimation, are the staggering quantity of data we wish to analyze, the incredibly small proportion of these data that are labeled, and the apparent lack of interest in creating algorithms that continually learn during inference. An unsupervised streaming approach addresses all three of these challenges, storing only a finite amount of information to model an unbounded dataset and adapting to new structures as they arise. Specifically, we are motivated by automated target recognition (ATR) in synthetic aperture sonar (SAS) imagery, the problem of finding explosive hazards on the sea oor. It has been shown that the performance of ATR can be improved by, instead of using a single classifier for the entire ATR task, creating several specialized classifers and fusing their predictions [44]. The prevailing opinion seems be that one should have different classifiers for varying complexity of sea oor [74], but we hypothesize that fusing classifiers based on sea bottom type will yield higher accuracy and better lend itself to making explainable classification decisions. The first step of building such a system is developing a robust framework for online texture classification, the topic of this research. xi In this work, we improve upon StreamSoNG [85], an existing algorithm for streaming data analysis (SDA) that models each structure in the data with a neural gas [69] and detects new structures by clustering an outlier list with the possibilistic 1-means [62] (P1M) algorithm. We call the modified algorithm StreamSoNGv2, denoting that it is the second version, or verse, if you will, of StreamSoNG. Notable improvements include detection of arbitrarily-shaped clusters by using DBSCAN [37] instead of P1M, using growing neural gas [43] to model each structure with an adaptive number of prototypes, and an automated approach to estimate the n parameters. Furthermore, we propose a novel algorithm called single-pass possibilistic clustering (SPC) for solving the same task. SPC maintains a fixed number of structures to model the data stream. These structures can be updated and merged based only on their "footprints", that is, summary statistics that contain all of the information from the stream needed by the algorithm without directly maintaining the entire stream. SPC is built on a damped window framework, allowing the user to balance the weight between old and new points in the stream with a decay factor parameter. We evaluate the two algorithms under consideration against four state of the art SDA algorithms from the literature on several synthetic datasets and two texture datasets: one real (KTH-TIPS2b [68]) and xii one simulated. The simulated dataset, a significant research effort in itself, is of our own construction in Unreal Engine and contains on the order of 6,000 images at 720 x 720 resolution from six different texture types. Our hope is that the methodology developed here will be effective texture classifiers for use not only in underwater scene understanding, but also in improving performance of ATR algorithms by providing a context in which the potential target is embedded.Includes bibliographical references

    A Clustering System for Dynamic Data Streams Based on Metaheuristic Optimisation

    Get PDF
    This article presents the Optimised Stream clustering algorithm (OpStream), a novel approach to cluster dynamic data streams. The proposed system displays desirable features, such as a low number of parameters and good scalability capabilities to both high-dimensional data and numbers of clusters in the dataset, and it is based on a hybrid structure using deterministic clustering methods and stochastic optimisation approaches to optimally centre the clusters. Similar to other state-of-the-art methods available in the literature, it uses “microclusters” and other established techniques, such as density based clustering. Unlike other methods, it makes use of metaheuristic optimisation to maximise performances during the initialisation phase, which precedes the classic online phase. Experimental results show that OpStream outperforms the state-of-the-art methods in several cases, and it is always competitive against other comparison algorithms regardless of the chosen optimisation method. Three variants of OpStream, each coming with a different optimisation algorithm, are presented in this study. A thorough sensitive analysis is performed by using the best variant to point out OpStream’s robustness to noise and resiliency to parameter change

    A Frequent Pattern Conjunction Heuristic for Rule Generation in Data Streams

    Get PDF
    This paper introduces a new and expressive algorithm for inducing descriptive rule-sets from streaming data in real-time in order to describe frequent patterns explicitly encoded in the stream. Data Stream Mining (DSM) is concerned with the automatic analysis of data streams in real-time. Rapid flows of data challenge the state-of-the art processing and communication infrastructure, hence the motivation for research and innovation into real-time algorithms that analyse data streams on-the-fly and can automatically adapt to concept drifts. To date, DSM techniques have largely focused on predictive data mining applications that aim to forecast the value of a particular target feature of unseen data instances, answering questions such as whether a credit card transaction is fraudulent or not. A real-time, expressive and descriptive Data Mining technique for streaming data has not been previously established as part of the DSM toolkit. This has motivated the work reported in this paper, which has resulted in developing and validating a Generalised Rule Induction (GRI) tool, thus producing expressive rules as explanations that can be easily understood by human analysts. The expressiveness of decision models in data streams serves the objectives of transparency, underpinning the vision of `explainable AI’ and yet is an area of research that has attracted less attention despite being of high practical importance. The algorithm introduced and described in this paper is termed Fast Generalised Rule Induction (FGRI). FGRI is able to induce descriptive rules incrementally for raw data from both categorical and numerical features. FGRI is able to adapt rule-sets to changes of the pattern encoded in the data stream (concept drift) on the fly as new data arrives and can thus be applied continuously in real-time. The paper also provides a theoretical, qualitative and empirical evaluation of FGRI
    corecore