2,164 research outputs found

    Fully online clustering of evolving data streams into arbitrarily shaped clusters

    Get PDF
    In recent times there has been an increase in data availability in continuous data streams and clustering of this data has many advantages in data analysis. It is often the case that these data streams are not stationary, but evolve over time, and also that the clusters are not regular shapes but form arbitrary shapes in the data space. Previous techniques for clustering such data streams are either hybrid online / offline methods, windowed offline methods, or find only hyper-elliptical clusters. In this paper we present a fully online technique for clustering evolving data streams into arbitrary shaped clusters. It is a two stage technique that is accurate, robust to noise, computationally and memory efficient, with a low time penalty as the number of data dimensions increases. The first stage of the technique produces micro-clusters and the second stage combines these micro- clusters into macro-clusters. Dimensional stability and high speed is achieved through keeping the calculations both simple and minimal using hyper-spherical micro-clusters. By maintaining a graph structure, where the micro-clusters are the nodes and the edges are its pairs with intersecting micro-clusters, we minimise the calculations required for macro-cluster maintenance. The micro- clusters themselves are described in such a way that there is no calculation required for the core and shell regions and no separate definition of outer micro-clusters necessary. We demonstrate the ability of the proposed technique to join and separate macro-clusters as they evolve in a fully online manner. There are no other fully online techniques that the authors are aware of and so we compare the tech- nique with popular online / offline hybrid alternatives for accuracy, purity and speed. The technique is then applied to real atmospheric science data streams and used to discover short term, long term and seasonal drift and the effects on anomaly detection. As well as having favourable computational characteristics, the technique can add analytic value over hyper-elliptical methods by character- ising the cluster hyper-shape using Euclidean or fractal shape factors. Because the technique records macro-clusters as graphs, further analytic value accrues from characterising the order, degree, and completeness of the cluster-graphs as they evolve over time

    Finding and tracking multi-density clusters in an online dynamic data stream

    Get PDF
    The file attached to this record is the author's final peer reviewed version.Change is one of the biggest challenges in dynamic stream mining. From a data-mining perspective, adapting and tracking change is desirable in order to understand how and why change has occurred. Clustering, a form of unsupervised learning, can be used to identify the underlying patterns in a stream. Density-based clustering identifies clusters as areas of high density separated by areas of low density. This paper proposes a Multi-Density Stream Clustering (MDSC) algorithm to address these two problems; the multi-density problem and the problem of discovering and tracking changes in a dynamic stream. MDSC consists of two on-line components; discovered, labelled clusters and an outlier buffer. Incoming points are assigned to a live cluster or passed to the outlier buffer. New clusters are discovered in the buffer using an ant-inspired swarm intelligence approach. The newly discovered cluster is uniquely labelled and added to the set of live clusters. Processed data is subject to an ageing function and will disappear when it is no longer relevant. MDSC is shown to perform favourably to state-of-the-art peer stream-clustering algorithms on a range of real and synthetic data-streams. Experimental results suggest that MDSC can discover qualitatively useful patterns while being scalable and robust to noise

    Data Stream Clustering: A Review

    Full text link
    Number of connected devices is steadily increasing and these devices continuously generate data streams. Real-time processing of data streams is arousing interest despite many challenges. Clustering is one of the most suitable methods for real-time data stream processing, because it can be applied with less prior information about the data and it does not need labeled instances. However, data stream clustering differs from traditional clustering in many aspects and it has several challenging issues. Here, we provide information regarding the concepts and common characteristics of data streams, such as concept drift, data structures for data streams, time window models and outlier detection. We comprehensively review recent data stream clustering algorithms and analyze them in terms of the base clustering technique, computational complexity and clustering accuracy. A comparison of these algorithms is given along with still open problems. We indicate popular data stream repositories and datasets, stream processing tools and platforms. Open problems about data stream clustering are also discussed.Comment: Has been accepted for publication in Artificial Intelligence Revie

    A new online clustering approach for data in arbitrary shaped clusters

    Get PDF
    In this paper we demonstrate a new density based clustering technique, CODAS, for online clustering of streaming data into arbitrary shaped clusters. CODAS is a two stage process using a simple local density to initiate micro-clusters which are then combined into clusters. Memory efficiency is gained by not storing or re-using any data. Computational efficiency is gained by using hyper-spherical micro-clusters to achieve a micro-cluster joining technique that is dimensionally independent for speed. The micro-clusters divide the data space in to sub-spaces with a core region and a non-core region. Core regions which intersect define the clusters. A threshold value is used to identify outlier micro-clusters separately from small clusters of unusual data. The cluster information is fully maintained on-line. In this paper we compare CODAS with ELM, DEC, Chameleon, DBScan and Denstream and demonstrate that CODAS achieves comparable results but in a fully on-line and dimensionally scale-able manner

    Advanced analysis and visualisation techniques for atmospheric data

    Get PDF
    Atmospheric science is the study of a large, complex system which is becoming increasingly important to understand. There are many climate models which aim to contribute to that understanding by computational simulation of the atmosphere. To generate these models, and to confirm the accuracy of their outputs, requires the collection of large amounts of data. These data are typically gathered during campaigns lasting a few weeks, during which various sources of measurements are used. Some are ground based, others airborne sondes, but one of the primary sources is from measurement instruments on board aircraft. Flight planning for the numerous sorties is based on pre-determined goals with unpredictable influences, such as weather patterns, and the results of some limited analyses of data from previous sorties. There is little scope for adjusting the flight parameters during the sortie based on the data received due to the large volumes of data and difficulty in processing the data online. The introduction of unmanned aircraft with extended flight durations also requires a team of mission scientists with the added complications of disseminating observations between shifts. Earth’s atmosphere is a non-linear system, whereas the data gathered is sampled at discrete temporal and spatial intervals introducing a source of variance. Clustering data provides a convenient way of grouping similar data while also acknowledging that, for each discrete sample, a minor shift in time and/ or space could produce a range of values which lie within its cluster region. This thesis puts forward a set of requirements to enable the presentation of cluster analyses to the mission scientist in a convenient and functional manner. This will enable in-flight decision making as well as rapid feedback for future flight planning. Current state of the art clustering algorithms are analysed and a solution to all of the proposed requirements is not found. New clustering algorithms are developed to achieve these goals. These novel clustering algorithms are brought together, along with other visualization techniques, into a software package which is used to demonstrate how the analyses can provide information to mission scientists in flight. The ability to carry out offline analyses on historical data, whether to reproduce the online analyses of the current sortie, or to provide comparative analyses from previous missions, is also demonstrated. Methods for offline analyses of historical data prior to continuing the analyses in an online manner are also considered. The original contributions in this thesis are the development of five new clustering algorithms which address key challenges: speed and accuracy for typical hyper-elliptical offline clustering; speed and accuracy for offline arbitrarily shaped clusters; online dynamic and evolving clustering for arbitrary shaped clusters; transitions between offline and online techniques and also the application of these techniques to atmospheric science data analysis

    Dynamic feature selection for clustering high dimensional data streams

    Get PDF
    open access articleChange in a data stream can occur at the concept level and at the feature level. Change at the feature level can occur if new, additional features appear in the stream or if the importance and relevance of a feature changes as the stream progresses. This type of change has not received as much attention as concept-level change. Furthermore, a lot of the methods proposed for clustering streams (density-based, graph-based, and grid-based) rely on some form of distance as a similarity metric and this is problematic in high-dimensional data where the curse of dimensionality renders distance measurements and any concept of “density” difficult. To address these two challenges we propose combining them and framing the problem as a feature selection problem, specifically a dynamic feature selection problem. We propose a dynamic feature mask for clustering high dimensional data streams. Redundant features are masked and clustering is performed along unmasked, relevant features. If a feature's perceived importance changes, the mask is updated accordingly; previously unimportant features are unmasked and features which lose relevance become masked. The proposed method is algorithm-independent and can be used with any of the existing density-based clustering algorithms which typically do not have a mechanism for dealing with feature drift and struggle with high-dimensional data. We evaluate the proposed method on four density-based clustering algorithms across four high-dimensional streams; two text streams and two image streams. In each case, the proposed dynamic feature mask improves clustering performance and reduces the processing time required by the underlying algorithm. Furthermore, change at the feature level can be observed and tracked

    Data stream mining techniques: a review

    Get PDF
    A plethora of infinite data is generated from the Internet and other information sources. Analyzing this massive data in real-time and extracting valuable knowledge using different mining applications platforms have been an area for research and industry as well. However, data stream mining has different challenges making it different from traditional data mining. Recently, many studies have addressed the concerns on massive data mining problems and proposed several techniques that produce impressive results. In this paper, we review real time clustering and classification mining techniques for data stream. We analyze the characteristics of data stream mining and discuss the challenges and research issues of data steam mining. Finally, we present some of the platforms for data stream mining

    On-device modeling of user's social context and familiar places from smartphone-embedded sensor data

    Full text link
    Context modeling and recognition are crucial for adaptive mobile and ubiquitous computing. Context-awareness in mobile environments relies on prompt reactions to context changes. However, current solutions focus on limited context information processed on centralized architectures, risking privacy leakage and lacking personalization. On-device context modeling and recognition are emerging research trends, addressing these concerns. Social interactions and visited locations play significant roles in characterizing daily life scenarios. This paper proposes an unsupervised and lightweight approach to model the user's social context and locations directly on the mobile device. Leveraging the ego-network model, the system extracts high-level, semantic-rich context features from smartphone-embedded sensor data. For the social context, the approach utilizes data on physical and cyber social interactions among users and their devices. Regarding location, it prioritizes modeling the familiarity degree of specific locations over raw location data, such as GPS coordinates and proximity devices. The effectiveness of the proposed approach is demonstrated through three sets of experiments, employing five real-world datasets. These experiments evaluate the structure of social and location ego networks, provide a semantic evaluation of the proposed models, and assess mobile computing performance. Finally, the relevance of the extracted features is showcased by the improved performance of three machine learning models in recognizing daily-life situations. Compared to using only features related to physical context, the proposed approach achieves a 3% improvement in AUROC, 9% in Precision, and 5% in Recall

    Online Arbitrary Shaped Clustering through Correlated Gaussian Functions

    Full text link
    There is no convincing evidence that backpropagation is a biologically plausible mechanism, and further studies of alternative learning methods are needed. A novel online clustering algorithm is presented that can produce arbitrary shaped clusters from inputs in an unsupervised manner, and requires no prior knowledge of the number of clusters in the input data. This is achieved by finding correlated outputs from functions that capture commonly occurring input patterns. The algorithm can be deemed more biologically plausible than model optimization through backpropagation, although practical applicability may require additional research. However, the method yields satisfactory results on several toy datasets on a noteworthy range of hyperparameters.Comment: Corrected uniform distribution range; removed "average" from last sentence in section

    A Short Survey on Data Clustering Algorithms

    Full text link
    With rapidly increasing data, clustering algorithms are important tools for data analytics in modern research. They have been successfully applied to a wide range of domains; for instance, bioinformatics, speech recognition, and financial analysis. Formally speaking, given a set of data instances, a clustering algorithm is expected to divide the set of data instances into the subsets which maximize the intra-subset similarity and inter-subset dissimilarity, where a similarity measure is defined beforehand. In this work, the state-of-the-arts clustering algorithms are reviewed from design concept to methodology; Different clustering paradigms are discussed. Advanced clustering algorithms are also discussed. After that, the existing clustering evaluation metrics are reviewed. A summary with future insights is provided at the end
    corecore