758 research outputs found

    Robust M-Estimation Based Bayesian Cluster Enumeration for Real Elliptically Symmetric Distributions

    Full text link
    Robustly determining the optimal number of clusters in a data set is an essential factor in a wide range of applications. Cluster enumeration becomes challenging when the true underlying structure in the observed data is corrupted by heavy-tailed noise and outliers. Recently, Bayesian cluster enumeration criteria have been derived by formulating cluster enumeration as maximization of the posterior probability of candidate models. This article generalizes robust Bayesian cluster enumeration so that it can be used with any arbitrary Real Elliptically Symmetric (RES) distributed mixture model. Our framework also covers the case of M-estimators that allow for mixture models, which are decoupled from a specific probability distribution. Examples of Huber's and Tukey's M-estimators are discussed. We derive a robust criterion for data sets with finite sample size, and also provide an asymptotic approximation to reduce the computational cost at large sample sizes. The algorithms are applied to simulated and real-world data sets, including radar-based person identification, and show a significant robustness improvement in comparison to existing methods

    Robust and Distributed Cluster Enumeration and Object Labeling

    Get PDF
    This dissertation contributes to the area of cluster analysis by providing principled methods to determine the number of data clusters and cluster memberships, even in the presence of outliers. The main theoretical contributions are summarized in two theorems on Bayesian cluster enumeration based on modeling the data as a family of Gaussian and t distributions. Real-world applicability is demonstrated by considering advanced signal processing applications, such as distributed camera networks and radar-based person identification. In particular, a new cluster enumeration criterion, which is applicable to a broad class of data distributions, is derived by utilizing Bayes' theorem and asymptotic approximations. This serves as a starting point when deriving cluster enumeration criteria for specific data distributions. Along this line, a Bayesian cluster enumeration criterion is derived by modeling the data as a family of multivariate Gaussian distributions. In real-world applications, the observed data is often subject to heavy tailed noise and outliers which obscure the true underlying structure of the data. Consequently, estimating the number of data clusters becomes challenging. To this end, a robust cluster enumeration criterion is derived by modeling the data as a family of multivariate t distributions. The family of t distributions is flexible by variation of its degree of freedom parameter (ν) and it contains, as special cases, the heavy tailed Cauchy for ν = 1, and the Gaussian distribution for ν → ∞. Given that ν is sufficiently small, the robust criterion accounts for outliers by giving them less weight in the objective function. A further contribution of this dissertation lies in refining the penalty terms of both the robust and Gaussian criterion for the finite sample regime. The derived cluster enumeration criteria require a clustering algorithm that partitions the data according to the number of clusters specified by each candidate model and provides an estimate of cluster parameters. Hence, a model-based unsupervised learning method is applied to partition the data prior to the calculation of an enumeration criterion, resulting in a two-step algorithm. The proposed algorithm provides a unified framework for the estimation of the number of clusters and cluster memberships. The developed algorithms are applied to two advanced signal processing use cases. Specifically, the cluster enumeration criteria are extended to a distributed sensor network setting by proposing two distributed and adaptive Bayesian cluster enumeration algorithms. The proposed algorithms are applied to a camera network use case, where the task is to estimate the number of pedestrians based on streaming-in data collected by multiple cameras filming a non-stationary scene from different viewpoints. A further research focus of this dissertation is the cluster membership assignment of individual data points and their associated cluster labels given that the number of clusters is either prespecified by the user or estimated by one of the methods described earlier. Solving this task is required in a broad range of applications, such as distributed sensor networks and radar-based person identification. For this purpose, an adaptive joint object labeling and tracking algorithm is proposed and applied to a real data use case of pedestrian labeling in a calibration-free multi-object multi-camera setup with low video resolution and frequent object occlusions. The proposed algorithm is well suited for ad hoc networks, as it requires neither registration of camera views nor a fusion center. Finally, a joint cluster enumeration and labeling algorithm is proposed to deal with the combined problem of estimating the number of clusters and cluster memberships at the same time. The proposed algorithm is applied to person labeling in a real data application of radar-based person identification without prior information on the number of individuals. It achieves comparable performance to a supervised approach that requires knowledge of the number of persons and a considerable amount of training data with known cluster labels. The proposed unsupervised method is advantageous in the considered application of smart assisted living, as it extracts the missing information from the data. Based on these examples, and, also considering the comparably low computational cost, we conjuncture that the proposed methods provide a useful set of robust cluster analysis tools for data science with many potential application areas, not only in the area of engineering

    Online Spectral Clustering on Network Streams

    Get PDF
    Graph is an extremely useful representation of a wide variety of practical systems in data analysis. Recently, with the fast accumulation of stream data from various type of networks, significant research interests have arisen on spectral clustering for network streams (or evolving networks). Compared with the general spectral clustering problem, the data analysis of this new type of problems may have additional requirements, such as short processing time, scalability in distributed computing environments, and temporal variation tracking. However, to design a spectral clustering method to satisfy these requirements certainly presents non-trivial efforts. There are three major challenges for the new algorithm design. The first challenge is online clustering computation. Most of the existing spectral methods on evolving networks are off-line methods, using standard eigensystem solvers such as the Lanczos method. It needs to recompute solutions from scratch at each time point. The second challenge is the parallelization of algorithms. To parallelize such algorithms is non-trivial since standard eigen solvers are iterative algorithms and the number of iterations can not be predetermined. The third challenge is the very limited existing work. In addition, there exists multiple limitations in the existing method, such as computational inefficiency on large similarity changes, the lack of sound theoretical basis, and the lack of effective way to handle accumulated approximate errors and large data variations over time. In this thesis, we proposed a new online spectral graph clustering approach with a family of three novel spectrum approximation algorithms. Our algorithms incrementally update the eigenpairs in an online manner to improve the computational performance. Our approaches outperformed the existing method in computational efficiency and scalability while retaining competitive or even better clustering accuracy. We derived our spectrum approximation techniques GEPT and EEPT through formal theoretical analysis. The well established matrix perturbation theory forms a solid theoretic foundation for our online clustering method. We facilitated our clustering method with a new metric to track accumulated approximation errors and measure the short-term temporal variation. The metric not only provides a balance between computational efficiency and clustering accuracy, but also offers a useful tool to adapt the online algorithm to the condition of unexpected drastic noise. In addition, we discussed our preliminary work on approximate graph mining with evolutionary process, non-stationary Bayesian Network structure learning from non-stationary time series data, and Bayesian Network structure learning with text priors imposed by non-parametric hierarchical topic modeling

    A review of clustering techniques and developments

    Full text link
    © 2017 Elsevier B.V. This paper presents a comprehensive study on clustering: exiting methods and developments made at various times. Clustering is defined as an unsupervised learning where the objects are grouped on the basis of some similarity inherent among them. There are different methods for clustering the objects such as hierarchical, partitional, grid, density based and model based. The approaches used in these methods are discussed with their respective states of art and applicability. The measures of similarity as well as the evaluation criteria, which are the central components of clustering, are also presented in the paper. The applications of clustering in some fields like image segmentation, object and character recognition and data mining are highlighted

    Clustering-based approaches to SAGE data mining

    Get PDF
    Serial analysis of gene expression (SAGE) is one of the most powerful tools for global gene expression profiling. It has led to several biological discoveries and biomedical applications, such as the prediction of new gene functions and the identification of biomarkers in human cancer research. Clustering techniques have become fundamental approaches in these applications. This paper reviews relevant clustering techniques specifically designed for this type of data. It places an emphasis on current limitations and opportunities in this area for supporting biologically-meaningful data mining and visualisation
    • …
    corecore