
    A robust hierarchical clustering for georeferenced data

    The detection of spatially contiguous clusters is a relevant task in geostatistics, since nearby observations tend to be more similar than distant ones. Spatially compact groups also make clustering results easier to interpret in terms of the detected subregions. In this paper, we propose a robust metric approach to neutralize the effect of possible outliers: an exponential transformation of a dissimilarity measure between each pair of locations, based on a non-parametric kernel estimator of the direct and cross variograms (Fouedjio, 2016) and on a different bandwidth identification, suitable for agglomerative hierarchical clustering techniques applied to data indexed by geographical coordinates. Simulation results are promising, showing that the proposed metric performs very well with respect to baseline metrics. Finally, the new clustering approach is applied to two real-world data sets, both giving locations and topsoil heavy metal concentrations.
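
    As a rough illustration, the sketch below feeds an exponentially transformed dissimilarity into standard agglomerative clustering. The blend of spatial and attribute distances is a crude stand-in for the paper's kernel variogram estimator, and the median-based bandwidth h is an assumption, not the paper's bandwidth identification.

```python
# Sketch: agglomerative clustering of georeferenced data with an
# exponentially transformed dissimilarity. The raw dissimilarity below is a
# crude proxy for the paper's variogram-based measure; coords, values, and
# the bandwidth h are illustrative placeholders.
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
coords = rng.uniform(0, 10, size=(50, 2))                   # spatial locations
values = np.sin(coords[:, 0]) + 0.1 * rng.normal(size=50)   # attribute per location

# Raw dissimilarity mixing (normalized) spatial and attribute distances.
d_space = pdist(coords)
d_attr = pdist(values[:, None])
d_raw = d_space / d_space.max() + d_attr / d_attr.max()

# Exponential transform: bounded in [0, 1), so extreme (outlying)
# dissimilarities are damped instead of dominating the linkage.
h = np.median(d_raw)                 # bandwidth; the paper derives it differently
d_robust = 1.0 - np.exp(-d_raw / h)

Z = linkage(d_robust, method="average")
print(fcluster(Z, t=4, criterion="maxclust"))   # 4 spatially driven clusters
```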

    Robust hierarchical k-center clustering

    One of the most popular and widely used methods for data clustering is hierarchical clustering. This technique has proved useful for revealing interesting structure in data in applications ranging from computational biology to computer vision. Robustness is an important feature of a clustering technique if we require the clustering to be stable against small perturbations in the input data. In most applications, a clustering output that is robust against adversarial outliers or stochastic noise is a necessary condition for the applicability and effectiveness of the technique. This is even more critical in hierarchical clustering, where a small change at the bottom of the hierarchy may propagate all the way to the top. Despite all the previous work [2, 3, 6, 8], our theoretical understanding of robust hierarchical clustering is still limited, and several hierarchical clustering algorithms are not known to satisfy such robustness properties. In this paper, we study the limits of robust hierarchical k-center clustering by introducing the concept of universal hierarchical clustering, and we provide (almost) tight lower and upper bounds for the robust hierarchical k-center clustering problem with outliers and for variants of the stochastic clustering problem. Most importantly, we present a constant-factor approximation for optimal hierarchical k-center with at most z outliers using a universal set of at most O(z^2) outliers, and we show that this result is tight. Moreover, we show the necessity of using a universal set of outliers in order to compute an approximately optimal hierarchical k-center with a different set of outliers for each k.
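
    For orientation, here is a minimal sketch of the greedy (Gonzalez) k-center baseline with a naive post-hoc outlier heuristic. It is not the paper's algorithm: the universal-outlier construction and its guarantees are not reproduced, and the z-farthest-points rule is only an assumption for illustration.

```python
# Sketch: greedy (Gonzalez) k-center, the classic 2-approximation without
# outliers, plus a naive heuristic that discards the z farthest points.
import numpy as np

def greedy_k_center(points: np.ndarray, k: int) -> list[int]:
    """Return indices of k centers chosen by farthest-point traversal."""
    centers = [0]                                  # arbitrary first center
    dist = np.linalg.norm(points - points[0], axis=1)
    for _ in range(k - 1):
        nxt = int(np.argmax(dist))                 # farthest point becomes a center
        centers.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(points - points[nxt], axis=1))
    return centers

rng = np.random.default_rng(1)
pts = rng.normal(size=(200, 2))
centers = greedy_k_center(pts, k=5)
d = np.min(np.linalg.norm(pts[:, None] - pts[centers][None], axis=2), axis=1)
z = 4
print("radius ignoring the z farthest points:", np.sort(d)[:-z].max())
```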

    Robust hierarchical clustering for novelty identification in sensor networks: With applications to industrial systems

    The paper proposes a new, robust cluster-based classification technique for novelty identification in sensor networks that possess a high degree of correlation among data streams. During normal operation, a uniform cluster across objects (sensors) is generated, indicating the absence of novelties. Conversely, in the presence of a novelty, the associated sensor is clustered distinctly from the remaining sensors, thereby isolating the data stream that exhibits the novelty. It is shown how small perturbations (stemming from noise, for instance) can affect the performance of traditional clustering methods, and that the proposed variant is robust to such influences. Moreover, the proposed method is compared with a recently reported technique and shown to be 365% faster computationally. As an application case study, the technique is used to identify emerging fault modes in a sensor network on a sub-15 MW industrial gas turbine in the presence of other abrupt but normal changes that might otherwise be visually interpreted as malfunctions.
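
    The sketch below illustrates the clustering idea on synthetic streams: highly correlated sensors collapse into one cluster, while a stream with an injected drift separates into a singleton. The correlation-based dissimilarity, linkage choice, and data are assumptions for illustration, not the paper's exact technique.

```python
# Sketch: isolating a novel sensor stream by clustering on correlation.
import numpy as np
from scipy.spatial.distance import squareform
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(2)
base = rng.normal(size=1000)                  # shared underlying signal
streams = np.stack([base + 0.05 * rng.normal(size=1000) for _ in range(8)])
streams[3] += np.linspace(0, 3, 1000)         # inject a drift (novelty) in sensor 3

corr = np.corrcoef(streams)                   # stream-to-stream correlation
diss = squareform(1 - corr, checks=False)     # dissimilarity = 1 - correlation

labels = fcluster(linkage(diss, method="complete"), t=2, criterion="maxclust")
print(labels)   # sensor 3 receives its own label; the rest share one cluster
```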

    Estimation of intrinsic dimension via clustering

    The problem of estimating the intrinsic dimension of a set of points in high-dimensional space is a critical issue for a wide range of disciplines, including genomics, finance, and networking. The computational complexity of current estimation techniques depends on either the ambient or the intrinsic dimension, which may make these methods intractable for large data sets. In this paper, we present a clustering-based methodology that exploits the inherent self-similarity of data to efficiently estimate the intrinsic dimension of a set of points. When the data satisfy a specified general clustering condition, we prove that the estimated dimension approaches the true Hausdorff dimension. Experiments show that the clustering-based approach allows for more efficient and accurate intrinsic dimension estimation than all prior techniques, even when the data do not conform to an obvious self-similarity structure. Finally, we present empirical results showing that the clustering-based estimation allows for a natural partitioning of data points that lie on separate manifolds of varying intrinsic dimension.
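
    As a stand-in for the paper's clustering-based estimator, the sketch below uses the classical scaling idea it builds on: on a d-dimensional manifold the fraction of point pairs within radius r grows roughly like r^d, so the slope of log(count) against log(radius) estimates the intrinsic dimension.

```python
# Sketch: intrinsic dimension from pair-count scaling (correlation dimension).
# A 2-D disc embedded in 10-D ambient space should yield a slope near 2.
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(3)
theta = rng.uniform(0, 2 * np.pi, 2000)
r = np.sqrt(rng.uniform(0, 1, 2000))
X = np.zeros((2000, 10))                                  # ambient dimension 10
X[:, 0], X[:, 1] = r * np.cos(theta), r * np.sin(theta)   # intrinsic dimension 2

d = pdist(X)
radii = np.logspace(-1.5, -0.5, 10)
counts = np.array([(d < rad).mean() for rad in radii])    # fraction of close pairs

slope = np.polyfit(np.log(radii), np.log(counts), 1)[0]
print(f"estimated intrinsic dimension: {slope:.2f}")      # close to 2
```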

    Incremental Clustering: The Case for Extra Clusters

    The explosion in the amount of data available for analysis often necessitates a transition from batch to incremental clustering methods, which process one element at a time and typically store only a small subset of the data. In this paper, we initiate the formal analysis of incremental clustering methods, focusing on the types of cluster structure that they are able to detect. We find that the incremental setting is strictly weaker than the batch model, proving that a fundamental class of cluster structures that can readily be detected in the batch setting is impossible to identify using any incremental method. Furthermore, we show how the limitations of incremental clustering can be overcome by allowing additional clusters.
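
    A minimal example of the incremental regime is sequential "leader" clustering, sketched below: each element is seen once and only cluster centers are retained. The distance threshold is an illustrative parameter; this is not a construction from the paper.

```python
# Sketch: one-pass "leader" clustering. Each arriving point joins its
# nearest cluster if close enough, otherwise opens a new one; only the
# centers are stored, never the full stream.
import numpy as np

def leader_clustering(stream, threshold: float):
    centers: list[np.ndarray] = []
    labels = []
    for x in stream:
        if centers:
            dists = [float(np.linalg.norm(x - c)) for c in centers]
            j = int(np.argmin(dists))
            if dists[j] <= threshold:
                labels.append(j)
                continue
        centers.append(x)                  # open a new cluster
        labels.append(len(centers) - 1)
    return centers, labels

rng = np.random.default_rng(4)
stream = np.concatenate([rng.normal(0, 0.3, (100, 2)),
                         rng.normal(5, 0.3, (100, 2))])
centers, labels = leader_clustering(stream, threshold=2.0)
print(len(centers), "clusters found in one pass")
```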

    Adaptive Multiscale Weighted Permutation Entropy for Rolling Bearing Fault Diagnosis

    Bearing vibration signals contain non-linear and non-stationary features due to instantaneous variations in the operation of rotating machinery. It is important to characterize and analyze the complexity changes of bearing vibration signals so that bearing health conditions can be accurately identified. Entropy measures are non-linear indicators applicable to time-series complexity analysis for machine fault diagnosis. In this paper, an improved entropy measure, termed Adaptive Multiscale Weighted Permutation Entropy (AMWPE), is proposed. A new rolling bearing fault diagnosis method is then developed based on the AMWPE and a multi-class SVM. For comparison, experimental bearing data are analyzed using the AMWPE and conventional entropy measures, with a multi-class SVM adopted for fault type classification. Moreover, the robustness of the different entropy measures is further studied by analyzing noisy signals with various signal-to-noise ratios (SNRs). The experimental results demonstrate the effectiveness of the proposed method in fault diagnosis of rolling bearings under different fault types, severity degrees, and SNR levels.
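
    The sketch below computes plain Weighted Permutation Entropy, the building block that the AMWPE extends with adaptive multiscale analysis; the embedding dimension and delay are illustrative choices.

```python
# Sketch: Weighted Permutation Entropy (WPE). Each embedding window
# contributes its ordinal pattern, weighted by the window's variance so that
# high-amplitude structure counts more than flat noise-level segments.
import math
import numpy as np

def weighted_permutation_entropy(x, m=4, tau=1):
    n = len(x) - (m - 1) * tau
    weights: dict[tuple, float] = {}
    for i in range(n):
        window = x[i : i + m * tau : tau]
        pattern = tuple(np.argsort(window))    # ordinal pattern of the window
        weights[pattern] = weights.get(pattern, 0.0) + float(np.var(window))
    total = sum(weights.values())
    p = np.array([w / total for w in weights.values()])
    p = p[p > 0]
    return -np.sum(p * np.log(p)) / math.log(math.factorial(m))  # normalized to [0, 1]

rng = np.random.default_rng(5)
print(weighted_permutation_entropy(rng.normal(size=2000)))           # noise: near 1
print(weighted_permutation_entropy(np.sin(0.05 * np.arange(2000))))  # tone: much lower
```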

    Clustering Partially Observed Graphs via Convex Optimization

    This paper considers the problem of clustering a partially observed unweighted graph, i.e., one where for some node pairs we know there is an edge between them, for some others we know there is no edge, and for the remaining pairs we do not know whether or not there is an edge. We want to organize the nodes into disjoint clusters so that connectivity is relatively dense (among observed pairs) within clusters and sparse across clusters. We take a novel yet natural approach to this problem, focusing on finding the clustering that minimizes the number of "disagreements", i.e., the sum of the number of (observed) missing edges within clusters and (observed) present edges across clusters. Our algorithm uses convex optimization; its basis is a reduction of disagreement minimization to the problem of recovering an (unknown) low-rank matrix and an (unknown) sparse matrix from their partially observed sum. We evaluate the performance of our algorithm on the classical Planted Partition/Stochastic Block Model. Our main theorem provides sufficient conditions for the success of our algorithm as a function of the minimum cluster size, edge density, and observation probability; in particular, the results characterize the tradeoff between the observation probability and the edge density gap. When there are a constant number of clusters of equal size, our results are optimal up to logarithmic factors.
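
    The sketch below shows the generic low-rank-plus-sparse decomposition (Principal Component Pursuit solved by ADMM) that the reduction resembles, in the fully observed case. The paper's program additionally handles unobserved entries; the matrix sizes, lam, and rho here are illustrative.

```python
# Sketch: Principal Component Pursuit via ADMM, decomposing an observed
# matrix M into a low-rank part L and a sparse part S. Fully observed case
# only; the paper's setting also has unknown entries.
import numpy as np

def svt(X, tau):
    """Singular value thresholding: prox operator of the nuclear norm."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def shrink(X, tau):
    """Elementwise soft thresholding: prox operator of the l1 norm."""
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)

def pcp(M, lam=None, rho=1.0, iters=300):
    lam = lam if lam is not None else 1.0 / np.sqrt(max(M.shape))
    S = np.zeros_like(M)
    Y = np.zeros_like(M)
    for _ in range(iters):
        L = svt(M - S + Y / rho, 1.0 / rho)
        S = shrink(M - L + Y / rho, lam / rho)
        Y += rho * (M - L - S)                 # dual update on the residual
    return L, S

# Two planted clusters of 20 nodes each, plus sparse "disagreement" noise.
rng = np.random.default_rng(6)
A = np.kron(np.eye(2), np.ones((20, 20)))      # ideal block-diagonal adjacency
noise = (rng.random(A.shape) < 0.05).astype(float)
L, S = pcp(A + noise)
print(np.round(L[:3, :3], 2))                  # approximately the block structure
```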