    Generalized Projected Clustering in High-Dimensional Data Streams

    Abstract. Clustering identifies densely populated subgroups in data, while correlation analysis finds dependencies between the attributes of a data set. In this paper, we combine the two techniques in the domain of data streams, i.e., we look for dense subgroups of data points that share a strong correlation. Such correlation connected clusters [11] are meaningful in many areas; e.g., in e-business, positive correlations indicate sets of similar purchase patterns. However, detecting such clusters in streaming data is difficult: in high-dimensional streams, the inherent sparsity means that correlations are local to subgroups, and the correlation itself can have an arbitrarily complex direction, that is, one set of attributes may depend on another set. We present a novel method, ACID, to overcome these problems in detecting correlation connected clusters in data streams. The method incorporates principal component analysis (PCA), streaming cluster feature vectors (SCF), and the SCF-Tree (a variant of the CF-Tree). It scales well in both the size of the stream and the dimensionality of the data, and it is robust against noise. Extensive experiments on both synthetic and real data demonstrate the efficiency and effectiveness of our approach.
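The abstract names PCA and streaming cluster feature vectors without giving their definitions. As a rough illustration of how the two ideas can fit together, the sketch below keeps an additive summary of a group of points (count, linear sum, and sum of outer products) and derives a covariance matrix and its dominant direction from that summary alone. This is not the ACID method or its SCF structure: the class `ClusterFeature`, the function `principal_direction`, and the choice of power iteration in place of a full PCA are all assumptions made for the example.

```python
import math


class ClusterFeature:
    """Hypothetical additive summary of d-dimensional points (not the paper's SCF)."""

    def __init__(self, dim):
        self.dim = dim
        self.n = 0                                            # number of points absorbed
        self.linear_sum = [0.0] * dim                         # sum of the points
        self.outer_sum = [[0.0] * dim for _ in range(dim)]    # sum of x * x^T

    def add(self, point):
        """Absorb one streaming point without storing it."""
        self.n += 1
        for i in range(self.dim):
            self.linear_sum[i] += point[i]
            for j in range(self.dim):
                self.outer_sum[i][j] += point[i] * point[j]

    def merge(self, other):
        """Summaries are additive, so merging two groups is a component-wise sum."""
        self.n += other.n
        for i in range(self.dim):
            self.linear_sum[i] += other.linear_sum[i]
            for j in range(self.dim):
                self.outer_sum[i][j] += other.outer_sum[i][j]

    def covariance(self):
        """Population covariance recovered from the summary: E[x x^T] - mean mean^T."""
        mean = [s / self.n for s in self.linear_sum]
        return [
            [self.outer_sum[i][j] / self.n - mean[i] * mean[j] for j in range(self.dim)]
            for i in range(self.dim)
        ]


def principal_direction(cov, iterations=200):
    """Approximate the dominant eigenvector of cov by power iteration.

    The asymmetric start vector lowers the chance of starting orthogonal to the
    dominant direction; it does not rule that case out.
    """
    dim = len(cov)
    v = [1.0 / (i + 1) for i in range(dim)]
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(dim)) for i in range(dim)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:   # zero covariance: no direction to report
            break
        v = [x / norm for x in w]
    return v
```

Because the summary is additive, two groups can be merged without revisiting their points, which is the property that makes summaries of this kind attractive for streams.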