
    Hierarchical Subspace Clustering

    It is well known that traditional clustering methods that consider all dimensions of the feature space usually fail in terms of efficiency and effectiveness when applied to high-dimensional data. This poor behavior stems from the fact that clusters often cannot be found in the full high-dimensional feature space, although they exist in subspaces of it. To overcome these limitations of traditional clustering methods, several methods for subspace clustering have been proposed recently. Subspace clustering algorithms aim at automatically identifying lower-dimensional subspaces of the feature space in which clusters exist. There are two types of subspace clustering algorithms: algorithms for detecting clusters in axis-parallel subspaces and, as an extension, algorithms for finding clusters in arbitrarily oriented subspaces. In general, subspace clusters may be hierarchically nested, i.e., several subspace clusters of low dimensionality may together form a subspace cluster of higher dimensionality. Since existing subspace clustering methods are not able to detect these complex structures, hierarchical approaches for subspace clustering are required. The goal of this dissertation is to develop new efficient and effective methods for hierarchical subspace clustering by identifying novel challenges for the hierarchical approach and proposing innovative and solid solutions to these challenges. The first part of this work deals with the analysis of hierarchical subspace clusters in axis-parallel subspaces. Two new methods are proposed that search simultaneously for subspace clusters of arbitrary dimensionality in order to detect complex hierarchies of subspace clusters. Furthermore, a new visualization model for the clustering result, by means of a graph representation, is provided. The second part of this work discusses new methods for hierarchical clustering in arbitrarily oriented subspaces of the feature space.
So-called correlation clustering can be seen as an extension of axis-parallel subspace clustering. Correlation clustering aims at grouping the data set into subsets, the so-called correlation clusters, such that the objects in the same correlation cluster exhibit uniform attribute correlations. Two new hierarchical approaches are proposed that combine density-based clustering with Principal Component Analysis (PCA) in order to identify hierarchies of correlation clusters. The last part of this work addresses the analysis and interpretation of the results obtained from correlation clustering algorithms. A general method is introduced to extract quantitative information on the linear dependencies within given correlation clusters. Furthermore, these quantitative models can be used to predict the probability that an object was generated by one of them. Both the efficiency and the effectiveness of the presented techniques are thoroughly analyzed, and their benefits over traditional approaches are shown by evaluating the new methods on synthetic as well as real-world test data sets.
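The quantitative models mentioned above can be illustrated with a small PCA sketch: the weak eigenvectors of a correlation cluster's covariance matrix (those with near-zero eigenvalues) each yield one linear equation describing a dependency among the attributes. The data, the eigenvalue threshold, and the printed equation format below are illustrative assumptions, not the dissertation's exact method.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical correlation cluster: points near a 1-D line in 3-D space
# (x2 = 2*x1, x3 = -x1), plus small noise.
t = rng.uniform(-1, 1, size=(200, 1))
points = np.hstack([t, 2 * t, -t]) + rng.normal(scale=0.01, size=(200, 3))

mean = points.mean(axis=0)
cov = np.cov(points - mean, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order

# Weak eigenvectors (small eigenvalues) span directions in which the
# cluster has almost no extent; each yields a linear dependency
# v . (x - mean) ~ 0.  The 1% threshold is an illustrative choice.
threshold = 0.01 * eigvals.max()
weak = eigvecs[:, eigvals < threshold]
print(f"correlation dimensionality: {points.shape[1] - weak.shape[1]}")
for v in weak.T:
    terms = " ".join(f"{c:+.2f}*x{i+1}" for i, c in enumerate(v))
    print(f"{terms} = {v @ mean:+.2f}")
```

For this synthetic line, two weak eigenvectors remain, so the extracted model consists of two linear equations and the cluster has correlation dimensionality one.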

    FIS-by-Step: Visualization of the Fast Index Scan for Nearest Neighbor Queries

    Many different index structures have been proposed for spatial databases to support efficient query processing. However, most of these index structures suffer from an exponential dependency of processing time on the dimensionality of the data objects. Due to this fact, an alternative approach for query processing on high-dimensional data is simply to perform a sequential scan over the entire data set; this approach often yields lower I/O costs than using a multi-dimensional index. The Fast Index Scan combines these two techniques and optimizes the number and order of blocks that are processed in a single chained I/O operation. In this demonstration we present a tool called FIS-by-Step which visualizes the individual I/O operations of a Fast Index Scan while processing a nearest neighbor query. FIS-by-Step assists the development and evaluation of new cost models for the Fast Index Scan by providing the user with significant information about the applied page access strategy in each step of the algorithm.
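For context, the baseline that the Fast Index Scan generalizes is the plain sequential scan: read every object once and keep the best candidate so far. The sketch below shows that baseline only (not the Fast Index Scan's block-ordering optimization); the data set and function name are illustrative.

```python
import numpy as np

def nearest_neighbor_scan(data: np.ndarray, query: np.ndarray) -> int:
    """Sequential-scan 1-NN: one linear pass over the data set.

    On high-dimensional data this single chained read is often cheaper
    in I/O than descending a multi-dimensional index."""
    best, best_dist = -1, np.inf
    for i, obj in enumerate(data):
        d = float(np.linalg.norm(obj - query))
        if d < best_dist:                  # pruning distance shrinks as we scan
            best, best_dist = i, d
    return best

rng = np.random.default_rng(1)
data = rng.normal(size=(1000, 64))         # high-dimensional data set
query = rng.normal(size=64)
print(nearest_neighbor_scan(data, query))
```

The Fast Index Scan improves on this by deciding, per step, which index pages are still worth reading and chaining them into as few I/O operations as possible.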

    Mining hierarchies of correlation clusters

    The detection of correlations between different features in high-dimensional data sets is a very important data mining task. These correlations can be arbitrarily complex: one or more features might be correlated with several other features, and both the noise features and the actual dependencies may differ from cluster to cluster. Therefore, each cluster contains points that are located on a common hyperplane of arbitrary dimensionality in the data space and thus generates a separate, arbitrarily oriented subspace of the original data space. The few recently proposed algorithms designed to uncover these correlation clusters have several disadvantages. In particular, these methods cannot detect correlation clusters of different dimensionality that are nested into each other. The complete hierarchical structure of correlation clusters of varying dimensionality can only be detected by a hierarchical clustering approach. Therefore, we propose the algorithm HiCO (Hierarchical Correlation Ordering), the first hierarchical approach to correlation clustering. The algorithm determines the cluster hierarchy and visualizes it using correlation diagrams. Several comparative experiments using synthetic and real data sets show the performance and the effectiveness of HiCO.
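A core primitive behind hierarchical correlation clustering is estimating, per point, how many directions its neighborhood actually extends in: run PCA over the point's k nearest neighbors and count the strong eigenvalues. The variance fraction `alpha` and the parameter values below are illustrative assumptions, not necessarily the ones HiCO uses.

```python
import numpy as np

def local_correlation_dimensionality(data, i, k=20, alpha=0.85):
    """Estimate the correlation dimensionality at point i: PCA over its
    k nearest neighbors, counting the eigenvalues needed to explain an
    alpha fraction of the local variance (alpha is illustrative)."""
    dists = np.linalg.norm(data - data[i], axis=1)
    nn = data[np.argsort(dists)[:k]]
    eigvals = np.linalg.eigvalsh(np.cov(nn - nn.mean(axis=0), rowvar=False))[::-1]
    explained = np.cumsum(eigvals) / eigvals.sum()
    return int(np.searchsorted(explained, alpha) + 1)

rng = np.random.default_rng(2)
# Points on a 2-D plane embedded in 5-D space, plus mild noise.
basis = np.linalg.qr(rng.normal(size=(5, 2)))[0].T   # orthonormal 2x5 basis
data = rng.uniform(-1, 1, size=(300, 2)) @ basis
data += rng.normal(scale=0.01, size=data.shape)
print(local_correlation_dimensionality(data, i=0))
```

Points whose local dimensionality is low can then be ordered below higher-dimensional clusters, which is what makes the nested hierarchy detectable.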

    Clustering Multi-represented Objects Using Combination Trees

    When clustering complex objects, there often exist various feature transformations and thus multiple object representations. To cluster multi-represented objects, dedicated data mining algorithms have been shown to achieve improved results. In this paper, we introduce combination trees for describing arbitrary semantic relationships, which can be used to extend the hierarchical clustering algorithm OPTICS to handle multi-represented data objects. To back up the usability of our proposed method, we present encouraging results on real-world data sets.

    Hierarchical Density-Based Clustering for Multi-Represented Objects

    In recent years, the complexity of data objects in data mining applications has increased, as has their sheer number. As a result, there exist various feature transformations and thus multiple object representations. For example, an image can be described by a text annotation, a color histogram, and some texture features. To cluster these multi-represented objects, dedicated data mining algorithms have been shown to achieve improved results. In this paper, we therefore introduce a method for hierarchical density-based clustering of multi-represented objects that is insensitive to the choice of parameters. Furthermore, we introduce a theoretical model that allows us to draw conclusions about the interaction of representations, and we show how these conclusions can be used to define a suitable combination method for multiple representations. To back up the usability of our proposed method, we present encouraging results for clustering a real-world image data set described by four different representations.
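One simple family of combination methods for multiple representations works at the distance level: a "union"-style rule considers two objects close if they are close in either representation, an "intersection"-style rule requires closeness in both. These min/max rules are illustrative combinations in the spirit of the paper, not its exact definitions, and all names below are made up for the sketch.

```python
import numpy as np

def combined_distance(dist_a, dist_b, mode="union"):
    """Combine two per-representation distance matrices.

    'union' takes the elementwise minimum (close in either representation),
    'intersection' the maximum (close in both). Illustrative rules only."""
    if mode == "union":
        return np.minimum(dist_a, dist_b)
    return np.maximum(dist_a, dist_b)

rng = np.random.default_rng(3)
text_feats = rng.normal(size=(50, 10))     # e.g. text-annotation vectors
color_feats = rng.normal(size=(50, 32))    # e.g. color histograms
d_text = np.linalg.norm(text_feats[:, None] - text_feats[None], axis=-1)
d_color = np.linalg.norm(color_feats[:, None] - color_feats[None], axis=-1)

d_union = combined_distance(d_text, d_color, "union")
# The combined matrix can then be fed to any density-based clusterer
# that accepts precomputed distances, such as DBSCAN or OPTICS.
```

Which rule is appropriate depends on whether the representations are complementary (union) or each individually noisy (intersection).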

    Robust, complete, and efficient correlation clustering

    Correlation clustering aims at the detection of data points that appear as hyperplanes in the data space and thus exhibit common correlations between different subsets of features. Recently proposed methods for correlation clustering usually suffer from several severe drawbacks, including poor robustness against noise or parameter settings, incomplete results (i.e., missed clusters), poor usability due to complex input parameters, and poor scalability. In this paper, we propose the novel correlation clustering algorithm COPAC (COrrelation PArtition Clustering), which aims at improved robustness, completeness, usability, and efficiency. Our experimental evaluation empirically shows that COPAC is superior to existing state-of-the-art correlation clustering methods in terms of runtime, accuracy, and completeness of the results.

    Reverse k-Nearest Neighbor Search in Dynamic and General Metric Databases

    In this paper, we propose an original solution for the general reverse k-nearest neighbor (RkNN) search problem. In contrast to existing methods for RkNN search, which suffer from various limitations, our approach works on top of any hierarchically organized, tree-like index structure and is thus applicable to any type of data as long as a metric distance function is defined on the data objects. We show exemplarily how our approach works on top of the most prevalent index structures for Euclidean and metric data, the R-Tree and the M-Tree, respectively. Our solution is applicable for arbitrary values of k and can also be applied in dynamic environments where updates of the database occur frequently. Despite being the most general solution to the RkNN problem, our solution outperforms existing methods in terms of query execution time because it exploits different strategies for pruning false drops and identifying true hits as early as possible.
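The RkNN query itself is easy to state even though answering it efficiently is hard: object i is a result for query q iff q would be among the k nearest neighbors of i. The following brute-force sketch implements only this O(n²) definition as a reference point, not the paper's index-based method; all names are illustrative.

```python
import numpy as np

def rknn_brute_force(data: np.ndarray, q: np.ndarray, k: int) -> list[int]:
    """Reverse k-NN by definition: i is a result iff fewer than k
    database objects are closer to i than the query q is."""
    result = []
    for i, x in enumerate(data):
        # distances from x to all other database objects (excluding x itself)
        d_others = np.delete(np.linalg.norm(data - x, axis=1), i)
        d_q = np.linalg.norm(q - x)
        if np.sum(d_others < d_q) < k:
            result.append(i)
    return result

rng = np.random.default_rng(4)
data = rng.normal(size=(200, 2))
print(rknn_brute_force(data, rng.normal(size=2), k=3))
```

An index-based solution avoids the quadratic cost by pruning whole subtrees whose objects provably cannot have q among their k nearest neighbors.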

    Online Hierarchical Clustering in a Data Warehouse Environment

    Many important industrial applications rely on data mining methods to uncover patterns and trends in large data warehouse environments. Since a data warehouse is typically updated periodically in batch mode, the mined patterns have to be updated as well. This requires not only accuracy from data mining methods but also fast availability of up-to-date knowledge, particularly in the presence of a heavy update load. To cope with this problem, we propose the use of online data mining algorithms which permanently store the discovered knowledge in suitable data structures and enable an efficient adaptation of these structures after insertions and deletions on the raw data. In this paper, we demonstrate how hierarchical clustering methods can be reformulated as online algorithms, based on the hierarchical clustering method OPTICS and using a density estimator for data grouping. We also discuss how this algorithmic schema can be specialized for efficient online single-link clustering. A broad experimental evaluation demonstrates its superior efficiency, with significant speed-up factors even for large bulk insertions and deletions.
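As a reference point for the single-link specialization, here is the offline computation the online schema avoids repeating: single-link clusters cut at height t are exactly the connected components obtained by linking every pair of points closer than t. The data set and threshold below are illustrative; the paper's contribution is maintaining this result incrementally under insertions and deletions.

```python
import numpy as np

def single_link_labels(data: np.ndarray, threshold: float) -> np.ndarray:
    """Naive offline single-link clustering via union-find: connect all
    pairs within `threshold`; components equal the dendrogram cut there."""
    n = len(data)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    d = np.linalg.norm(data[:, None] - data[None], axis=-1)
    for i in range(n):
        for j in range(i + 1, n):
            if d[i, j] <= threshold:
                parent[find(i)] = find(j)
    roots = [find(i) for i in range(n)]
    _, labels = np.unique(roots, return_inverse=True)
    return labels

rng = np.random.default_rng(5)
data = np.vstack([rng.normal(0, 0.1, size=(30, 2)),
                  rng.normal(5, 0.1, size=(30, 2))])
print(len(np.unique(single_link_labels(data, threshold=1.0))))  # → 2
```

Recomputing this from scratch after every warehouse batch costs O(n²) per update; the online reformulation adjusts only the affected parts of the structure.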
