3 research outputs found

    A new framework for clustering

    The difficulty of clustering and the variety of clustering methods suggest the need for a theoretical study of clustering. Building on the idea of a standard statistical framework, we propose a new framework for clustering. To give clustering a well-defined goal, we assume that the data to be clustered come from an underlying distribution, and we aim to find a high-density cluster tree. We regard this tree as a parameter of interest for the underlying distribution. However, it is not obvious how to determine a connected subset in a discrete distribution whose support is located in a Euclidean space. Building a cluster tree for such a distribution is an open problem and presents interesting conceptual and computational challenges. We solve this problem using graph-based approaches and further parameterize clustering using the high-density cluster tree and its extension. Motivated by the connection between clustering outcomes and graphs, we propose a graph family framework, which plays an important role in our clustering framework. A direct application of the graph family framework is a new cluster-tree distance measure. This distance measure can be written as an inner product or kernel, which allows our clustering framework to perform statistical assessment of clustering via simulation. Other applications, such as a method for integrating partitions into a cluster tree and methods for cluster tree averaging and bagging, are also derived from the graph family framework.

    Clustering for 2D chemical structures

    The clustering of chemical structures is important and widely used in several areas of chemoinformatics. A little-discussed aspect of clustering is standardization, which ensures that all descriptors in a chemical representation make a comparable contribution to the measurement of similarity. The first study compares the effectiveness of seven standardization procedures that have been suggested previously; the results were also compared with those for unstandardized datasets. No single standardization method was found to offer consistently the best performance. Comparative studies of clustering effectiveness help to assess the suitability of different methods and to provide guidelines for their use. To examine the suitability of different clustering methods for chemoinformatics applications, especially methods that had not previously been applied in chemoinformatics, the second study compares the effectiveness of nine clustering methods. The results revealed that a single clustering method is unlikely to provide consistently the best partition under all circumstances. Consensus clustering is a technique that combines multiple input partitions of the same set of objects into a single clustering, which is expected to provide a more robust and more generally effective representation of the partitions that are submitted. The third study reports the use of seven consensus clustering methods that had not previously been used on sets of chemical compounds represented by 2D fingerprints. Their effectiveness was compared with that of some of the traditional clustering methods discussed in the second study. No consensus clustering method was found to be consistently best.
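
The idea of combining several input partitions into one clustering can be illustrated with a minimal sketch. This is not one of the seven consensus methods evaluated in the thesis, which the abstract does not name; it assumes a simple co-association (evidence accumulation) scheme, and the function names `co_association` and `consensus` as well as the 0.5 threshold are hypothetical choices made for illustration only.

```python
def co_association(partitions):
    """Fraction of input partitions in which each pair of objects shares a cluster."""
    n = len(partitions[0])
    counts = [[0] * n for _ in range(n)]
    for labels in partitions:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    total = len(partitions)
    return [[c / total for c in row] for row in counts]


def consensus(partitions, threshold=0.5):
    """Assumed scheme: link objects whose co-association exceeds the threshold,
    then report the connected components as the consensus clusters."""
    matrix = co_association(partitions)
    n = len(matrix)
    parent = list(range(n))

    def find(x):
        # Follow parent pointers to the component root, halving paths on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i in range(n):
        for j in range(i + 1, n):
            if matrix[i][j] > threshold:
                parent[find(i)] = find(j)

    # Relabel components 0, 1, 2, ... in order of first appearance.
    ids = {}
    return [ids.setdefault(find(i), len(ids)) for i in range(n)]


# Three hypothetical partitions of five objects (cluster labels per object).
partitions = [
    [0, 0, 1, 1, 2],
    [0, 0, 0, 1, 1],
    [1, 1, 0, 0, 0],
]
print(consensus(partitions))
```

Objects 0 and 1 are grouped together in every input partition, so they stay together in the consensus; objects 2, 3 and 4 are joined because the pairs (2, 3) and (3, 4) each co-occur in two of the three partitions.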

    Clustering tendency in chemical classifications

    No full text