483 research outputs found

    A soft hierarchical algorithm for the clustering of multiple bioactive chemical compounds

    Most methods used to cluster chemical structures, such as Ward's, group average, k-means and Jarvis-Patrick, are known as hard or crisp because they partition a dataset into strictly disjoint subsets; they are therefore unsuitable for clustering chemical structures that exhibit more than one activity. Although fuzzy clustering algorithms such as fuzzy c-means provide an inherent mechanism for clustering overlapping structures (objects), this potential, which comes from their fuzzy membership functions, has not been exploited effectively. In this work a fuzzy hierarchical algorithm is developed that benefits not only from the fuzzy clustering process but also from the multiple membership values that fuzzy clustering produces. The algorithm divides every cluster whose size exceeds a pre-determined threshold into two sub-clusters based on the membership values of each structure. A structure is assigned to one cluster if its membership value for that cluster is very high, or to both clusters if its two membership values are very similar. The performance of the algorithm is evaluated on two benchmark datasets and on a large dataset of compound structures derived from the MDL MDDR database. The results show a significant improvement over a similar implementation of the hard c-means algorithm.
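
    The assignment rule in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the one-dimensional points, the fixed centres, the fuzzifier `m` and the `margin` that decides when two memberships count as "very similar" are all assumptions made for illustration. Only the two-cluster fuzzy c-means membership formula is standard.

```python
def two_cluster_memberships(d_a, d_b, m=2.0):
    """Standard fuzzy c-means memberships for two clusters,
    given the distances d_a and d_b to the two cluster centres."""
    if d_a == 0.0:
        return 1.0, 0.0
    if d_b == 0.0:
        return 0.0, 1.0
    exponent = 2.0 / (m - 1.0)
    u_a = 1.0 / (1.0 + (d_a / d_b) ** exponent)
    return u_a, 1.0 - u_a


def soft_split(points, centre_a, centre_b, margin=0.2, m=2.0):
    """Split 1-D points into two possibly overlapping sub-clusters.

    `margin` is a hypothetical parameter: when the two memberships
    differ by no more than `margin`, the point joins both sub-clusters.
    """
    cluster_a, cluster_b = [], []
    for p in points:
        u_a, u_b = two_cluster_memberships(abs(p - centre_a), abs(p - centre_b), m)
        if abs(u_a - u_b) <= margin:
            # Memberships are very similar: keep the point in both.
            cluster_a.append(p)
            cluster_b.append(p)
        elif u_a > u_b:
            cluster_a.append(p)
        else:
            cluster_b.append(p)
    return cluster_a, cluster_b


left, right = soft_split([0.0, 1.0, 4.8, 9.0, 10.0], centre_a=0.0, centre_b=10.0)
```

    In this toy run the point 4.8 lies almost midway between the centres, so it appears in both sub-clusters, while the remaining points go to the nearer centre only.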

    Clustering Single-cell RNA-sequencing Data based on Matching Clusters Structures

    Single-cell sequencing technology can generate RNA-sequencing data at the single-cell level, and one important task in single-cell RNA-sequencing data analysis is to identify cell types without supervised information. Clustering is an unsupervised approach that can yield new biological insights, especially when exploring the biological functions of specific cell types. However, it is challenging for traditional clustering methods to obtain high-quality cell type recognition results. In this research, we propose a novel Clustering method based on Matching Clusters Structures (MCSC) for identifying cell types in single-cell RNA-sequencing data. First, MCSC obtains two different groups of clustering results from the same K-means algorithm, which differ because the initial centroids are randomly selected. Then, for one group, MCSC uses shared nearest neighbour information to calculate a label transition matrix, which gives the label transition probability between any two initial clusters. Each initial cluster may be reassigned if the merging results after label transition satisfy a consensus function that maximizes the structural matching degree of the two groups of clustering results. In essence, MCSC may be interpreted as a label training process. We evaluate the proposed MCSC on five commonly used datasets and compare it with several classical and state-of-the-art algorithms. The experimental results show that MCSC outperforms the other algorithms.

    Communication-Efficient Jaccard Similarity for High-Performance Distributed Genome Comparisons

    The Jaccard similarity index is an important measure of the overlap of two sets, widely used in machine learning, computational genomics, information retrieval, and many other areas. We design and implement SimilarityAtScale, the first communication-efficient distributed algorithm for computing the Jaccard similarity among pairs of large datasets. Our algorithm provides an efficient encoding of this problem into a multiplication of sparse matrices. Both the encoding and the sparse matrix product are performed in a way that minimizes data movement in terms of communication and synchronization costs. We apply our algorithm to obtain the similarity among all pairs of a set of large samples of genomes. This task is a key part of modern metagenomics analysis and an ever-growing need due to the increasing availability of high-throughput DNA sequencing data. The resulting scheme is the first to enable accurate Jaccard distance derivations for massive datasets using large-scale distributed-memory systems. We package our routines in a tool, called GenomeAtScale, that combines the proposed algorithm with tools for processing input sequences. Our evaluation on real data illustrates that one can use GenomeAtScale to effectively employ tens of thousands of processors to reach new frontiers in large-scale genomic and metagenomic analysis. While GenomeAtScale can be used to foster DNA research, the more general underlying SimilarityAtScale algorithm may be used for high-performance distributed similarity computations in other data analytics application domains.
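
    The idea of encoding all-pairs Jaccard similarity as a sparse matrix product can be sketched on a single machine. This is an illustrative assumption, not the SimilarityAtScale algorithm or its API: the function name `all_pairs_jaccard` is hypothetical, and nothing here is distributed. Each sample is treated as a row of a 0/1 indicator matrix A; the intersection sizes are the entries of A·Aᵀ, and the union follows from |X ∪ Y| = |X| + |Y| − |X ∩ Y|.

```python
from collections import defaultdict
from itertools import combinations


def all_pairs_jaccard(samples):
    """Return {(i, j): Jaccard(samples[i], samples[j])} for all i < j."""
    # Sparse indicator matrix stored by column:
    # element -> indices of the samples that contain it (in increasing order).
    columns = defaultdict(list)
    for idx, sample in enumerate(samples):
        for element in set(sample):
            columns[element].append(idx)
    sizes = [len(set(sample)) for sample in samples]

    # Accumulate A * A^T one column at a time; only pairs of samples
    # that share an element are ever touched.
    intersections = defaultdict(int)
    for members in columns.values():
        for i, j in combinations(members, 2):
            intersections[(i, j)] += 1

    result = {}
    for i, j in combinations(range(len(samples)), 2):
        inter = intersections.get((i, j), 0)
        union = sizes[i] + sizes[j] - inter
        # Two empty sets: similarity 1.0 is a convention assumed here.
        result[(i, j)] = inter / union if union else 1.0
    return result


# Toy "genomes" represented as sets of 3-mers (made-up data).
kmers = [{"ACG", "CGT", "GTA"}, {"CGT", "GTA", "TAC"}, {"AAA"}]
similarities = all_pairs_jaccard(kmers)
```

    The first two samples share two of four distinct 3-mers, so their similarity is 0.5; the third shares nothing with either.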

    The Application of Spectral Clustering in Drug Discovery

    The application of clustering algorithms to chemical datasets is well established and has been reviewed extensively. Recently, a number of ‘modern’ clustering algorithms have been reported in other fields. One example is spectral clustering, which has yielded promising results in areas such as protein library analysis. The term spectral clustering describes any clustering algorithm that uses the eigenpairs of a matrix as the basis for partitioning a dataset. This thesis describes the development and optimisation of a non-overlapping spectral clustering method based upon a study by Brewer. The initial version of the spectral clustering algorithm was closely related to Brewer’s method and used a full matrix diagonalisation procedure to identify the eigenpairs of an input matrix. This method was compared with the k-means and Ward’s algorithms and produced encouraging results; for example, when coupled with extended connectivity fingerprints, it outperformed the other clustering algorithms according to the QCI measure. Although the spectral clustering algorithm showed promising results, its operational costs restricted its application to small datasets. Hence, the method was optimised in successive studies. Firstly, the effect of matrix sparsity on spectral clustering was examined, showing that sparse input matrices can improve the results. Despite this improvement, the costs of spectral clustering remained prohibitive, so the full matrix diagonalisation procedure was replaced with the Lanczos algorithm, which has lower associated costs, as suggested by Brewer. This led to a significant decrease in computational costs when identifying a small number of clusters; however, a number of issues remained, leading to the adoption of an SVD-based eigendecomposition method. The SVD-based algorithm was shown to be highly efficient, accurate and scalable through a number of studies.
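
    To make "partitioning a dataset from the eigenpairs of a matrix" concrete, here is a minimal two-way sketch. It is not Brewer's method, nor the Lanczos or SVD variants described above: the choice of the unnormalised graph Laplacian, the shifted power iteration used to approximate its second-smallest eigenvector (the Fiedler vector), and the name `fiedler_bipartition` are all assumptions made for illustration.

```python
def fiedler_bipartition(similarity, iterations=200):
    """Split items into two groups by the sign of an approximate Fiedler vector.

    `similarity` is a symmetric matrix (list of lists) of non-negative
    pairwise similarities; diagonal entries are ignored.
    """
    n = len(similarity)
    degrees = [sum(row) - row[i] for i, row in enumerate(similarity)]
    # Unnormalised Laplacian L = D - W.
    laplacian = [
        [degrees[i] if i == j else -similarity[i][j] for j in range(n)]
        for i in range(n)
    ]
    # 2 * max degree bounds the largest eigenvalue of L (Gershgorin),
    # so shift*I - L has non-negative eigenvalues.
    shift = 2.0 * max(degrees)
    vector = [float(i) for i in range(n)]
    for _ in range(iterations):
        # Remove the constant component (the eigenvector for eigenvalue 0).
        mean = sum(vector) / n
        vector = [v - mean for v in vector]
        # One power-iteration step with shift*I - L.
        vector = [
            shift * vector[i] - sum(laplacian[i][j] * vector[j] for j in range(n))
            for i in range(n)
        ]
        norm = sum(v * v for v in vector) ** 0.5
        if norm == 0.0:
            break  # degenerate input; no usable direction
        vector = [v / norm for v in vector]
    return [0 if v < 0 else 1 for v in vector]


# Two tight groups of three items with weak links between them (made-up data).
W = [[1.0 if i == j else (0.9 if i // 3 == j // 3 else 0.05) for j in range(6)]
     for i in range(6)]
labels = fiedler_bipartition(W)
```

    Which group receives label 0 depends on the sign of the eigenvector, so only the grouping itself is meaningful. A dense power iteration like this costs O(n²) per step, which hints at why the thesis moved to sparse matrices and cheaper eigensolvers for larger datasets.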

    ADAPTIVE SEARCH AND THE PRELIMINARY DESIGN OF GAS TURBINE BLADE COOLING SYSTEMS

    This research concerns the integration of Adaptive Search (AS) techniques such as Genetic Algorithms (GA) with knowledge-based software to develop a research prototype of an Adaptive Search Manager (ASM). The approach makes it possible to use both quantitative and qualitative information in engineering design decision making. A Fuzzy Expert System manipulates AS software within the design environment for the preliminary design of gas turbine blade cooling systems. Steady-state cooling hole geometry models have been developed for the project in collaboration with Rolls Royce plc. The research prototype of ASM uses a hybrid of Adaptive Restricted Tournament Selection (ARTS) and Knowledge Based Hill Climbing (KBHC) to identify multiple "good" design solutions as potential design options. ARTS is a GA technique that is particularly suitable for real-world problems with multiple sub-optima. KBHC uses information gathered during the ARTS search, as well as information from the designer, to perform deterministic hill climbing. Finally, a local stochastic hill climbing step fine-tunes the "good" designs. Design solution sensitivity, design variable sensitivities and constraint sensitivities are calculated following Taguchi's methodology, which extracts sensitivity information with a very small number of model evaluations. Each potential design option is then qualitatively evaluated separately for manufacturability, choice of materials and some of the designer's special preferences, using the knowledge of domain experts. To guarantee that the qualitative evaluation module can evaluate any design solution from the entire design space with a reasonably small number of rules, a novel knowledge representation technique is developed. The knowledge is first separated into three categories: inter-variable knowledge, intra-variable knowledge and heuristics. Inter-variable knowledge and intra-variable knowledge are then integrated using a concept of compromise. Information about the "good" design solutions is presented to the designer through a designer's interface for decision support. Rolls Royce plc., Bristol (UK)

    Molecular classification of the components of Eucalyptus camaldulensis and Mentha pulegium oils (Clasificación molecular de los componentes de los aceites de Eucalyptus camaldulensis y de Mentha pulegium)

    Eucalyptus and Mentha are still used as flavourings in agro-food manufacturing. Their oils and components show antifungal activity against fruit decay; the oil of E. camaldulensis is dominated by 1,8-cineole and α-pinene, whereas that of M. pulegium is dominated by pulegone. The antifungal activity of M. pulegium is three times that of E. camaldulensis. The phytochemicals show synergy. Categorization on the basis of information entropy is recommended. The numbers of C-C double bonds, O atoms and rings group the structures. The procedure suffers from a combinatorial explosion. Nevertheless, following the equipartition conjecture, one obtains a criterion for selection. Entropy allows the phytochemicals to be clustered in agreement with cluster and principal component analyses. In the periodic classification, components in the same column show similar features. Phytochemicals in the same row also show maximum similarity. Ciencias Experimentale

    Correcting Knowledge Base Assertions

    The usefulness and usability of knowledge bases (KBs) are often limited by quality issues. One common issue is the presence of erroneous assertions, often caused by lexical or semantic confusion. We study the problem of correcting such assertions, and present a general correction framework which combines lexical matching, semantic embedding, soft constraint mining and semantic consistency checking. The framework is evaluated using DBpedia and an enterprise medical KB.

    Clustering for 2D chemical structures

    The clustering of chemical structures is important and widely used in several areas of chemoinformatics. A little-discussed aspect of clustering is standardization, which ensures that all descriptors in a chemical representation make a comparable contribution to the measurement of similarity. The first study compares the effectiveness of seven standardization procedures that have been suggested previously; the results were also compared with those for unstandardized datasets. No single standardization method consistently offered the best performance. Comparative studies of clustering effectiveness help to establish the suitability of different methods and to provide guidelines for their use. To examine the suitability of different clustering methods for chemoinformatics, especially those that had not previously been applied in the field, the second study compares the effectiveness of nine clustering methods. The results show that a single clustering method is unlikely to provide the best partition consistently under all circumstances. Consensus clustering is a technique that combines multiple input partitions of the same set of objects into a single clustering, which is expected to be a more robust and more generally effective representation of the partitions submitted. The third study reports the use of seven consensus clustering methods that had not previously been used on sets of chemical compounds represented by 2D fingerprints. Their effectiveness was compared with that of some of the traditional clustering methods discussed in the second study. No consensus clustering method was found to be consistently the best.
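
    The definition of consensus clustering given above can be illustrated with a small sketch. It uses a co-association (evidence accumulation) scheme, which is one common way to combine partitions and is assumed here for illustration; the abstract does not say whether it is among the seven methods studied. The function name `consensus_partition` and the `threshold` parameter are hypothetical.

```python
def consensus_partition(partitions, threshold=0.5):
    """Combine several label lists over the same objects into one partition.

    Two objects end up in the same consensus cluster when they are linked,
    directly or through a chain, by pairs that share a cluster in more
    than `threshold` of the input partitions.
    """
    n = len(partitions[0])
    parent = list(range(n))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i in range(n):
        for j in range(i + 1, n):
            # Co-association: fraction of partitions that group i with j.
            agree = sum(1 for labels in partitions if labels[i] == labels[j])
            if agree / len(partitions) > threshold:
                parent[find(j)] = find(i)

    # Relabel components as 0, 1, 2, ... in order of first appearance.
    roots = {}
    return [roots.setdefault(find(i), len(roots)) for i in range(n)]


# Three input partitions of five objects (made-up labels).
inputs = [
    [0, 0, 1, 1, 2],
    [1, 1, 0, 0, 0],
    [0, 0, 0, 1, 1],
]
consensus = consensus_partition(inputs)
```

    Label values need not agree across the input partitions: only whether two objects share a label within each partition matters, which is why the second input can number its clusters differently from the first.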