68 research outputs found

    Fast Spectral Clustering Using Autoencoders and Landmarks

    In this paper, we introduce an algorithm for performing spectral clustering efficiently. Spectral clustering is a powerful clustering algorithm that suffers from high computational complexity due to eigendecomposition. In this work, we first build the adjacency matrix of the graph corresponding to the dataset. To build this matrix, we consider only a limited number of points, called landmarks, and compute the similarity of all data points with the landmarks. Then, we present a definition of the Laplacian matrix of the graph that enables us to perform eigendecomposition efficiently, using a deep autoencoder. The overall complexity of the algorithm for eigendecomposition is O(np), where n is the number of data points and p is the number of landmarks. Finally, we evaluate the performance of the algorithm in different experiments. Comment: 8 Pages- Accepted in 14th International Conference on Image Analysis and Recognitio
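
    The landmark idea in this abstract can be illustrated with a minimal sketch. It is not the paper's method: the autoencoder is swapped for a thin SVD (which costs O(np^2) rather than the O(np) quoted above), and the Gaussian similarity, the normalization, and the names `landmark_spectral_embedding` and `sigma` are assumptions made for illustration only.

```python
import numpy as np

def landmark_spectral_embedding(X, landmarks, n_components, sigma=1.0):
    """Embed n points using similarities to p landmarks only (hypothetical helper).

    X: (n, d) data, landmarks: (p, d) landmark points, n_components <= p.
    """
    # n x p Gaussian similarities between every point and every landmark
    # (the choice of a Gaussian kernel is an assumption, not taken from the abstract).
    sq_dists = ((X[:, None, :] - landmarks[None, :, :]) ** 2).sum(axis=2)
    Z = np.exp(-sq_dists / (2.0 * sigma ** 2))
    # Normalize each row to sum to 1, then scale columns by 1/sqrt(column sum),
    # so that Z @ Z.T acts as a normalized n x n affinity that is never formed.
    Z = Z / Z.sum(axis=1, keepdims=True)
    Z = Z / np.sqrt(Z.sum(axis=0, keepdims=True))
    # Thin SVD of the n x p matrix stands in for the autoencoder step.
    U, _, _ = np.linalg.svd(Z, full_matrices=False)
    return U[:, :n_components]

# Usage: two small, well-separated groups of points, with every fifth point as a landmark.
offsets = np.linspace(-0.5, 0.5, 10)
group = np.stack([offsets, -offsets], axis=1)
points = np.vstack([group, group + 5.0])
embedding = landmark_spectral_embedding(points, points[::5], n_components=2)
```

    Rows of `embedding` can then be passed to any standard clustering routine such as k-means; points from the same group end up with nearly identical rows.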

    Integration of curated databases to identify genotype-phenotype associations

    BACKGROUND: The ability to rapidly characterize an unknown microorganism is critical both in responding to infectious disease and in biodefense. To do this, we need some way of anticipating an organism's phenotype based on the molecules encoded by its genome. However, the link between molecular composition (i.e. genotype) and phenotype for microbes is not obvious. While several studies have addressed this challenge, none has yet proposed a large-scale method integrating curated biological information. Here we use a systematic approach to discover genotype-phenotype associations that combines phenotypic information from a biomedical informatics database, GIDEON, with the molecular information contained in the National Center for Biotechnology Information's Clusters of Orthologous Groups database (NCBI COGs). RESULTS: Integrating the information in the two databases, we are able to correlate the presence or absence of a given protein in a microbe with its phenotype as measured by certain morphological characteristics or survival in a particular growth medium. With a 0.8 correlation score threshold, 66% of the associations found were confirmed by the literature, and at a 0.9 correlation threshold, 86% were positively verified. CONCLUSION: Our results suggest possible phenotypic manifestations for proteins biochemically associated with sugar metabolism and electron transport. Moreover, we believe our approach can be extended to linking pathogenic phenotypes with functionally related proteins.
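
    As a rough illustration of the thresholding step described in this abstract, the sketch below correlates binary protein presence with binary phenotype labels and keeps pairs above a cutoff. The abstract does not say which correlation score was used; Pearson correlation on 0/1 vectors, the use of the absolute value, and the name `find_associations` are assumptions made here.

```python
import numpy as np

def find_associations(presence, phenotype, threshold=0.8):
    """Return the correlation matrix and (protein, trait) pairs at or above threshold.

    presence:  (organisms, proteins) 0/1 matrix, 1 = protein family present.
    phenotype: (organisms, traits)   0/1 matrix, 1 = trait observed.
    """
    # Center each column, then compute Pearson correlations for all pairs at once.
    P = presence - presence.mean(axis=0)
    T = phenotype - phenotype.mean(axis=0)
    denom = np.outer(np.linalg.norm(P, axis=0), np.linalg.norm(T, axis=0))
    with np.errstate(divide="ignore", invalid="ignore"):
        # Columns with no variation get correlation 0 instead of NaN.
        corr = np.where(denom > 0, (P.T @ T) / denom, 0.0)
    # Assumption: strong negative correlation (absence tracks the trait) also counts.
    hits = [(int(i), int(j)) for i, j in np.argwhere(np.abs(corr) >= threshold)]
    return corr, hits

# Toy data: 6 organisms, 2 protein families, 1 trait (all values invented).
presence = np.array([[1, 1], [1, 0], [1, 1], [0, 0], [0, 1], [0, 0]])
phenotype = np.array([[1], [1], [1], [0], [0], [0]])
corr, hits = find_associations(presence, phenotype, threshold=0.8)
```

    In this toy input the first protein family matches the trait exactly, so it is the only pair reported; the second correlates only weakly and falls below the cutoff.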

    Clustering of Pseudomonas aeruginosa transcriptomes from planktonic cultures, developing and mature biofilms reveals distinct expression profiles

    BACKGROUND: Pseudomonas aeruginosa is a genetically complex bacterium which can adopt and switch between a free-living or biofilm lifestyle, a versatility that enables it to thrive in many different environments and contributes to its success as a human pathogen. RESULTS: Transcriptomes derived from growth states relevant to the lifestyle of P. aeruginosa were clustered using three different methods (K-means, K-means spectral and hierarchical clustering). The culture conditions used for this study were: biofilms incubated for 8, 14, 24 and 48 hrs, and planktonic culture (logarithmic and stationary phase). This cluster analysis revealed the existence of, and clearly illustrated, distinct expression profiles present in the dataset. Moreover, it gave an insight into which genes are up-regulated in planktonic, developing biofilm and confluent biofilm states. In addition, this analysis confirmed the contribution of quorum sensing (QS) and RpoS regulated genes to the biofilm mode of growth, and enabled the identification of a 60.69 Kbp region of the genome associated with stationary phase growth (stationary phase planktonic culture and confluent biofilms). CONCLUSION: This is the first study to use clustering to separate a large P. aeruginosa microarray dataset, consisting of transcriptomes obtained from diverse conditions relevant to its growth, into different expression profiles. These distinct expression profiles not only reveal novel aspects of P. aeruginosa gene expression but also provide a growth-specific transcriptomic reference dataset for the research community.

    Network modeling of patients' biomolecular profiles for clinical phenotype/outcome prediction

    Methods for phenotype and outcome prediction are largely based on inductive supervised models that use selected biomarkers to make predictions, without explicitly considering the functional relationships between individuals. We introduce a novel network-based approach named Patient-Net (P-Net) in which biomolecular profiles of patients are modeled in a graph-structured space that represents gene expression relationships between patients. Then a kernel-based semi-supervised transductive algorithm is applied to the graph to explore the overall topology of the graph and to predict the phenotype/clinical outcome of patients. Experimental tests involving several publicly available datasets of patients afflicted with pancreatic, breast, colon and colorectal cancer show that our proposed method is competitive with state-of-the-art supervised and semi-supervised predictive systems. Importantly, P-Net also provides interpretable models that can be easily visualized to gain clues about the relationships between patients, and to formulate hypotheses about their stratification

    jClust: a clustering and visualization toolbox

    jClust is a user-friendly application which provides access to a set of widely used clustering and clique finding algorithms. The toolbox allows a range of filtering procedures to be applied and is combined with an advanced implementation of the Medusa interactive visualization module. The implemented algorithms are k-Means, Affinity propagation, Bron–Kerbosch, MULIC, Restricted neighborhood search cluster algorithm, Markov clustering and Spectral clustering, while the supported filtering procedures are haircut, outside–inside, best neighbors and density control operations. The combination of a simple input file format and a set of clustering and filtering algorithms linked with the visualization tool provides a powerful tool for data analysis and information extraction.

    SCPS: a fast implementation of a spectral method for detecting protein families on a genome-wide scale

    Background: An important problem in genomics is the automatic inference of groups of homologous proteins from pairwise sequence similarities. Several approaches have been proposed for this task which are "local" in the sense that they assign a protein to a cluster based only on the distances between that protein and the other proteins in the set. It was shown recently that global methods such as spectral clustering have better performance on a wide variety of datasets. However, currently available implementations of spectral clustering methods mostly consist of a few loosely coupled Matlab scripts that assume a fair amount of familiarity with Matlab programming and hence they are inaccessible for large parts of the research community. Results: SCPS (Spectral Clustering of Protein Sequences) is an efficient and user-friendly implementation of a spectral method for inferring protein families. The method uses only pairwise sequence similarities, and is therefore practical when only sequence information is available. SCPS was tested on difficult sets of proteins whose relationships were extracted from the SCOP database, and its results were extensively compared with those obtained using other popular protein clustering algorithms such as TribeMCL, hierarchical clustering and connected component analysis. We show that SCPS is able to identify many of the family/superfamily relationships correctly and that the quality of the obtained clusters as indicated by their F-scores is consistently better than all the other methods we compared it with. We also demonstrate the scalability of SCPS by clustering the entire SCOP database (14,183 sequences) and the complete genome of the yeast Saccharomyces cerevisiae (6,690 sequences). Conclusions: Besides the spectral method, SCPS also implements connected component analysis and hierarchical clustering, it integrates TribeMCL, it provides different cluster quality tools, it can extract human-readable protein descriptions using GI numbers from NCBI, it interfaces with external tools such as BLAST and Cytoscape, and it can produce publication-quality graphical representations of the clusters obtained, thus constituting a comprehensive and effective tool for practical research in computational biology. Source code and precompiled executables for Windows, Linux and Mac OS X are freely available at http://www.paccanarolab.org/software/scps.

    Mixture of experts models to exploit global sequence similarity on biomolecular sequence labeling

    Background: Identification of functionally important sites in biomolecular sequences has broad applications ranging from rational drug design to the analysis of metabolic and signal transduction networks. Experimental determination of such sites lags far behind the number of known biomolecular sequences. Hence, there is a need to develop reliable computational methods for identifying functionally important sites from biomolecular sequences. Results: We present a mixture of experts approach to biomolecular sequence labeling that takes into account the global similarity between biomolecular sequences. Our approach combines unsupervised and supervised learning techniques. Given a set of sequences and a similarity measure defined on pairs of sequences, we learn a mixture of experts model by using spectral clustering to learn the hierarchical structure of the model and by using Bayesian techniques to combine the predictions of the experts. We evaluate our approach on two biomolecular sequence labeling problems: RNA-protein and DNA-protein interface prediction problems. The results of our experiments show that global sequence similarity can be exploited to improve the performance of classifiers trained to label biomolecular sequence data. Conclusion: The mixture of experts model helps improve the performance of machine learning methods for identifying functionally important sites in biomolecular sequences. This is a proceeding from IEEE International Conference on Bioinformatics and Biomedicine (BIBM) 10 (2009): S4, doi: 10.1186/1471-2105-10-S4-S4. Posted with permission.

    A method for comparing multiple imputation techniques: A case study on the U.S. national COVID cohort collaborative.

    Healthcare datasets obtained from Electronic Health Records have proven to be extremely useful for assessing associations between patients’ predictors and outcomes of interest. However, these datasets often suffer from missing values in a high proportion of cases, whose removal may introduce severe bias. Several multiple imputation algorithms have been proposed to attempt to recover the missing information under an assumed missingness mechanism. Each algorithm presents strengths and weaknesses, and there is currently no consensus on which multiple imputation algorithm works best in a given scenario. Furthermore, the selection of each algorithm’s parameters and data-related modeling choices are also both crucial and challenging.

    Reliable transfer of transcriptional gene regulatory networks between taxonomically related organisms

    Baumbach J, Rahmann S, Tauch A. Reliable transfer of transcriptional gene regulatory networks between taxonomically related organisms. BMC Systems Biology. 2009;3(1):8. Background: Transcriptional regulation of gene activity is essential for any living organism. Transcription factors therefore recognize specific binding sites within the DNA to regulate the expression of particular target genes. The genome-scale reconstruction of the emerging regulatory networks is important for biotechnology and human medicine but cost-intensive, time-consuming, and impossible to perform for any species separately. By using bioinformatics methods one can partially transfer networks from well-studied model organisms to closely related species. However, the prediction quality is limited by the low level of evolutionary conservation of the transcription factor binding sites, even within organisms of the same genus. Results: Here we present an integrated bioinformatics workflow that assures the reliability of transferred gene regulatory networks. Our approach combines three methods that can be applied on a large scale: re-assessment of annotated binding sites, subsequent binding site prediction, and homology detection. A gene regulatory interaction is considered to be conserved if (1) the transcription factor, (2) the adjusted binding site, and (3) the target gene are conserved. The power of the approach is demonstrated by transferring gene regulations from the model organism Corynebacterium glutamicum to the human pathogens C. diphtheriae, C. jeikeium, and the biotechnologically relevant C. efficiens. For these three organisms we identified reliable transcriptional regulations for approximately 40% of the common transcription factors, compared to approximately 5% for which knowledge was available before. Conclusion: Our results suggest that trustworthy genome-scale transfer of gene regulatory networks between organisms is feasible in general but still limited by the level of evolutionary conservation.
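
    The three-part conservation rule stated in this abstract can be written down as a simple filter. The sketch below is only an illustration under stated assumptions: the `Regulation` record, the function names, and all gene and site identifiers are hypothetical, and the conserved sets are taken as given rather than computed by homology detection or binding site prediction.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Regulation:
    tf: str            # transcription factor gene in the source organism
    binding_site: str  # identifier of the re-assessed binding site
    target: str        # regulated target gene in the source organism

def is_conserved(reg, conserved_tfs, conserved_sites, conserved_targets):
    """Apply the rule: (1) TF, (2) binding site, and (3) target must all be conserved."""
    return (reg.tf in conserved_tfs
            and reg.binding_site in conserved_sites
            and reg.target in conserved_targets)

def transfer(regulations, conserved_tfs, conserved_sites, conserved_targets):
    """Keep only the interactions that satisfy all three conditions."""
    return [r for r in regulations
            if is_conserved(r, conserved_tfs, conserved_sites, conserved_targets)]

# Hypothetical identifiers, invented for this example.
known = [
    Regulation("tfA", "site1", "geneX"),
    Regulation("tfA", "site2", "geneY"),  # binding site not found in the target organism
    Regulation("tfB", "site3", "geneZ"),  # transcription factor has no homolog
]
kept = transfer(known,
                conserved_tfs={"tfA"},
                conserved_sites={"site1", "site3"},
                conserved_targets={"geneX", "geneY", "geneZ"})
```

    Only the first interaction survives here, because each of the other two fails one of the three conditions.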