
    A systematic comparison of genome-scale clustering algorithms

    Background: A wealth of clustering algorithms has been applied to gene co-expression experiments. These algorithms cover a broad range of approaches, from conventional techniques such as k-means and hierarchical clustering, to graphical approaches such as k-clique communities, weighted gene co-expression networks (WGCNA) and paraclique. Comparing these methods to evaluate their relative effectiveness provides guidance for algorithm selection, development and implementation. Most prior work on comparative clustering evaluation has focused on parametric methods. Graph theoretical methods are recent additions to the tool set for the global analysis and decomposition of microarray co-expression matrices, and they have not generally been included in earlier methodological comparisons. In the present study, a variety of parametric and graph theoretical clustering algorithms are compared using well-characterized transcriptomic data at a genome scale from Saccharomyces cerevisiae. Methods: For each clustering method under study, a variety of parameters were tested. Jaccard similarity was used to measure each cluster's agreement with every GO and KEGG annotation set, and the highest Jaccard score was assigned to the cluster. Clusters were grouped into small, medium, and large bins, and the Jaccard scores of the top five scoring clusters in each bin were averaged and reported as the best average top 5 (BAT5) score for the particular method. Results: Clusters produced by each method were evaluated based upon the positive match to known pathways. This produces a readily interpretable ranking of the relative effectiveness of clustering on the genes. Methods were also tested to determine whether they were able to identify clusters consistent with those identified by other clustering methods. Conclusions: Validation of clusters against known gene classifications demonstrates that, for these data, graph-based techniques outperform conventional clustering approaches, suggesting that further development and application of combinatorial strategies is warranted.
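    The scoring scheme described under Methods can be sketched roughly as follows. This is an illustrative reading of the abstract, not the authors' code: the function names and the bin boundaries (`small_max`, `medium_max`) are assumptions, and the sketch scores a single clustering run, whereas the reported BAT5 presumably takes the best such average over the parameter settings tried.

```python
from statistics import mean


def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two gene sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def best_annotation_score(cluster, annotation_sets):
    """Highest Jaccard score of a cluster against any annotation set."""
    return max((jaccard(cluster, ann) for ann in annotation_sets), default=0.0)


def size_bin(n, small_max=10, medium_max=100):
    """Assign a cluster size to a bin; the boundaries here are hypothetical."""
    if n <= small_max:
        return "small"
    if n <= medium_max:
        return "medium"
    return "large"


def top_average_by_bin(clusters, annotation_sets, top_n=5):
    """Average the top_n best-annotation scores within each size bin."""
    scores = {}
    for cluster in clusters:
        score = best_annotation_score(cluster, annotation_sets)
        scores.setdefault(size_bin(len(cluster)), []).append(score)
    return {
        bin_name: mean(sorted(values, reverse=True)[:top_n])
        for bin_name, values in scores.items()
    }
```

    Calling `top_average_by_bin` once per parameter setting and keeping the maximum per bin would give one plausible interpretation of the "best" in BAT5.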

    Cancer stem cell gene profile as predictor of relapse in high risk stage II and stage III, radically resected colon cancer patients.

    Clinical data indicate that prognostic stratification of radically resected colorectal cancer based on disease stage alone may not always be adequate. Preclinical findings suggest that cancer stem cells may influence the biological behaviour of colorectal cancer independently of stage. The objective of this study was to assess whether a panel of stemness markers was correlated with clinical outcome in resected stage II and III colon cancer patients. A panel of 66 markers of stemness was analysed, and a statistical algorithm was used to divide patients into two groups (A and B), with most patients clustering in a manner consistent with different times to relapse. A total of 62 patients were analysed. Thirty-six (58%) relapsed during the follow-up period (range 1.63-86.5 months). Twelve (19%) and 50 (81%) patients were allocated to groups A and B, respectively. A significantly different median relapse-free survival was observed between the two groups (22.18 vs 42.85 months, p=0.0296). Among all genes tested, those with the highest "weight" in determining different prognosis were CD44, ALCAM, DTX2, HSPA9, CCNA2, PDX1, MYST1, COL1A1 and ABCG2. This analysis supports the idea that biological variables other than stage, such as expression levels of colon cancer stem cell genes, may be relevant in determining an increased risk of relapse in resected colorectal cancer patients.

    A novel approach identifies the first transcriptome networks in bats: a new genetic model for vocal communication

    Background: Bats are able to employ an astonishingly complex vocal repertoire for navigating their environment and conveying social information. A handful of species also show evidence for vocal learning, an extremely rare ability shared only with humans and a few other animals. However, despite their potential for the study of vocal communication, bats remain severely understudied at a molecular level. To address this fundamental gap, we performed the first transcriptome profiling and genetic interrogation of molecular networks in the brain of a highly vocal bat species, Phyllostomus discolor. Results: Gene network analysis typically needs large sample sizes for correct clustering, which can be prohibitive where samples are limited, as in this study. To overcome this, we developed a novel bioinformatics methodology for identifying robust co-expression gene networks using few samples (N=6). Using this approach, we identified tissue-specific functional gene networks from the bat PAG, a brain region fundamental for mammalian vocalisation. The most highly connected network identified represented a cluster of genes involved in glutamatergic synaptic transmission. Glutamatergic receptors play a significant role in vocalisation from the PAG, suggesting that this gene network may be mechanistically important for vocal-motor control in mammals. Conclusion: We have developed an innovative approach to clustering co-expressed gene networks and show that it is highly effective in detecting robust functional gene networks with limited sample sizes. Moreover, this work represents the first gene network analysis performed in a bat brain and establishes bats as a novel, tractable model system for understanding the genetics of mammalian vocal communication.

    Advances in Big Data Analytics: Algorithmic Stability and Data Cleansing

    Analysis of what has come to be called “big data” presents a number of challenges as data continues to grow in size, complexity and heterogeneity. To help address these challenges, we study a pair of foundational issues in algorithmic stability (robustness and tuning), with application to clustering in high-throughput computational biology, and an issue in data cleansing (outlier detection), with application to pre-processing in streaming meteorological measurement. These issues highlight major ongoing research aspects of modern big data analytics. First, a new metric, robustness, is proposed in the setting of biological data clustering to measure an algorithm’s tendency to maintain output coherence over a range of parameter settings. It is well known that different algorithms tend to produce different clusters, and that the choice of algorithm is often driven by factors such as data size and type, similarity measure(s) employed, and the sort of clusters desired. Even within the context of a single algorithm, clusters often vary drastically depending on parameter settings. Empirical comparisons performed over a variety of algorithms and settings show highly differential performance on transcriptomic data and demonstrate that many popular methods actually perform poorly. Second, tuning strategies are studied for maximizing biological fidelity when using the well-known paraclique algorithm. Three initialization strategies are compared, using ontological enrichment as a proxy for cluster quality. Although extant paraclique codes begin by simply employing the first maximum clique found, results indicate that by generating all maximum cliques and then choosing one of highest average edge weight, one can produce a small but statistically significant expected improvement in overall cluster quality. Third, a novel outlier detection method is described that helps cleanse data by combining Pearson correlation coefficients, K-means clustering, and Singular Spectrum Analysis in a coherent framework that detects instrument failures and extreme weather events in Atmospheric Radiation Measurement sensor data. The framework is tested and found to produce more accurate results than do traditional approaches that rely on a hand-annotated database.
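    The initialization strategy favoured in the second part (generate all maximum cliques, then pick one of highest average edge weight) might look like the sketch below. It is a small illustration under assumptions, not the dissertation's implementation: the graph is given as a hypothetical adjacency dictionary `adj`, edge weights live in a dictionary `weight` keyed by `frozenset` pairs, and cliques are enumerated with a plain Bron-Kerbosch recursion that is only practical for small graphs.

```python
from itertools import combinations


def maximal_cliques(adj):
    """Enumerate maximal cliques with Bron-Kerbosch (no pivoting)."""
    def expand(r, p, x):
        if not p and not x:
            yield frozenset(r)
            return
        for v in list(p):
            yield from expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}  # v has been fully explored
            x = x | {v}  # exclude v from later cliques
    yield from expand(set(), set(adj), set())


def average_edge_weight(clique, weight):
    """Mean weight over all vertex pairs of a clique (0.0 for a single vertex)."""
    pairs = list(combinations(sorted(clique), 2))
    if not pairs:
        return 0.0
    return sum(weight[frozenset(pair)] for pair in pairs) / len(pairs)


def heaviest_maximum_clique(adj, weight):
    """Among all cliques of maximum size, return one with highest average weight."""
    cliques = list(maximal_cliques(adj))
    largest = max(len(c) for c in cliques)
    maximum = [c for c in cliques if len(c) == largest]
    return max(maximum, key=lambda c: average_edge_weight(c, weight))
```

    The clique returned here would serve as the seed that a paraclique procedure then grows; the growth step itself is not shown.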

    Analytical, Theoretical and Empirical Advances in Genome-Scale Algorithmics

    Ever-increasing amounts of complex biological data continue to come online daily. Examples include proteomic, transcriptomic, genomic and metabolomic data generated by a plethora of high-throughput methods. Accordingly, fast and effective data processing techniques are increasingly in demand. This dissertation addresses that need by investigating algorithmic alternatives and enhancements to routine and traditional procedures in common use. In the analysis of gene co-expression data, for example, differential measures of entropy and variation are studied as augmentations to mere differential expression. These novel metrics are shown to help elucidate disease-related genes in wide assortments of case/control data. In a more theoretical spirit, limits on the worst-case behavior of density-based clustering methods are studied. It is proved, for instance, that the well-known paraclique algorithm, under proper tuning, can be guaranteed never to produce subgraphs with density less than 2/3. Transformational approaches to efficient algorithm design are also considered. Classic graph search problems are mapped to and from well-studied versions of satisfiability and integer linear programming. In so doing, regions of the input space are classified for which such transforms are effective alternatives to direct graph optimizations. In all these efforts, practical implementations are emphasized in order to advance the boundary of effective computation.

    Graph-Theoretical Tools for the Analysis of Complex Networks

    We are currently experiencing an explosive growth in data collection technology that threatens to dwarf the commensurate gains in computational power predicted by Moore’s Law. At the same time, researchers across numerous domain sciences are finding success using network models to represent their data. Graph algorithms are then applied to study the topological structure and tease out latent relationships between variables. Unfortunately, the problems of interest, such as finding dense subgraphs, are often the most difficult to solve from a computational point of view. Together, these issues motivate the need for novel algorithmic techniques in the study of graphs derived from large, complex data sources. This dissertation describes the development and application of graph theoretic tools for the study of complex networks. Algorithms are presented that leverage efficient, exact solutions to difficult combinatorial problems for epigenetic biomarker detection and disease subtyping based on gene expression signatures. Extensive testing on publicly available data is presented supporting the efficacy of these approaches. To address efficient algorithm design, the two core tenets of fixed-parameter tractability (branching and kernelization) are studied in the context of a parallel implementation of vertex cover. Results of testing on a wide variety of graphs derived from both real and synthetic data are presented. It is shown that the relative success of kernelization versus branching depends largely on the degree distribution of the graph. Throughout, an emphasis is placed upon the practicality of resulting implementations to advance the limits of effective computation.
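    To make the two tenets concrete, here is a minimal sequential sketch of a fixed-parameter vertex cover decision procedure. It is a textbook-style illustration, not the parallel implementation the dissertation describes: kernelization is limited to the high-degree (Buss) rule, branching is the basic two-way edge branch, and the function name and edge-list input format are assumptions. The graph is assumed to be simple (no self-loops).

```python
def has_vertex_cover(edges, k):
    """Return True if the graph has a vertex cover of size at most k."""
    edges = {frozenset(e) for e in edges}

    # Kernelization: any vertex with more than k incident edges must be
    # in every cover of size <= k, so take it and lower the budget.
    while True:
        if k < 0:
            return False
        degree = {}
        for e in edges:
            for v in e:
                degree[v] = degree.get(v, 0) + 1
        high = next((v for v, d in degree.items() if d > k), None)
        if high is None:
            break
        edges = {e for e in edges if high not in e}
        k -= 1

    if not edges:
        return True
    # After the rule above, every vertex covers at most k edges, so
    # k vertices can cover at most k * k edges.
    if len(edges) > k * k:
        return False

    # Branching: for an arbitrary edge {u, v}, one endpoint is in the cover.
    u, v = tuple(next(iter(edges)))
    return (
        has_vertex_cover({e for e in edges if u not in e}, k - 1)
        or has_vertex_cover({e for e in edges if v not in e}, k - 1)
    )
```

    On a star graph the high-degree rule alone resolves the instance, while on a long path it never fires and all the work falls to branching, which gives some intuition for why degree distribution matters.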

    Genomic (taxonomy and symbiosis) and phenotypic (biological control of phytopathogens) characterization of bacteria isolated from common bean in the SEMIA Collection: taxonomic revision of the order Rhizobiales (Hyphomicrobiales)

    The SEMIA Collection has officially existed since 1975 and is an international reference in the field of inoculants. This collection holds more than 1,200 strains of bacteria isolated from the nodules of 171 legumes of agricultural importance, 98 of which are recommended for use in inoculants. A large part of the SEMIA Collection was identified using biochemical characteristics, PCR based on repetitive elements, serological and host plant identification, and, to a lesser extent, partial sequencing of the 16S rRNA gene. However, taxonomic information on the SEMIA strains based on currently accepted genome-based molecular methods is still lacking. In view of this, the present work aimed to i) elucidate the potential of SEMIA strains for the biological control of pathogenic fungi and ii) solve taxonomy problems within the SEMIA Collection and the order Rhizobiales (Hyphomicrobiales) itself. Chapter I, “Rhizobia for biological control of plant diseases”, is a review of the mechanisms behind the efficacy of rhizobia in the biocontrol of diseases caused by different classes of plant pathogens. Chapter II, entitled “Rhizobium strains in the biological control of the phytopathogenic fungi Sclerotium (Athelia) rolfsii on the common bean”, is a research article that evaluated 78 common bean isolates from the SEMIA culture collection to identify biocontrol agents against the plant pathogen S. rolfsii. We demonstrated that strains isolated from nodules can be strong antagonists of S. rolfsii growth and can be effective in controlling the disease caused by this pathogen in the field. In Chapter III, “Reclassification of Ochrobactrum lupini as a later heterotypic synonym of Ochrobactrum anthropi based on whole-genome sequence analysis”, we demonstrated with phylogenetic, genomic, phenotypic, and chemotaxonomic data that O. lupini should be considered the same species as O. anthropi. Chapter IV, “Genomic metrics applied to Rhizobiales (Hyphomicrobiales): species reclassification, identification of unauthentic genomes and false type strains”, presents the updated taxonomy of the order Hyphomicrobiales, based on 270,400 comparisons analyzed with a 95% ANI cut-off to extract high-identity genome clusters using the described ProKlust tool. This work led to a series of proposals for taxonomic reclassifications, in addition to the discovery of genome accessions that are not from the genuine type strains used for the respective species descriptions, as well as cases of misuse of the term “type strain” in the database. In Chapter V, “Analysis of 95+ genomes from the common-bean branch from SEMIA collection: new genomospecies, alternative nitrogenases, horizontal gene transfer events, and unexpected genera of nodule-associated bacteria”, we sequenced the genomes of 96 SEMIA strains, reporting 15 genospecies clusters as well as 12 isolated genospecies that arose from 1,322,500 pairwise ANI comparisons between the SEMIA strains and 1,053 genomes belonging to Burkholderiaceae, Comamonadaceae, Mycobacteriaceae, Rhizobiaceae, and Xanthomonadaceae. The strains were identified as belonging to nine different Rhizobium species, Agrobacterium radiobacter, Pararhizobium giardinii, Paraburkholderia fungorum and the putative nodule-associated species Mycobacterium monacense, Stenotrophomonas maltophilia, and Variovorax guangxiensis. Around one-third of the collection was identified as potential new species. The pangenome analysis of the SEMIA strains resulted in 50,221 gene clusters containing 604,752 genes. Genes related to alternative nitrogenases were detected among representatives of M. monacense, P. fungorum and V. guangxiensis, as well as in the putative new species G11 and G9. nifH homologs were found exclusively in 55 strains belonging to Rhizobium. Overlap with extrachromosomal sequences was detected only among representatives of Rhizobium and P. fungorum. Multiple transposase genes were located upstream and downstream of the detected nifHDKENX and nifHDKE operons, indicating horizontal gene transfer (HGT) events. A wide phylogenetic distribution was found at the family level, and a notable number (≥40) of putative transferred genes was found, especially among 12 strains, including putative transfer events from other domains, such as the botanical family Euphorbiaceae, Aspergillaceae, and Siphoviridae. Putative biosynthetic gene clusters were identified. Reclassification of over 25 bacterial species was also proposed based on comparisons between the type-strain genomes.
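    Extracting high-identity genome clusters at a 95% ANI cut-off can be pictured with the sketch below. It is not ProKlust's algorithm, whose details the abstract does not give; it simply links any two genomes whose ANI meets the cut-off and reports connected components (single linkage), so that singletons correspond to "isolated" genospecies. The function name, the input dictionary `ani`, and the genome labels in the usage note are hypothetical.

```python
def cluster_by_ani(genomes, ani, cutoff=95.0):
    """Group genomes into connected components of the graph ANI >= cutoff.

    `ani` maps (genome_a, genome_b) pairs to percent identity values.
    """
    parent = {g: g for g in genomes}

    def find(g):
        # Union-find lookup with path halving.
        while parent[g] != g:
            parent[g] = parent[parent[g]]
            g = parent[g]
        return g

    for (a, b), value in ani.items():
        if a != b and value >= cutoff:
            parent[find(a)] = find(b)

    groups = {}
    for g in genomes:
        groups.setdefault(find(g), set()).add(g)
    # Largest clusters first; ties broken by sorted member names.
    return sorted(groups.values(), key=lambda s: (-len(s), sorted(s)))
```

    For example, with `ani = {("g1", "g2"): 97.2, ("g2", "g3"): 95.0, ("g1", "g3"): 94.1, ("g3", "g4"): 80.0}`, the genomes g1, g2 and g3 fall into one cluster even though g1 and g3 are below the cut-off with each other, which is the usual caveat of single linkage; a stricter tool might require every pair in a cluster to pass.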

    Contextual Analysis of Large-Scale Biomedical Associations for the Elucidation and Prioritization of Genes and their Roles in Complex Disease

    A vast number of biomedical associations are easily accessible in public resources, spanning gene-disease associations, tissue-specific gene expression, gene function and pathway annotations, and many other data types. Despite this mass of data, information most relevant to the study of a particular disease remains loosely coupled and difficult to incorporate into ongoing research. Current public databases are difficult to navigate and do not interoperate well due to the plethora of interfaces and varying biomedical concept identifiers used. Because no coherent display of data within a specific problem domain is available, finding the latent relationships associated with a disease of interest is impractical. This research describes a method for extracting the contextual relationships embedded within associations relevant to a disease of interest. After applying the method to a small test data set, a large-scale integrated association network is constructed for application of a network propagation technique that helps uncover more distant latent relationships. Together these methods are adept at uncovering highly relevant relationships without any a priori knowledge of the disease of interest. The combined contextual search and relevance methods power a tool that makes pertinent biomedical associations easier to find, easier to assimilate into ongoing work, and more prominent than they are in currently available databases. Increasing the accessibility of current information is an important component of understanding high-throughput experimental results and surviving the data deluge.

    Multipartite Graph Algorithms for the Analysis of Heterogeneous Data

    The explosive growth in the rate of data generation in recent years threatens to outpace the growth in computer power, motivating the need for new, scalable algorithms and big data analytic techniques. No field may be more emblematic of this data deluge than the life sciences, where technologies such as high-throughput mRNA arrays and next generation genome sequencing are routinely used to generate datasets of extreme scale. Data from experiments in genomics, transcriptomics, metabolomics and proteomics are continuously being added to existing repositories. A goal of exploratory analysis of such omics data is to illuminate the functions and relationships of biomolecules within an organism. This dissertation describes the design, implementation and application of graph algorithms, with the goal of seeking dense structure in data derived from omics experiments in order to detect latent associations between often heterogeneous entities, such as genes, diseases and phenotypes. Exact combinatorial solutions are developed and implemented, rather than relying on approximations or heuristics, even when problems are exceedingly large and/or difficult. Datasets to which the algorithms are applied include time series transcriptomic data from an experiment on the developing mouse cerebellum, gene expression data measuring acute ethanol response in the prefrontal cortex, and a predicted protein-protein interaction network. A bipartite graph model is used to integrate heterogeneous data types, such as genes with phenotypes and microbes with mouse strains. The techniques are then extended to a multipartite algorithm to enumerate dense substructure in multipartite graphs, constructed using data from three or more heterogeneous sources, with applications to functional genomics. Several new theoretical results are given regarding multipartite graphs and the multipartite enumeration algorithm. In all cases, practical implementations are demonstrated to expand the frontier of computational feasibility.