184 research outputs found

    Large scale clustering of protein sequences with FORCE - a layout based heuristic for weighted cluster editing

    Wittkop T, Baumbach J, Lobo FP, Rahmann S. Large scale clustering of protein sequences with FORCE - a layout based heuristic for weighted cluster editing. BMC Bioinformatics. 2007;8(1):396. Background: Detecting groups of functionally related proteins from their amino acid sequence alone has been a long-standing challenge in computational genome research. Several clustering approaches, following different strategies, have been published to attack this problem. Today, new sequencing technologies provide huge amounts of sequence data that has to be efficiently clustered with constant or increased accuracy, at increased speed. Results: We advocate that the model of weighted cluster editing, also known as transitive graph projection, is well suited to protein clustering. We present the FORCE heuristic that is based on transitive graph projection and clusters arbitrary sets of objects, given pairwise similarity measures. In particular, we apply FORCE to the problem of protein clustering and show that it outperforms the most popular existing clustering tools (Spectral clustering, TribeMCL, GeneRAGE, Hierarchical clustering, and Affinity Propagation). Furthermore, we show that FORCE is able to handle huge datasets by calculating clusters for all 192,187 prokaryotic protein sequences (66 organisms) obtained from the COG database. Finally, FORCE is integrated into the corynebacterial reference database CoryneRegNet. Conclusion: FORCE is an applicable alternative to existing clustering algorithms. Its theoretical foundation, weighted cluster editing, can outperform other clustering paradigms on protein homology clustering. FORCE is open source and implemented in Java. The software, including the source code, the clustering results for COG and CoryneRegNet, and all evaluation datasets are available at http://gi.cebitec.uni-bielefeld.de/comet/force/
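To make the weighted cluster editing objective concrete, here is a minimal Python sketch that scores a candidate clustering. It is not the FORCE heuristic itself. It assumes a common formulation of the problem: two objects are joined by an edge when their similarity exceeds a threshold, and adding or removing an edge costs the absolute difference between the similarity and the threshold. The function name `editing_cost`, the dictionary layout, and the default similarity of 0.0 for unlisted pairs are all illustrative assumptions.

```python
from itertools import combinations


def editing_cost(similarity, clusters, threshold):
    """Cost of turning the thresholded similarity graph into the given clusters.

    similarity: dict mapping unordered pairs (u, v) to a similarity score
                (assumed layout; pairs that are not listed count as 0.0).
    clusters:   iterable of disjoint sets of nodes (the candidate clustering).
    threshold:  pairs with similarity above this value count as edges.
    """
    # Map every node to the index of the cluster that contains it.
    label = {node: k for k, members in enumerate(clusters) for node in members}
    cost = 0.0
    for u, v in combinations(sorted(label), 2):
        s = similarity.get((u, v), similarity.get((v, u), 0.0))
        has_edge = s > threshold
        same_cluster = label[u] == label[v]
        # An edge between clusters must be deleted; a missing edge inside
        # a cluster must be inserted. Either edit costs |s - threshold|.
        if has_edge != same_cluster:
            cost += abs(s - threshold)
    return cost
```

A heuristic for the problem would search over clusterings for one that keeps this cost low; the sketch only evaluates a clustering that is already given.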

    Efficient algorithms for gene cluster detection in prokaryotic genomes

    Schmidt T. Efficient algorithms for gene cluster detection in prokaryotic genomes. Bielefeld (Germany): Bielefeld University; 2005. Research in genomics has expanded rapidly in the last few years, and the availability of completely sequenced genomes continuously increases due to the use of semi-automatic sequencing machines. These sequences, mostly prokaryotic ones, are also well annotated, which means that the positions of their genes and parts of their regulatory or metabolic pathways are known. A new task in the field of bioinformatics now is to gain gene or protein information from the comparison of genomes on a higher level. In the approach of "comparative genomics", researchers in bioinformatics are attempting to locate groups or clusters of orthologous genes that may have the same function in multiple genomes. This research is often anchored on the simple but biologically verified fact that functionally related proteins are usually coded by genes placed in a region of close genomic neighborhood in different species. From an algorithmic and combinatorial point of view, the first descriptions of the concept of "closely placed genes" were only fragmentary, and sometimes confusing. The given algorithms often lack the necessary grounds to prove their correctness or assess their complexity. Within the first formal models of a conserved genomic neighborhood, genomes are often represented as permutations of their genes, and common intervals, i.e. intervals containing the same set of genes, are interpreted as gene clusters. The major disadvantage of representing genomes as permutations is that paralogous copies of the same gene inside one genome cannot be modelled. Since large genomes in particular contain numerous paralogous genes, this model is insufficient for use on real genomic data.
In this work, we consider a modified model of gene clusters that allows paralogs, simply by representing genomes as sequences rather than permutations of genes. We define common intervals based on this model, and we present a simple algorithm that finds all common intervals of two sequences in Θ(n²) time using Θ(n²) space. Another, more complicated algorithm runs in O(n²) time and uses only linear space. We also show how to extend these algorithms to more than two genomes and present the implementation of the algorithms as well as the visualization of the located clusters in the tool Gecko. Since the creation of the string representation of a set of genomes is a non-trivial task, we also present the data preparation tool GhostFam that groups all genes from the given set of genomes to their families of homologs. In the evaluation on a set of 20 bacterial genomes, we show that with the presented approach it is possible to correctly locate gene clusters that are known from the literature, and to successfully predict new groups of functionally related genes
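The notion of a common interval on sequences can be illustrated with a small Python sketch. This is a naive enumeration, not one of the algorithms from the thesis: it visits all O(n²) substrings of each sequence and stores their gene sets, so it does not achieve the stated time or space bounds. The function names and the `min_size` parameter are illustrative assumptions, and genes are assumed to be already replaced by their family identifiers.

```python
def interval_sets(seq):
    """Map each gene set to the (start, end) substrings of seq that realise it."""
    found = {}
    for i in range(len(seq)):
        seen = set()
        for j in range(i, len(seq)):
            seen.add(seq[j])  # paralogs simply re-add an existing family
            found.setdefault(frozenset(seen), []).append((i, j))
    return found


def common_intervals(s1, s2, min_size=2):
    """Gene sets that occur as a substring's gene set in both sequences.

    Returns a dict: gene set -> (locations in s1, locations in s2),
    where each location is an inclusive (start, end) index pair.
    """
    a, b = interval_sets(s1), interval_sets(s2)
    return {c: (a[c], b[c]) for c in a.keys() & b.keys() if len(c) >= min_size}
```

Because genomes are sequences here rather than permutations, one gene set can have several locations in the same genome, which is why each set maps to a list of positions.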

    Systematic identification of functional plant modules through the integration of complementary data sources

    A major challenge is to unravel how genes interact and are regulated to exert specific biological functions. The integration of genome-wide functional genomics data, followed by the construction of gene networks, provides a powerful approach to identify functional gene modules. Large-scale expression data, functional gene annotations, experimental protein-protein interactions, and transcription factor-target interactions were integrated to delineate modules in Arabidopsis (Arabidopsis thaliana). The different experimental input data sets showed little overlap, demonstrating the advantage of combining multiple data types to study gene function and regulation. In the set of 1,563 modules covering 13,142 genes, most modules displayed strong coexpression, but functional and cis-regulatory coherence was less prevalent. Highly connected hub genes showed a significant enrichment toward embryo lethality and evidence for cross talk between different biological processes. Comparative analysis revealed that 58% of the modules showed conserved coexpression across multiple plants. Using module-based functional predictions, 5,562 genes were annotated, and an evaluation experiment disclosed that, based on 197 recently experimentally characterized genes, 38.1% of these functions could be inferred through the module context. Examples of confirmed genes of unknown function related to cell wall biogenesis, xylem and phloem pattern formation, cell cycle, hormone stimulus, and circadian rhythm highlight the potential to identify new gene functions. The module-based predictions offer new biological hypotheses for functionally unknown genes in Arabidopsis (1,701 genes) and six other plant species (43,621 genes). Furthermore, the inferred modules provide new insights into the conservation of coexpression and coregulation as well as a starting point for comparative functional annotation

    MrsRF: an efficient MapReduce algorithm for analyzing large collections of evolutionary trees

    Background: MapReduce is a parallel framework that has been used effectively to design large-scale parallel applications for large computing clusters. In this paper, we evaluate the viability of the MapReduce framework for designing phylogenetic applications. The problem of interest is generating the all-to-all Robinson-Foulds distance matrix, which has many applications for visualizing and clustering large collections of evolutionary trees. We introduce MrsRF (MapReduce Speeds up RF), a multi-core algorithm to generate a t × t Robinson-Foulds distance matrix between t trees using the MapReduce paradigm. Results: We studied the performance of our MrsRF algorithm on two large biological tree sets consisting of 20,000 trees of 150 taxa each and 33,306 trees of 567 taxa each. Our experiments show that MrsRF is a scalable approach reaching a speedup of over 18 on 32 total cores. Our results also show that achieving top speedup on a multi-core cluster requires different cluster configurations. Finally, we show how to use an RF matrix to summarize collections of phylogenetic trees visually. Conclusion: Our results show that MapReduce is a promising paradigm for developing multi-core phylogenetic applications. The results also demonstrate that different multi-core configurations must be tested in order to obtain optimum performance. We conclude that RF matrices play a critical role in developing techniques to summarize large collections of trees.
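For readers unfamiliar with the t × t Robinson-Foulds matrix, the following Python sketch computes it sequentially. It is not MrsRF and uses no MapReduce. It assumes that trees are given as nested tuples of taxon names over the same taxon set, and it takes the RF distance to be the number of non-trivial bipartitions present in exactly one of the two trees. The names `leaves`, `splits`, and `rf_matrix` are illustrative.

```python
from itertools import combinations


def leaves(tree):
    """Set of taxon names in a nested-tuple tree (a leaf is a string)."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def splits(tree, taxa):
    """Non-trivial bipartitions, each stored as the side without the reference taxon."""
    ref = min(taxa)  # fixed reference taxon so every split has one canonical form
    result = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        clade = frozenset().union(*(visit(child) for child in node))
        side = taxa - clade if ref in clade else clade
        # Keep only splits with at least two taxa on each side.
        if 1 < len(side) < len(taxa) - 1:
            result.add(side)
        return clade

    visit(tree)
    return result


def rf_matrix(trees):
    """Symmetric matrix of pairwise Robinson-Foulds distances."""
    taxa = leaves(trees[0])
    split_sets = [splits(t, taxa) for t in trees]
    n = len(trees)
    matrix = [[0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        # Symmetric difference: splits found in exactly one of the two trees.
        matrix[i][j] = matrix[j][i] = len(split_sets[i] ^ split_sets[j])
    return matrix
```

Each matrix entry depends only on the split sets of two trees, so the pairwise loop is the part a parallel scheme could distribute across workers.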

    PRIMER- A statistical curtain raiser. In: Winter School on Impact of Climate Change on Indian Marine Fisheries held at CMFRI, Cochin 18.1.2008 to 7.2.2008

    PRIMER (Plymouth Routines In Multivariate Ecological Research) is a software package for analyzing data arising from ecological and environmental investigations. But the scope of the software does not stop there. With customization and careful pre- and post-processing, it can be applied to a wider range of problems. While it can be grouped alongside other multi-purpose statistical software such as SPSS and SYSTAT, it differs significantly from them in its typical usage and in the output it generates and how that output is interpreted. It is one of the few software packages that prioritize multivariate data analysis as suited to environmental and ecological studies

    A Consistent Phylogenetic Backbone for the Fungi

    The kingdom of fungi provides model organisms for biotechnology, cell biology, genetics, and life sciences in general. Only when their phylogenetic relationships are stably resolved, can individual results from fungal research be integrated into a holistic picture of biology. However, and despite recent progress, many deep relationships within the fungi remain unclear. Here, we present the first phylogenomic study of an entire eukaryotic kingdom that uses a consistency criterion to strengthen phylogenetic conclusions. We reason that branches (splits) recovered with independent data and different tree reconstruction methods are likely to reflect true evolutionary relationships. Two complementary phylogenomic data sets based on 99 fungal genomes and 109 fungal expressed sequence tag (EST) sets analyzed with four different tree reconstruction methods shed light from different angles on the fungal tree of life. Eleven additional data sets address specifically the phylogenetic position of Blastocladiomycota, Ustilaginomycotina, and Dothideomycetes, respectively. The combined evidence from the resulting trees supports the deep-level stability of the fungal groups toward a comprehensive natural system of the fungi. In addition, our analysis reveals methodologically interesting aspects. Enrichment for EST encoded data—a common practice in phylogenomic analyses—introduces a strong bias toward slowly evolving and functionally correlated genes. Consequently, the generalization of phylogenomic data sets as collections of randomly selected genes cannot be taken for granted. A thorough characterization of the data to assess possible influences on the tree reconstruction should therefore become a standard in phylogenomic analyses