514 research outputs found

    Family classification without domain chaining

    Motivation: Classification of gene and protein sequences into homologous families, i.e. sets of sequences that share common ancestry, is an essential step in comparative genomic analyses. This is typically achieved by construction of a sequence homology network, followed by clustering to identify dense subgraphs corresponding to families. Accurate classification of single domain families is now within reach due to major algorithmic advances in remote homology detection and graph clustering. However, classification of multidomain families remains a significant challenge. The presence of the same domain in sequences that do not share common ancestry introduces false edges in the homology network that link unrelated families and stymie clustering algorithms.
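
    The chaining effect described in this abstract can be illustrated with a minimal sketch. It assumes the simplest possible clustering (connected components of the homology network), which is not the method of the paper, and all sequence names and edges are hypothetical toy data.

```python
from collections import defaultdict

def connected_components(nodes, edges):
    """Group nodes into connected components using union-find."""
    parent = {n: n for n in nodes}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in edges:
        parent[find(a)] = find(b)

    groups = defaultdict(set)
    for n in nodes:
        groups[find(n)].add(n)
    return sorted(sorted(g) for g in groups.values())

# Hypothetical toy network: two sequences from family A, two from family B.
nodes = ["famA1", "famA2", "famB1", "famB2"]
true_edges = [("famA1", "famA2"), ("famB1", "famB2")]
# Assumed false edge: famA2 and famB1 share an inserted domain
# but do not share common ancestry.
false_edges = [("famA2", "famB1")]

families_clean = connected_components(nodes, true_edges)
families_chained = connected_components(nodes, true_edges + false_edges)
```

    With only the true edges the two families are recovered; adding the single false edge merges them into one component, which is the failure mode the abstract attributes to shared domains.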

    Clustering protein sequences with a novel metric transformed from sequence similarity scores and sequence alignments with neural networks

    BACKGROUND: The sequencing of the human genome has enabled us to access a comprehensive list of genes (both experimental and predicted) for further analysis. While a majority of the approximately 30000 known and predicted human coding genes are characterized and have been assigned at least one function, there remains a fair number of genes (about 12000) for which no annotation has been made. The recent sequencing of other genomes has provided us with a huge amount of auxiliary sequence data which could help in the characterization of the human genes. Clustering these sequences into families is one of the first steps to perform comparative studies across several genomes. RESULTS: Here we report a novel clustering algorithm (CLUGEN) that has been used to cluster sequences of experimentally verified and predicted proteins from all sequenced genomes using a novel distance metric which is a neural network score between a pair of protein sequences. This distance metric is based on the pairwise sequence similarity score and the similarity between their domain structures. The distance metric is the probability that a pair of protein sequences are of the same Interpro family/domain, which facilitates the modelling of transitive homology closure to detect remote homologues. The hierarchical average clustering method is applied with the new distance metric. CONCLUSION: Benchmarking studies of our algorithm versus those reported in the literature show that our algorithm provides clustering results with lower false positive and false negative rates. The clustering algorithm is applied to cluster several eukaryotic genomes and several dozen prokaryotic genomes.
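
    As a rough illustration of hierarchical average clustering over a probability-based score, the sketch below assumes the distance is taken as 1 minus the same-family probability and that merging stops at a fixed cutoff. Both choices, the function name `average_linkage`, and the probabilities are assumptions for illustration, not details of CLUGEN.

```python
def average_linkage(names, dist, cutoff):
    """Agglomerative clustering: repeatedly merge the two clusters with the
    smallest mean pairwise distance until that distance exceeds `cutoff`."""
    clusters = [[n] for n in names]

    def d(a, b):
        # Distances are stored once per unordered pair.
        return dist[(a, b)] if (a, b) in dist else dist[(b, a)]

    def mean_dist(c1, c2):
        return sum(d(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))

    while len(clusters) > 1:
        score, i, j = min(
            (mean_dist(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        if score > cutoff:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return sorted(sorted(c) for c in clusters)

# Hypothetical same-family probabilities for four proteins.
prob = {
    ("p1", "p2"): 0.95, ("p1", "p3"): 0.90, ("p2", "p3"): 0.85,
    ("p1", "p4"): 0.10, ("p2", "p4"): 0.05, ("p3", "p4"): 0.20,
}
# Assumed conversion from probability to distance.
dist = {pair: 1.0 - p for pair, p in prob.items()}

families = average_linkage(["p1", "p2", "p3", "p4"], dist, cutoff=0.5)
```

    Here `p1`, `p2` and `p3` end up in one family and `p4` stays alone, because its mean distance to the merged cluster is above the cutoff.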

    Ortholog identification in the presence of domain architecture rearrangement

    Ortholog identification is used in gene functional annotation, species phylogeny estimation, phylogenetic profile construction and many other analyses. Bioinformatics methods for ortholog identification are commonly based on pairwise protein sequence comparisons between whole genomes. Phylogenetic methods of ortholog identification have also been developed; these methods can be applied to protein data sets sharing a common domain architecture or which share a single functional domain but differ outside this region of homology. While promiscuous domains represent a challenge to all orthology prediction methods, overall structural similarity is highly correlated with proximity in a phylogenetic tree, conferring a degree of robustness to phylogenetic methods. In this article, we review the issues involved in orthology prediction when data sets include sequences with structurally heterogeneous domain architectures, with particular attention to automated methods designed for high-throughput application, and present a case study to illustrate the challenges in this area.

    CREST - a large and diverse superfamily of putative transmembrane hydrolases

    Background: A number of membrane-spanning proteins possess enzymatic activity and catalyze important reactions involving proteins, lipids or other substrates located within or near lipid bilayers. Alkaline ceramidases are seven-transmembrane proteins that hydrolyze the amide bond in ceramide to form sphingosine. Recently, a group of putative transmembrane receptors called progestin and adipoQ receptors (PAQRs) were found to be distantly related to alkaline ceramidases, raising the possibility that they may also function as membrane enzymes. Results: Using sensitive similarity search methods, we identified statistically significant sequence similarities among several transmembrane protein families including alkaline ceramidases and PAQRs. They were unified into a large and diverse superfamily of putative membrane-bound hydrolases called CREST (alkaline ceramidase, PAQR receptor, Per1, SID-1 and TMEM8). The CREST superfamily embraces a plethora of cellular functions and biochemical activities, including putative lipid-modifying enzymes such as ceramidases and the Per1 family of putative phospholipases involved in lipid remodeling of GPI-anchored proteins, putative hormone receptors, bacterial hemolysins, the TMEM8 family of putative tumor suppressors, and the SID-1 family of putative double-stranded RNA transporters involved in RNA interference. Extensive similarity searches and clustering analysis also revealed several groups of proteins with unknown function in the CREST superfamily. Members of the CREST superfamily share seven predicted core transmembrane segments with several conserved sequence motifs. Conclusions: Universal conservation of a set of histidine and aspartate residues across all groups in the CREST superfamily, coupled with independent discoveries of hydrolase activities in alkaline ceramidases and the Per1 family as well as results from previous mutational studies of Per1, suggests that the majority of CREST members are metal-dependent hydrolases. Reviewers: This article was reviewed by Kira S. Markarova, Igor B. Zhulin and Rob Knight.

    Large scale clustering of protein sequences with FORCE - a layout based heuristic for weighted cluster editing

    Wittkop T, Baumbach J, Lobo FP, Rahmann S. Large scale clustering of protein sequences with FORCE - a layout based heuristic for weighted cluster editing. BMC Bioinformatics. 2007;8(1): 396. Background: Detecting groups of functionally related proteins from their amino acid sequence alone has been a long-standing challenge in computational genome research. Several clustering approaches, following different strategies, have been published to attack this problem. Today, new sequencing technologies provide huge amounts of sequence data that has to be efficiently clustered with constant or increased accuracy, at increased speed. Results: We advocate that the model of weighted cluster editing, also known as transitive graph projection, is well-suited to protein clustering. We present the FORCE heuristic that is based on transitive graph projection and clusters arbitrary sets of objects, given pairwise similarity measures. In particular, we apply FORCE to the problem of protein clustering and show that it outperforms the most popular existing clustering tools (Spectral clustering, TribeMCL, GeneRAGE, Hierarchical clustering, and Affinity Propagation). Furthermore, we show that FORCE is able to handle huge datasets by calculating clusters for all 192 187 prokaryotic protein sequences (66 organisms) obtained from the COG database. Finally, FORCE is integrated into the corynebacterial reference database CoryneRegNet. Conclusion: FORCE is an applicable alternative to existing clustering algorithms. Its theoretical foundation, weighted cluster editing, can outperform other clustering paradigms on protein homology clustering. FORCE is open source and implemented in Java. The software, including the source code, the clustering results for COG and CoryneRegNet, and all evaluation datasets are available at http://gi.cebitec.uni-bielefeld.de/comet/force/
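
    The objective behind weighted cluster editing can be sketched as a cost function. The sketch assumes the usual formulation in which a pair is an edge when its similarity exceeds a threshold, and the cost of adding or deleting an edge is the absolute difference between similarity and threshold. It only scores a given partition and does not reproduce the FORCE layout heuristic; the function name, similarities and threshold are hypothetical.

```python
from itertools import combinations

def editing_cost(partition, sim, threshold):
    """Total cost of editing the thresholded similarity graph into the
    disjoint cliques given by `partition` (a list of sets of node names)."""
    cluster_of = {n: i for i, c in enumerate(partition) for n in c}
    cost = 0.0
    for u, v in combinations(sorted(cluster_of), 2):
        # Pairs without a recorded similarity are treated as similarity 0.
        w = sim.get((u, v), sim.get((v, u), 0.0)) - threshold
        same = cluster_of[u] == cluster_of[v]
        if same and w < 0:
            cost += -w   # missing edge inside a cluster must be inserted
        elif not same and w > 0:
            cost += w    # edge between clusters must be deleted
    return cost

# Hypothetical pairwise similarities and threshold.
sim = {("a", "b"): 9.0, ("a", "c"): 8.0, ("b", "c"): 4.0, ("c", "d"): 6.0}
threshold = 5.0

cost_abc_d = editing_cost([{"a", "b", "c"}, {"d"}], sim, threshold)
cost_ab_cd = editing_cost([{"a", "b"}, {"c", "d"}], sim, threshold)
```

    Under these toy values the partition {a, b, c}, {d} costs 2.0 and {a, b}, {c, d} costs 3.0, so the first is the better transitive projection of the two. A heuristic such as FORCE searches for a low-cost partition rather than scoring a fixed one.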

    On subset seeds for protein alignment

    We apply the concept of subset seeds proposed in [1] to similarity search in protein sequences. The main question studied is the design of efficient seed alphabets to construct seeds with optimal sensitivity/selectivity trade-offs. We propose several different design methods and use them to construct several alphabets. We then perform a comparative analysis of seeds built over those alphabets and compare them with the standard BLASTP seeding method [2], [3], as well as with the family of vector seeds proposed in [4]. While the formalism of subset seeds is less expressive (but less costly to implement) than the cumulative principle used in BLASTP and vector seeds, our seeds show a similar or even better performance than BLASTP on Bernoulli models of proteins compatible with the common BLOSUM62 matrix. Finally, we perform a large-scale benchmarking of our seeds against several main databases of protein alignments. Here again, the results show a comparable or better performance of our seeds vs. BLASTP. Comment: IEEE/ACM Transactions on Computational Biology and Bioinformatics (2009)
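
    A subset seed can be sketched as a word whose letters each accept a set of amino-acid pairs. The three-letter seed alphabet below (`#` for identity, `@` for same group, `_` for any pair) and the amino-acid grouping are hypothetical choices for illustration; they are not the alphabets designed in the paper.

```python
# Hypothetical grouping of the 20 amino acids (an assumption, not from the paper).
GROUPS = ["ILMV", "FWY", "KRH", "DE", "ST", "NQ", "AG", "C", "P"]
GROUP_OF = {aa: i for i, group in enumerate(GROUPS) for aa in group}

def letter_matches(letter, a, b):
    """Return True if the aligned pair (a, b) is accepted by the seed letter."""
    if letter == "#":   # identity only
        return a == b
    if letter == "@":   # any pair from the same group
        return GROUP_OF[a] == GROUP_OF[b]
    if letter == "_":   # any pair
        return True
    raise ValueError(f"unknown seed letter: {letter!r}")

def seed_hits(seed, query, target):
    """List all (i, j) such that the seed accepts the ungapped window
    query[i:i+k] aligned against target[j:j+k], where k = len(seed)."""
    k = len(seed)
    return [
        (i, j)
        for i in range(len(query) - k + 1)
        for j in range(len(target) - k + 1)
        if all(letter_matches(s, query[i + p], target[j + p])
               for p, s in enumerate(seed))
    ]

hits_subset = seed_hits("#@_#", "KLDE", "KVAE")
hits_exact = seed_hits("####", "KLDE", "KVAE")
```

    The subset seed `#@_#` reports a hit at `(0, 0)` because L and V fall in the same assumed group and the third position is unconstrained, whereas the exact seed `####` reports none. Each letter is a plain membership test, which reflects the lower implementation cost the abstract contrasts with score-accumulating approaches.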