514 research outputs found

Family classification without domain chaining

Author: Björklund
Bolten
Crabtree
D. Durand
Demuth
Enright
Fitch
Heger
Heinicke
Huynen
J. M. Joseph
Krause
Paccanaro
Rahmann
Sasson
Song
Song
Tatusov
Wittkop
Wu
Publication venue: Oxford University Press

Motivation: Classification of gene and protein sequences into homologous families, i.e. sets of sequences that share common ancestry, is an essential step in comparative genomic analyses. This is typically achieved by construction of a sequence homology network, followed by clustering to identify dense subgraphs corresponding to families. Accurate classification of single domain families is now within reach due to major algorithmic advances in remote homology detection and graph clustering. However, classification of multidomain families remains a significant challenge. The presence of the same domain in sequences that do not share common ancestry introduces false edges in the homology network that link unrelated families and stymie clustering algorithms.
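The false-edge problem described in this abstract can be illustrated with a minimal sketch. It is not the authors' method: the sequence names, the domain assignments and the rule "draw an edge whenever two sequences share a domain" are all hypothetical, chosen only to show how one multidomain sequence can chain two otherwise separate groups into a single cluster under naive connected-component clustering.

```python
from collections import defaultdict

def connected_components(nodes, edges):
    """Return the connected components of an undirected graph as sorted lists."""
    adjacency = defaultdict(set)
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)
    seen, components = set(), []
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], []
        while stack:  # iterative depth-first search
            node = stack.pop()
            component.append(node)
            for neighbor in adjacency[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        components.append(sorted(component))
    return components

# Hypothetical toy data: each sequence with the domains it contains.
# kinase_2 carries an extra SH2 domain that it shares with the adaptors.
domains = {
    "kinase_1": {"Kinase"},
    "kinase_2": {"Kinase", "SH2"},
    "adaptor_1": {"SH2"},
    "adaptor_2": {"SH2"},
}

names = sorted(domains)
# Assumed edge rule: any shared domain produces a homology edge.
edges = [
    (a, b)
    for i, a in enumerate(names)
    for b in names[i + 1:]
    if domains[a] & domains[b]
]

clusters = connected_components(names, edges)
print(clusters)
```

In this toy network the shared SH2 domain links the adaptors to kinase_2, and kinase_2 links to kinase_1, so all four sequences fall into one component even though kinase_1 and the adaptors have no domain in common.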

Crossref

PubMed Central

Clustering protein sequences with a novel metric transformed from sequence similarity scores and sequence alignments with neural networks

Author: Cai Richard
Chirn Gung-Wei
Ma Qicheng
Nirmala NR
Szustakowski Joseph D
Publication venue: BioMed Central
Publication date: 01/01/2005

BACKGROUND: The sequencing of the human genome has enabled us to access a comprehensive list of genes (both experimental and predicted) for further analysis. While a majority of the approximately 30000 known and predicted human coding genes are characterized and have been assigned at least one function, there remains a fair number of genes (about 12000) for which no annotation has been made. The recent sequencing of other genomes has provided us with a huge amount of auxiliary sequence data which could help in the characterization of the human genes. Clustering these sequences into families is one of the first steps to perform comparative studies across several genomes. RESULTS: Here we report a novel clustering algorithm (CLUGEN) that has been used to cluster sequences of experimentally verified and predicted proteins from all sequenced genomes using a novel distance metric which is a neural network score between a pair of protein sequences. This distance metric is based on the pairwise sequence similarity score and the similarity between their domain structures. The distance metric is the probability that a pair of protein sequences are of the same Interpro family/domain, which facilitates the modelling of transitive homology closure to detect remote homologues. The hierarchical average clustering method is applied with the new distance metric. CONCLUSION: Benchmarking studies of our algorithm versus those reported in the literature show that our algorithm provides clustering results with lower false positive and false negative rates. The clustering algorithm is applied to cluster several eukaryotic genomes and several dozen prokaryotic genomes.
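As a rough illustration of hierarchical average clustering on a pairwise score, the sketch below merges clusters by mean pairwise distance until a cutoff is exceeded. It is not CLUGEN: the same-family probabilities are made-up values standing in for a neural network score, converting them to a distance as 1 - p is an assumption, and the 0.5 cutoff is arbitrary.

```python
def average_linkage(items, distance, threshold):
    """Agglomerative clustering: repeatedly merge the two clusters with the
    smallest mean pairwise distance, stopping once it exceeds threshold."""
    clusters = [[item] for item in items]

    def mean_distance(a, b):
        return sum(distance(x, y) for x in a for y in b) / (len(a) * len(b))

    while len(clusters) > 1:
        # Find the closest pair of clusters under average linkage.
        d, i, j = min(
            (mean_distance(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        if d > threshold:
            break
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return sorted(sorted(c) for c in clusters)

# Hypothetical same-family probabilities for four proteins (keys are sorted pairs).
same_family_prob = {
    ("A", "B"): 0.9, ("A", "C"): 0.2, ("A", "D"): 0.1,
    ("B", "C"): 0.1, ("B", "D"): 0.2, ("C", "D"): 0.8,
}

def prob_distance(x, y):
    # Assumed conversion: high same-family probability means small distance.
    return 1.0 - same_family_prob[tuple(sorted((x, y)))]

families = average_linkage(["A", "B", "C", "D"], prob_distance, threshold=0.5)
print(families)
```

With these toy values the procedure merges A with B and C with D, then stops because the mean distance between the two resulting clusters is above the cutoff.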

Springer - Publisher Connector

PubMed Central

The Novartis Repository

Strongly Connected Components can Predict Protein Structure

Author: Bolten Eva
Schliep Alexander
Schneckener Sebastian
Schomburg Dietmar
Schrader Rainer
Publication venue: Elsevier BV
Publication date: 01/01/2001

Kölner UniversitätsPublikationsServer

Fast, sensitive protein sequence searches using iterative pairwise comparison of hidden Markov models

Author: Remmert Michael
Publication venue: Ludwig-Maximilians-Universität München
Publication date: 01/01/2011

Digitale Hochschulschriften der LMU

MPG.PuRe

Ortholog identification in the presence of domain architecture rearrangement

Author: Abascal
Addou
Altschul
Ashburner
Bairoch
Bateman
Bennett-Lovsey
Brown
Brown
Chen
Chen
Corpet
Delsuc
Dessimoz
Edgar
Eisen
G. M. Shoffner
Galperin
Gilks
Hahn
Hollich
Huelsenbeck
Jones
K. Sjolander
Kanehisa
Kaplan
Krishnamurthy
Kuzniar
Li
Meinel
O'Brien
Orengo
Pati
Pollard
Price
R. S. Datta
Saitou
Saitou
Schnoes
Servant
Sjolander
Sjolander
Sonnhammer
Storm
Storm
Tatusov
van der Heijden
Venter
Y. Shen
Zmasek
Publication venue: Oxford University Press
Publication date: 01/09/2011

Ortholog identification is used in gene functional annotation, species phylogeny estimation, phylogenetic profile construction and many other analyses. Bioinformatics methods for ortholog identification are commonly based on pairwise protein sequence comparisons between whole genomes. Phylogenetic methods of ortholog identification have also been developed; these methods can be applied to protein data sets sharing a common domain architecture or which share a single functional domain but differ outside this region of homology. While promiscuous domains represent a challenge to all orthology prediction methods, overall structural similarity is highly correlated with proximity in a phylogenetic tree, conferring a degree of robustness to phylogenetic methods. In this article, we review the issues involved in orthology prediction when data sets include sequences with structurally heterogeneous domain architectures, with particular attention to automated methods designed for high-throughput application, and present a case study to illustrate the challenges in this area.

Crossref

PubMed Central

eScholarship - University of California

CREST - a large and diverse superfamily of putative transmembrane hydrolases

Author: Douglas P Millay
Eric N Olson
Jimin Pei
Nick V Grishin
Pei Jimin
Publication venue: BioMed Central
Publication date: 01/01/2011

Background: A number of membrane-spanning proteins possess enzymatic activity and catalyze important reactions involving proteins, lipids or other substrates located within or near lipid bilayers. Alkaline ceramidases are seven-transmembrane proteins that hydrolyze the amide bond in ceramide to form sphingosine. Recently, a group of putative transmembrane receptors called progestin and adipoQ receptors (PAQRs) was found to be distantly related to alkaline ceramidases, raising the possibility that they may also function as membrane enzymes. Results: Using sensitive similarity search methods, we identified statistically significant sequence similarities among several transmembrane protein families including alkaline ceramidases and PAQRs. They were unified into a large and diverse superfamily of putative membrane-bound hydrolases called CREST (alkaline ceramidase, PAQR receptor, Per1, SID-1 and TMEM8). The CREST superfamily embraces a plethora of cellular functions and biochemical activities, including putative lipid-modifying enzymes such as ceramidases and the Per1 family of putative phospholipases involved in lipid remodeling of GPI-anchored proteins, putative hormone receptors, bacterial hemolysins, the TMEM8 family of putative tumor suppressors, and the SID-1 family of putative double-stranded RNA transporters involved in RNA interference. Extensive similarity searches and clustering analysis also revealed several groups of proteins with unknown function in the CREST superfamily. Members of the CREST superfamily share seven predicted core transmembrane segments with several conserved sequence motifs.
Conclusions: Universal conservation of a set of histidine and aspartate residues across all groups in the CREST superfamily, coupled with independent discoveries of hydrolase activities in alkaline ceramidases and the Per1 family as well as results from previous mutational studies of Per1, suggests that the majority of CREST members are metal-dependent hydrolases. Reviewers: This article was reviewed by Kira S. Makarova, Igor B. Zhulin and Rob Knight.

Crossref

Springer

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Large scale clustering of protein sequences with FORCE - A layout based heuristic for weighted cluster editing

Author: Baumbach Jan
Lobo Francisco P
Rahmann Sven
Wittkop Tobias
Publication venue: BioMed Central
Publication date: 01/01/2007

Wittkop T, Baumbach J, Lobo FP, Rahmann S. Large scale clustering of protein sequences with FORCE - a layout based heuristic for weighted cluster editing. BMC Bioinformatics. 2007;8(1):396. Background: Detecting groups of functionally related proteins from their amino acid sequence alone has been a long-standing challenge in computational genome research. Several clustering approaches, following different strategies, have been published to attack this problem. Today, new sequencing technologies provide huge amounts of sequence data that has to be efficiently clustered with constant or increased accuracy, at increased speed. Results: We advocate that the model of weighted cluster editing, also known as transitive graph projection, is well-suited to protein clustering. We present the FORCE heuristic that is based on transitive graph projection and clusters arbitrary sets of objects, given pairwise similarity measures. In particular, we apply FORCE to the problem of protein clustering and show that it outperforms the most popular existing clustering tools (Spectral clustering, TribeMCL, GeneRAGE, Hierarchical clustering, and Affinity Propagation). Furthermore, we show that FORCE is able to handle huge datasets by calculating clusters for all 192 187 prokaryotic protein sequences (66 organisms) obtained from the COG database. Finally, FORCE is integrated into the corynebacterial reference database CoryneRegNet. Conclusion: FORCE is an applicable alternative to existing clustering algorithms. Its theoretical foundation, weighted cluster editing, can outperform other clustering paradigms on protein homology clustering. FORCE is open source and implemented in Java. The software, including the source code, the clustering results for COG and CoryneRegNet, and all evaluation datasets are available at http://gi.cebitec.uni-bielefeld.de/comet/force/
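The objective behind weighted cluster editing can be sketched as a cost function: a clustering pays for every positive-weight pair it separates (a deleted edge) and for every negative-weight pair it places together (an inserted edge). The code below only scores candidate clusterings; it is not the FORCE heuristic, and the node names and weights (similarity minus a threshold) are invented for illustration.

```python
from itertools import combinations

def editing_cost(clusters, weight):
    """Total cost of turning the weighted graph into the given disjoint cliques."""
    label = {node: idx for idx, cluster in enumerate(clusters) for node in cluster}
    cost = 0
    for u, v in combinations(sorted(label), 2):
        w = weight(u, v)
        same_cluster = label[u] == label[v]
        if w > 0 and not same_cluster:
            cost += w       # an existing edge must be deleted
        elif w < 0 and same_cluster:
            cost += -w      # a missing edge must be inserted
    return cost

# Hypothetical pair weights: similarity score minus a threshold.
pair_weight = {
    ("a", "b"): 3, ("b", "c"): 2, ("a", "c"): -1,
    ("c", "d"): 4, ("a", "d"): -5, ("b", "d"): -5,
}

def weight(u, v):
    return pair_weight[tuple(sorted((u, v)))]

candidates = {
    "abc|d": [["a", "b", "c"], ["d"]],
    "ab|cd": [["a", "b"], ["c", "d"]],
    "abcd": [["a", "b", "c", "d"]],
}
costs = {name: editing_cost(parts, weight) for name, parts in candidates.items()}
best = min(costs, key=costs.get)
print(costs, best)
```

Among these three candidates the split into {a, b} and {c, d} is cheapest, because it only deletes the weak b-c edge. A real solver has to search the space of all partitions, which is where a heuristic becomes necessary on large inputs.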

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Publications at Bielefeld University

On subset seeds for protein alignment

Author: Furletova Eugenia
Gambin Anna
Kucherov Gregory
Lasota Slawomir
Noé Laurent
Roytberg Mikhail A.
Szczurek Ewa
Publication venue: Institute of Electrical and Electronics Engineers (IEEE)
Publication date: 01/01/2009

We apply the concept of subset seeds proposed in [1] to similarity search in protein sequences. The main question studied is the design of efficient seed alphabets to construct seeds with optimal sensitivity/selectivity trade-offs. We propose several different design methods and use them to construct several alphabets. We then perform a comparative analysis of seeds built over those alphabets and compare them with the standard BLASTP seeding method [2], [3], as well as with the family of vector seeds proposed in [4]. While the formalism of subset seeds is less expressive (but less costly to implement) than the cumulative principle used in BLASTP and vector seeds, our seeds show a similar or even better performance than BLASTP on Bernoulli models of proteins compatible with the common BLOSUM62 matrix. Finally, we perform a large-scale benchmarking of our seeds against several main databases of protein alignments. Here again, the results show a comparable or better performance of our seeds vs. BLASTP. Comment: IEEE/ACM Transactions on Computational Biology and Bioinformatics (2009)
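The general idea of a subset seed can be sketched as follows: each seed letter names a set of accepted amino-acid pairs, every letter accepts at least the identical pairs, and a seed hits a gapless alignment position when each aligned pair is accepted by the corresponding letter. The amino-acid groups, the three letter symbols and the example seed below are hypothetical and are not taken from the alphabets designed in the paper.

```python
# Hypothetical grouping of a few amino acids by rough chemical similarity.
GROUPS = [set("ILVM"), set("KR"), set("DE"), set("FYW")]

def same_group(a, b):
    """True for identical residues or residues in the same hypothetical group."""
    return a == b or any(a in g and b in g for g in GROUPS)

# Each seed letter is a predicate over residue pairs; all of them accept identities.
SEED_LETTERS = {
    "#": lambda a, b: a == b,    # identical residues only
    "@": same_group,             # identical or same-group residues
    "_": lambda a, b: True,      # any pair of residues
}

def seed_hits(seed, s, t):
    """Offsets i at which the seed accepts s[i:i+k] aligned without gaps to t[i:i+k]."""
    k = len(seed)
    n = min(len(s), len(t))
    return [
        i
        for i in range(n - k + 1)
        if all(SEED_LETTERS[c](s[i + j], t[i + j]) for j, c in enumerate(seed))
    ]

hits = seed_hits("#@_#", "KIDEL", "KLAEM")
print(hits)
```

Because every position is tested independently against a fixed set of pairs, a hit needs no score accumulation across positions, which is the sense in which such seeds are cheaper to implement than a cumulative scoring scheme.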

arXiv.org e-Print Archive

CiteSeerX

HAL - Lille 3

Crossref

INRIA a CCSD electronic archive server

MPG.PuRe