Accelerated Profile HMM Searches
Profile hidden Markov models (profile HMMs) and probabilistic inference methods have made important contributions to the theory of sequence database homology search. However, practical use of profile HMM methods has been hindered by the computational expense of existing software implementations. Here I describe an acceleration heuristic for profile HMMs, the "multiple segment Viterbi" (MSV) algorithm. The MSV algorithm computes an optimal sum of multiple ungapped local alignment segments using a striped vector-parallel approach previously described for fast Smith/Waterman alignment. MSV scores follow the same statistical distribution as gapped optimal local alignment scores, allowing rapid evaluation of significance of an MSV score and thus facilitating its use as a heuristic filter. I also describe a 20-fold acceleration of the standard profile HMM Forward/Backward algorithms using a method I call "sparse rescaling". These methods are assembled in a pipeline in which high-scoring MSV hits are passed on for reanalysis with the full HMM Forward/Backward algorithm. This accelerated pipeline is implemented in the freely available HMMER3 software package. Performance benchmarks show that the use of the heuristic MSV filter sacrifices negligible sensitivity compared to unaccelerated profile HMM searches. HMMER3 is substantially more sensitive and 100- to 1000-fold faster than HMMER2. HMMER3 is now about as fast as BLAST for protein searches.
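The idea of scoring an ungapped local alignment segment against a profile can be illustrated with a minimal sketch. This is not the MSV algorithm itself: it finds only the single best ungapped segment, uses no vector parallelism, and the profile format (a list of per-position score dictionaries) and the default mismatch score are assumptions made for illustration.

```python
def best_ungapped_segment(profile, seq, default_score=-1.0):
    """Return the best ungapped local segment score of seq against profile.

    profile: list of dicts, one per profile position, mapping residue -> score
             (hypothetical format chosen for this sketch).
    default_score: score used for residues missing from a column (assumption).
    """
    best = 0.0
    # prev[k] holds the best segment score ending at profile position k
    # and at the previous sequence residue.
    prev = [0.0] * (len(profile) + 1)
    for residue in seq:
        curr = [0.0] * (len(profile) + 1)
        for k, column in enumerate(profile, start=1):
            # Extend along the diagonal only (no gaps); restart at zero
            # whenever the running score would become negative.
            score = prev[k - 1] + column.get(residue, default_score)
            curr[k] = max(0.0, score)
            best = max(best, curr[k])
        prev = curr
    return best


if __name__ == "__main__":
    toy_profile = [{"A": 2.0}, {"C": 2.0}, {"D": 2.0}]
    print(best_ungapped_segment(toy_profile, "XACDX"))
```

Because only diagonal moves are allowed, each cell depends on a single neighbor, which is part of what makes ungapped scoring cheap enough to serve as a first-pass filter.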
Systematic identification of gene families for use as markers for phylogenetic and phylogeny-driven ecological studies of bacteria and archaea and their major subgroups
Genomic and metagenomic sequence data sets are accumulating at an astonishing
rate, which gives many reasons to constrain the data analyses. One
approach to such constrained analyses is to focus on select subsets of gene
families that are particularly well suited for the tasks at hand. Such gene
families have generally been referred to as marker genes. We are particularly
interested in identifying and using such marker genes for phylogenetic and
phylogeny-driven ecological studies of microbes and their communities. We
therefore refer to these as PhyEco (for phylogenetic and phylogenetic ecology)
markers. The dual use of these PhyEco markers means that we needed to develop
and apply a set of somewhat novel criteria for identification of the best
candidates for such markers. The criteria we focused on included universality
across the taxa of interest, ability to be used to produce robust phylogenetic
trees that reflect as much as possible the evolution of the species from which
the genes come, and low variation in copy number across taxa. We describe here
an automated protocol for identifying potential PhyEco markers from a set of
complete genome sequences. The protocol combines rapid searching, clustering
and phylogenetic tree building algorithms to generate protein families that
meet the criteria listed above. We report here the identification of PhyEco
markers for different taxonomic levels including 40 for all bacteria and
archaea, 114 for all bacteria, and many more for some of the individual phyla
of bacteria. This new list of PhyEco markers should allow much more detailed
automated phylogenetic and phylogenetic ecology analyses of these groups than
possible previously.
Comment: 24 pages, 3 figures
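Two of the criteria named in the abstract, universality across taxa and low variation in copy number, can be sketched as a simple filter. This is an illustrative sketch only, not the authors' protocol: the input format, the function name, and both thresholds are assumptions, and the tree-based criterion is omitted entirely.

```python
def candidate_markers(copy_counts, genomes,
                      min_universality=0.9, min_single_copy=0.9):
    """Select gene families that are widespread and mostly single-copy.

    copy_counts: dict mapping family -> {genome: copy number}
                 (hypothetical input format).
    genomes: list of all genome identifiers considered.
    Both thresholds are illustrative assumptions.
    """
    selected = []
    for family, counts in copy_counts.items():
        present = [g for g in genomes if counts.get(g, 0) > 0]
        single = [g for g in present if counts[g] == 1]
        # Fraction of genomes that contain the family at all.
        universality = len(present) / len(genomes)
        # Fraction of those genomes that carry exactly one copy.
        single_copy = len(single) / len(present) if present else 0.0
        if universality >= min_universality and single_copy >= min_single_copy:
            selected.append(family)
    return sorted(selected)


if __name__ == "__main__":
    genomes = ["g1", "g2", "g3", "g4"]
    counts = {
        "famA": {"g1": 1, "g2": 1, "g3": 1, "g4": 1},
        "famB": {"g1": 1, "g2": 3, "g3": 2, "g4": 1},
        "famC": {"g1": 1},
    }
    print(candidate_markers(counts, genomes, 0.75, 0.75))
```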
Potential conservation of circadian clock proteins in the phylum Nematoda as revealed by bioinformatic searches
Although several circadian rhythms have been described in C. elegans, its molecular clock remains elusive. In this work we employed a novel bioinformatic approach, applying probabilistic methodologies, to search for circadian clock proteins of several of the best studied circadian model organisms of different taxa (Mus musculus, Drosophila melanogaster, Neurospora crassa, Arabidopsis thaliana and Synechococcus elongatus) in the proteomes of C. elegans and other members of the phylum Nematoda. With this approach we found that the Nematoda contain proteins most related to the core and accessory proteins of the insect and mammalian clocks, which provides new insights into the nematode clock and the evolution of the circadian system.
Fil: Romanowski, Andrés. Consejo Nacional de Investigaciones Científicas y Técnicas. Oficina de Coordinación Administrativa Parque Centenario. Instituto de Investigaciones Bioquímicas de Buenos Aires. Fundación Instituto Leloir. Instituto de Investigaciones Bioquímicas de Buenos Aires; Argentina. Universidad Nacional de Quilmes. Departamento de Ciencia y Tecnología. Laboratorio de Cronobiología; Argentina
Fil: Garavaglia, Matías Javier. Universidad Nacional de Quilmes. Departamento de Ciencia y Tecnología. Laboratorio de Ing.genética y Biolog.molecular y Celular. Area Virus de Insectos; Argentina. Consejo Nacional de Investigaciones Científicas y Técnicas; Argentina
Fil: Goya, María Eugenia. Universidad Nacional de Quilmes. Departamento de Ciencia y Tecnología. Laboratorio de Cronobiología; Argentina. Consejo Nacional de Investigaciones Científicas y Técnicas; Argentina
Fil: Ghiringhelli, Pablo Daniel. Universidad Nacional de Quilmes. Departamento de Ciencia y Tecnología. Laboratorio de Ing.genética y Biolog.molecular y Celular. Area Virus de Insectos; Argentina. Consejo Nacional de Investigaciones Científicas y Técnicas; Argentina
Fil: Golombek, Diego Andres. Universidad Nacional de Quilmes. Departamento de Ciencia y Tecnología. Laboratorio de Cronobiología; Argentina. Consejo Nacional de Investigaciones Científicas y Técnicas; Argentina
MRFalign: Protein Homology Detection through Alignment of Markov Random Fields
Sequence-based protein homology detection has been extensively studied and so
far the most sensitive method is based upon comparison of protein sequence
profiles, which are derived from multiple sequence alignment (MSA) of sequence
homologs in a protein family. A sequence profile is usually represented as a
position-specific scoring matrix (PSSM) or an HMM (Hidden Markov Model) and
accordingly PSSM-PSSM or HMM-HMM comparison is used for homolog detection. This
paper presents a new homology detection method MRFalign, consisting of three
key components: 1) a Markov Random Fields (MRF) representation of a protein
family; 2) a scoring function measuring similarity of two MRFs; and 3) an
efficient ADMM (Alternating Direction Method of Multipliers) algorithm aligning
two MRFs. Compared to HMMs, which can only model very short-range residue
correlation, MRFs can model long-range residue interaction patterns and thus
encode information about the global 3D structure of a protein family.
Consequently, MRF-MRF comparison for remote homology detection should be much
more sensitive than HMM-HMM or PSSM-PSSM comparison. Experiments confirm that
MRFalign outperforms several popular HMM or PSSM-based methods in terms of both
alignment accuracy and remote homology detection and that MRFalign works
particularly well for mainly beta proteins. For example, tested on the
benchmark SCOP40 (8353 proteins) for homology detection, PSSM-PSSM and HMM-HMM
succeed on 48% and 52% of proteins, respectively, at superfamily level, and on
15% and 27% of proteins, respectively, at fold level. In contrast, MRFalign
succeeds on 57.3% and 42.5% of proteins at superfamily and fold level,
respectively. This study implies that long-range residue interaction patterns
are very helpful for sequence-based homology detection. The software is
available for download at http://raptorx.uchicago.edu/download/.
Comment: Accepted by both RECOMB 2014 and PLOS Computational Biology
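For readers unfamiliar with the profile representation mentioned in the abstract, a position-specific scoring matrix can be derived from a multiple sequence alignment roughly as follows. This is a generic textbook-style sketch, not MRFalign's method: the uniform background distribution, the pseudocount value, and the function name are assumptions, and no sequence weighting is applied.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(alignment, alphabet=AMINO_ACIDS, pseudocount=1.0):
    """Build a log-odds PSSM from equal-length aligned sequences.

    Assumes a uniform background distribution and a flat pseudocount
    (both simplifications for illustration). Gap characters and any
    symbol outside the alphabet are ignored.
    """
    length = len(alignment[0])
    background = 1.0 / len(alphabet)
    pssm = []
    for col in range(length):
        counts = Counter(s[col] for s in alignment if s[col] in alphabet)
        total = sum(counts.values()) + pseudocount * len(alphabet)
        # Log-odds of the smoothed column frequency against the background.
        pssm.append({
            a: math.log2(((counts[a] + pseudocount) / total) / background)
            for a in alphabet
        })
    return pssm


if __name__ == "__main__":
    matrix = build_pssm(["AC", "AC", "AD"])
    print(round(matrix[0]["A"], 3), round(matrix[0]["C"], 3))
```

Each column is scored independently of the others, which is the limitation the abstract contrasts with the long-range interaction terms of an MRF.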
Structural evolution drives diversification of the large LRR-RLK gene family
Cells are continuously exposed to chemical signals that they must discriminate between and respond to appropriately. In embryophytes, the leucine-rich repeat receptor-like kinases (LRR-RLKs) are signal receptors critical in development and defense. LRR-RLKs have diversified to hundreds of genes in many plant genomes. Although intensively studied, a well-resolved LRR-RLK gene tree has remained elusive. To resolve the LRR-RLK gene tree, we developed an improved gene discovery method based on iterative hidden Markov model searching and phylogenetic inference. We used this method to infer complete gene trees for each of the LRR-RLK subclades and reconstructed the deepest nodes of the full gene family. We discovered that the LRR-RLK gene family is even larger than previously thought, and that protein domain gains and losses are prevalent. These structural modifications, some of which likely predate embryophyte diversification, led to misclassification of some LRR-RLK variants as members of other gene families. Our work corrects this misclassification. Our results reveal ongoing structural evolution generating novel LRR-RLK genes. These new genes are raw material for the diversification of signaling in development and defense. Our methods also enable phylogenetic reconstruction in any large gene family.
Alignment of helical membrane protein sequences using AlignMe
Few sequence alignment methods have been designed specifically for integral membrane proteins, even though these important proteins have distinct evolutionary and structural properties that might affect their alignments. Existing approaches typically consider membrane-related information either by using membrane-specific substitution matrices or by assigning distinct penalties for gap creation in transmembrane and non-transmembrane regions. Here, we ask whether favoring matching of predicted transmembrane segments within a standard dynamic programming algorithm can improve the accuracy of pairwise membrane protein sequence alignments. We tested various strategies using a specifically designed program called AlignMe. An updated set of homologous membrane protein structures, called HOMEP2, was used as a reference for optimizing the gap penalties. The best of the membrane-protein optimized approaches were then tested on an independent reference set of membrane protein sequence alignments from the BAliBASE collection. When secondary structure (S) matching was combined with evolutionary information (using a position-specific substitution matrix (P)), in an approach we called AlignMePS, the resultant pairwise alignments were typically among the most accurate over a broad range of sequence similarities when compared to available methods. Matching transmembrane predictions (T), in addition to evolutionary information, and secondary-structure predictions, in an approach called AlignMePST, generally reduces the accuracy of the alignments of closely-related proteins in the BAliBASE set relative to AlignMePS, but may be useful in cases of extremely distantly related proteins for which sequence information is less informative. The open source AlignMe code is available at https://sourceforge.net/projects/alignme/, and at http://www.forrestlab.org, along with an online server and the HOMEP2 data set.
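The general idea of favoring matches between predicted transmembrane positions inside a standard dynamic programming alignment can be sketched as below. This is a toy global alignment with a linear gap penalty, not AlignMe's scoring scheme: the match, mismatch, gap, and bonus values, the boolean per-residue prediction format, and the function name are all assumptions made for illustration.

```python
def align_score(seq_a, seq_b, tm_a, tm_b,
                match=1.0, mismatch=-1.0, gap=-2.0, tm_bonus=0.5):
    """Global alignment score with a bonus for pairing predicted TM residues.

    tm_a, tm_b: lists of booleans, True where a residue is predicted to be
                transmembrane (hypothetical input format).
    All scoring parameters are illustrative assumptions.
    """
    n, m = len(seq_a), len(seq_b)
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    # Leading gaps in either sequence are penalized linearly.
    for i in range(1, n + 1):
        dp[i][0] = i * gap
    for j in range(1, m + 1):
        dp[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            # Reward aligning two positions that are both predicted TM.
            if tm_a[i - 1] and tm_b[j - 1]:
                pair += tm_bonus
            dp[i][j] = max(dp[i - 1][j - 1] + pair,  # align the two residues
                           dp[i - 1][j] + gap,       # gap in seq_b
                           dp[i][j - 1] + gap)       # gap in seq_a
    return dp[n][m]


if __name__ == "__main__":
    print(align_score("AC", "AC", [True, True], [True, True]))
    print(align_score("AC", "AC", [False, False], [False, False]))
```

Adding the bonus to the pair score leaves the recursion itself unchanged, so any extra per-position prediction can be combined in the same way.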