Using structure to explore the sequence alignment space of remote homologs

Author: Kuziemko Andrew Stephen
Publication venue: 'Columbia University Libraries/Information Services'
Publication date: 01/01/2011

The success of protein structure modeling by homology requires an accurate sequence alignment between the query sequence and its structural template. However, sequence alignment methods based on dynamic programming (DP) are typically unable to generate accurate alignments for remote sequence homologs, thus limiting the applicability of modeling methods. A central problem is that the alignment that would produce the best structural model is generally not optimal, in the sense of having the highest DP score. Suboptimal alignment methods can be used to generate alternative alignments, but encounter difficulties given the enormous number of alignments that need to be considered. We present here a new suboptimal alignment method that relies heavily on the structure of the template. By initially aligning the query sequence to individual fragments in secondary structure elements (SSEs) and combining high-scoring fragments that pass basic tests for 'modelability', we can generate accurate alignments within a set of limited size. Chapter 1 introduces the field of protein structure prediction in general and the technique of homology modeling in particular. One subproblem of homology modeling -- the sequence to structure alignment of proteins -- is discussed in Chapter 2. Particular attention is given to descriptions of the size, density and redundancy of alignment space as well as an explanation of the dynamic programming technique and its strengths and weaknesses. The rationale for developing alternative alignment techniques and the unique difficulties of these methods are also discussed. Chapter 3 explains the methodologies of S4 -- the alternative alignment program we developed that is the main focus of this thesis. The process of finding alternative alignments with S4 involves several steps, but can be roughly divided into two main parts. First, the program looks for combinations of high-similarity fragments that pass basic rules for modelability. 
These 'fragment alignments' define regions of alignment space that can be searched more thoroughly with a statistical potential for a single representative for that region. The ensemble of alignments that is thus created needs to be evaluated for accuracy against the correct alignment. Current methods for doing so, as well as adjustments to those methods to better suit the realm of remote homology alignments, are discussed in Chapter 4. A novel measure for determining similarity between alignments, termed the inter-alignment distance (IAD), is also developed. This measure can be used to assess quality, but is also well-suited to finding redundant alignments within an ensemble. In Chapter 5, the results of testing S4 on a large set of targets from previous CASP experiments are analyzed. Comparisons to the optimal alignment as well as two standard alternative alignment methods, all of which use the same similarity score as S4, demonstrate that S4's improvement in accuracy is due to better sampling and filtering rather than more sophisticated scoring. Models made from S4 alignments are also shown to significantly improve upon those made from optimal alignments, especially for remote homologs. Finally, an example of a sequence-to-structure alignment offers an in-depth explanation of how S4 finds correct alignments where the other methods do not. Chapter 6 describes a set of three experiments that paired S4 with the model evaluation tool ProsaII in a homology modeling pipeline. There were two primary objectives in this project. First, we wanted to test different methods for finding remote homologs that could serve as input to S4. Second, we evaluated the use of ProsaII as a method for discriminating between good and bad models, and thus also between homologous and non-homologous templates. The first two experiments are essentially blind searches for homologous sequences and structures. 
The third experiment takes remote templates returned by PSI-BLAST and uses S4 and ProsaII to find alignments and determine whether the template is a structural homolog. While S4 was able to find homologs in the blind searches, the alignment/model quality and level of discrimination were found to be higher when the input to the pipeline came from a set of structures produced by a template selection method. Finally, Chapter 7 discusses the consequences of this research and suggests future directions for its application.
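
The abstract above refers to the "optimal" alignment as the one with the highest dynamic programming (DP) score. As a minimal illustration of that idea, the sketch below implements a textbook global alignment (Needleman-Wunsch) with a simple match/mismatch/linear-gap scheme. It is not S4, and the scoring values (`match`, `mismatch`, `gap`) are assumptions chosen for illustration rather than the similarity score used in the thesis.

```python
def needleman_wunsch(query, template, match=1, mismatch=-1, gap=-2):
    """Return (score, aligned_query, aligned_template) for one optimal
    global alignment under a toy match/mismatch/linear-gap scheme."""
    n, m = len(query), len(template)
    # score[i][j] = best score for aligning query[:i] with template[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if query[i - 1] == template[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align the two residues
                score[i - 1][j] + gap,      # gap in the template
                score[i][j - 1] + gap,      # gap in the query
            )
    # Trace back from the bottom-right cell to recover one optimal path.
    aligned_q, aligned_t = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if query[i - 1] == template[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                aligned_q.append(query[i - 1])
                aligned_t.append(template[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            aligned_q.append(query[i - 1])
            aligned_t.append("-")
            i -= 1
        else:
            aligned_q.append("-")
            aligned_t.append(template[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(aligned_q)), "".join(reversed(aligned_t))


if __name__ == "__main__":
    print(needleman_wunsch("ACGT", "AGT"))
```

The traceback returns only one of possibly many paths with the top score. Suboptimal methods such as the one the thesis describes instead explore alignments that score below this maximum.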

MRFalign: Protein Homology Detection through Alignment of Markov Random Fields

Author: Ma Jianzhu
Wang Sheng
Wang Zhiyong
Xu Jinbo
Publication venue: 'Public Library of Science (PLoS)'
Publication date: 01/01/2014

Sequence-based protein homology detection has been extensively studied and so far the most sensitive method is based upon comparison of protein sequence profiles, which are derived from multiple sequence alignment (MSA) of sequence homologs in a protein family. A sequence profile is usually represented as a position-specific scoring matrix (PSSM) or an HMM (Hidden Markov Model) and accordingly PSSM-PSSM or HMM-HMM comparison is used for homolog detection. This paper presents a new homology detection method MRFalign, consisting of three key components: 1) a Markov Random Fields (MRF) representation of a protein family; 2) a scoring function measuring similarity of two MRFs; and 3) an efficient ADMM (Alternating Direction Method of Multipliers) algorithm aligning two MRFs. Compared to HMMs, which can only model very short-range residue correlation, MRFs can model long-range residue interaction patterns and thus encode information for the global 3D structure of a protein family. Consequently, MRF-MRF comparison for remote homology detection should be much more sensitive than HMM-HMM or PSSM-PSSM comparison. Experiments confirm that MRFalign outperforms several popular HMM or PSSM-based methods in terms of both alignment accuracy and remote homology detection and that MRFalign works particularly well for mainly beta proteins. For example, tested on the benchmark SCOP40 (8353 proteins) for homology detection, PSSM-PSSM and HMM-HMM succeed on 48% and 52% of proteins, respectively, at superfamily level, and on 15% and 27% of proteins, respectively, at fold level. In contrast, MRFalign succeeds on 57.3% and 42.5% of proteins at superfamily and fold level, respectively. This study implies that long-range residue interaction patterns are very helpful for sequence-based homology detection. The software is available for download at http://raptorx.uchicago.edu/download/. Comment: Accepted by both RECOMB 2014 and PLOS Computational Biology.
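
The abstract above mentions that a sequence profile is usually represented as a position-specific scoring matrix (PSSM). As a rough sketch of what such a profile contains, the code below builds a log-odds PSSM from a toy gap-free MSA. The uniform background frequencies, the additive pseudocount, and the function names `build_pssm` and `score_sequence` are all assumptions made for illustration. This is not the MRFalign method, and it does not model the long-range couplings that the paper argues MRFs capture.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(msa, alphabet=AMINO_ACIDS, pseudocount=1.0):
    """Build a per-column log-odds matrix from equal-length aligned sequences.

    Assumes a uniform background distribution (an illustrative simplification).
    Returns a list with one dict {residue: log2 odds} per alignment column.
    """
    length = len(msa[0])
    if any(len(seq) != length for seq in msa):
        raise ValueError("all sequences in the MSA must have the same length")
    background = 1.0 / len(alphabet)
    pssm = []
    for col in range(length):
        # Count residues in this column, ignoring gaps and unknown symbols.
        counts = Counter(seq[col] for seq in msa if seq[col] in alphabet)
        total = sum(counts.values()) + pseudocount * len(alphabet)
        column = {
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in alphabet
        }
        pssm.append(column)
    return pssm


def score_sequence(pssm, seq):
    """Sum the per-position log-odds of a gap-free sequence against the PSSM."""
    return sum(column[aa] for column, aa in zip(pssm, seq))


if __name__ == "__main__":
    toy_msa = ["ACD", "ACE", "ACD"]
    profile = build_pssm(toy_msa)
    print(score_sequence(profile, "ACD"), score_sequence(profile, "WWW"))
```

Each column is scored independently here, which is the position-by-position view of a family that profile methods rely on.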

SATCHMO-JS: a webserver for simultaneous protein multiple sequence alignment and phylogenetic tree construction.

Author: Datta Ruchira S
Davidson John R
Hagopian Raffi
Jarvis Glen R
Samad Bushra
Sjölander Kimmen
Publication venue: eScholarship, University of California
Publication date: 29/04/2010

We present the jump-start simultaneous alignment and tree construction using hidden Markov models (SATCHMO-JS) web server for simultaneous estimation of protein multiple sequence alignments (MSAs) and phylogenetic trees. The server takes as input a set of sequences in FASTA format, and outputs a phylogenetic tree and MSA; these can be viewed online or downloaded from the website. SATCHMO-JS is an extension of the SATCHMO algorithm, and employs a divide-and-conquer strategy to jump-start SATCHMO at a higher point in the phylogenetic tree, reducing the computational complexity of the progressive all-versus-all HMM-HMM scoring and alignment. Results on a benchmark dataset of 983 structurally aligned pairs from the PREFAB benchmark dataset show that SATCHMO-JS provides a statistically significant improvement in alignment accuracy over MUSCLE, Multiple Alignment using Fast Fourier Transform (MAFFT), ClustalW and the original SATCHMO algorithm. The SATCHMO-JS webserver is available at http://phylogenomics.berkeley.edu/satchmo-js. The datasets used in these experiments are available for download at http://phylogenomics.berkeley.edu/satchmo-js/supplementary/

Structure of the γ-D-glutamyl-L-diamino acid endopeptidase YkfC from Bacillus cereus in complex with L-Ala-γ-D-Glu: insights into substrate recognition by NlpC/P60 cysteine peptidases.

Dipeptidyl-peptidase VI from Bacillus sphaericus and YkfC from Bacillus subtilis have both previously been characterized as highly specific γ-D-glutamyl-L-diamino acid endopeptidases. The crystal structure of a YkfC ortholog from Bacillus cereus (BcYkfC) at 1.8 Å resolution revealed that it contains two N-terminal bacterial SH3 (SH3b) domains in addition to the C-terminal catalytic NlpC/P60 domain that is ubiquitous in the very large family of cell-wall-related cysteine peptidases. A bound reaction product (L-Ala-γ-D-Glu) enabled the identification of conserved sequence and structural signatures for recognition of L-Ala and γ-D-Glu and, therefore, provides a clear framework for understanding the substrate specificity observed in dipeptidyl-peptidase VI, YkfC and other NlpC/P60 domains in general. The first SH3b domain plays an important role in defining substrate specificity by contributing to the formation of the active site, such that only murein peptides with a free N-terminal alanine are allowed. A conserved tyrosine in the SH3b domain of the YkfC subfamily is correlated with the presence of a conserved acidic residue in the NlpC/P60 domain and both residues interact with the free amine group of the alanine. This structural feature allows the definition of a subfamily of NlpC/P60 enzymes with the same N-terminal substrate requirements, including a previously characterized cyanobacterial L-alanine-γ-D-glutamate endopeptidase that contains the two key components (an NlpC/P60 domain attached to an SH3b domain) for assembly of a YkfC-like active site

The Phyre2 web portal for protein modeling, prediction and analysis

Author: Lawrence A Kelley
Stefans Mezulis
Christopher M Yates
Mark N Wass
Michael J E Sternberg
Publication venue: Springer
Publication date: 01/05/2015

Phyre2 is a suite of tools available on the web to predict and analyze protein structure, function and mutations. The focus of Phyre2 is to provide biologists with a simple and intuitive interface to state-of-the-art protein bioinformatics tools. Phyre2 replaces Phyre, the original version of the server for which we previously published a paper in Nature Protocols. In this updated protocol, we describe Phyre2, which uses advanced remote homology detection methods to build 3D models, predict ligand binding sites and analyze the effect of amino acid variants (e.g., nonsynonymous SNPs (nsSNPs)) for a user's protein sequence. Users are guided through results by a simple interface at a level of detail they determine. This protocol will guide users from submitting a protein sequence to interpreting the secondary and tertiary structure of their models, their domain composition and model quality. A range of additional available tools is described to find a protein structure in a genome, to submit large numbers of sequences at once and to automatically run weekly searches for proteins that are difficult to model. The server is available at http://www.sbg.bio.ic.ac.uk/phyre2. A typical structure prediction will be returned between 30 min and 2 h after submission.

A multi-species functional embedding integrating sequence and network structure

Author: Cannistra Anthony
Crovella Mark
Fan Jason
Fried Inbar
Hescott Benjamin
Leiserson Mark D. M.
Lim Tim
Schaffner Thomas
Publication venue: 'Cold Spring Harbor Laboratory'
Publication date: 01/04/2018

A key challenge to transferring knowledge between species is that different species have fundamentally different genetic architectures. Initial computational approaches to transfer knowledge across species have relied on measures of heredity such as genetic homology, but these approaches suffer from limitations. First, only a small subset of genes have homologs, limiting the amount of knowledge that can be transferred, and second, genes change or repurpose functions, complicating the transfer of knowledge. Many approaches address this problem by expanding the notion of homology by leveraging high-throughput genomic and proteomic measurements, such as through network alignment. In this work, we take a new approach to transferring knowledge across species by expanding the notion of homology through explicit measures of functional similarity between proteins in different species. Specifically, our kernel-based method, HANDL (Homology Assessment across Networks using Diffusion and Landmarks), integrates sequence and network structure to create a functional embedding in which proteins from different species are embedded in the same vector space. We show that inner products in this space and the vectors themselves capture functional similarity across species, and are useful for a variety of functional tasks. We present the first whole-genome method for predicting phenologs, generating many that were previously identified, but also predicting new phenologs supported by the biological literature. We also demonstrate that the HANDL embedding captures pairwise gene function, in that gene pairs with synthetic lethal interactions are significantly separated in HANDL space, and the direction of separation is conserved across species. Software for the HANDL algorithm is available at http://bit.ly/lrgr-handl. Published version.
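
The abstract above states that inner products in a shared vector space capture functional similarity across species. The sketch below shows only that final step, under the assumption that embedding vectors already exist: it ranks candidate proteins by their inner product with a query vector. The vectors, the protein labels, and the function names are hypothetical. The code does not implement HANDL's kernel, diffusion, or landmark construction.

```python
def inner_product(u, v):
    """Plain dot product of two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    return sum(a * b for a, b in zip(u, v))


def rank_by_similarity(query_vec, candidates):
    """Return candidate names sorted by decreasing inner product with query_vec.

    `candidates` maps a (hypothetical) protein name to its embedding vector.
    """
    return sorted(
        candidates,
        key=lambda name: inner_product(query_vec, candidates[name]),
        reverse=True,
    )


if __name__ == "__main__":
    # Made-up 2-D embeddings for three proteins from a second species.
    other_species = {
        "p1": [1.0, 0.0],
        "p2": [0.0, 1.0],
        "p3": [0.5, 0.5],
    }
    print(rank_by_similarity([1.0, 0.2], other_species))
```

An unnormalized inner product also reflects vector length. Whether to normalize (for example, cosine similarity) is a design choice that this sketch leaves open.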

Structural basis of severe acute respiratory syndrome coronavirus ADP-ribose-1''-phosphate dephosphorylation by a conserved domain of nsP3.

Author: Buchmeier Michael J
Clayton Tom
Griffith Mark
Joseph Jeremiah S
Kuhn Peter
Moy Kin
Neuman Benjamin W
Saikatendu Kumar Singh
Stevens Raymond C
Subramanian Vanitha
Velasquez Jeffrey
Publication venue: eScholarship, University of California
Publication date: 01/11/2005

The crystal structure of a conserved domain of nonstructural protein 3 (nsP3) from severe acute respiratory syndrome coronavirus (SARS-CoV) has been solved by single-wavelength anomalous dispersion to 1.4 Å resolution. The structure of this "X" domain, seen in many single-stranded RNA viruses, reveals a three-layered alpha/beta/alpha core with a macro-H2A-like fold. The putative active site is a solvent-exposed cleft that is conserved in its three structural homologs, yeast Ymx7, Archaeoglobus fulgidus AF1521, and Er58 from E. coli. Its sequence is similar to yeast YBR022W (also known as Poa1P), a known phosphatase that acts on ADP-ribose-1''-phosphate (Appr-1''-p). The SARS nsP3 domain readily removes the 1'' phosphate group from Appr-1''-p in in vitro assays, confirming its phosphatase activity. Sequence and structure comparison of all known macro-H2A domains combined with available functional data suggests that proteins of this superfamily form an emerging group of nucleotide phosphatases that dephosphorylate Appr-1''-p.

Exploration of Uncharted Regions of the Protein Universe

Determination of the first protein structures from hundreds of families of unknown function has shown that divergence, rather than novelty, is the dominant force that shapes the evolution of the protein universe.