Probabilistic protein homology modeling
Searching sequence databases and building 3D models for proteins are important tasks
for biologists. When the structure of a query protein is known, its function can be inferred. However, experimental methods for structure determination are both expensive and
time-consuming. Fully automatic homology modeling refers to building a 3D model for
a query sequence from an alignment to related homologous proteins with known structure (templates) by a computer. Current prediction servers can provide accurate models
within a few hours to days. Our group has developed HHpred, which is one of the top
performing structure prediction servers in the field.
In general, homology based structure modeling consists of four steps: (1) finding homologous templates in a database, (2) selecting and (3) aligning templates to the query, (4)
building a 3D model based on the alignment.
In part one of this thesis, we present improvements to steps (2) and (4). Specifically,
homology modeling has been shown to work best when multiple templates are selected
instead of only a single one. Yet, current servers are using rather ad-hoc approaches to
combine information from multiple templates. We provide a rigorous statistical framework for multi-template homology modeling. Given an alignment, we employ Modeller to calculate the most probable structure for a query. The 3D model is obtained
by optimally satisfying spatial restraints derived from the alignment and expressed as
probability density functions. We find that the query’s atomic distance restraints can
be accurately described by two-component Gaussian mixtures. Moreover, we derive statistical weights to quantify the redundancy among related templates. This allows us to
apply the standard rules of probability theory to combine restraints from several templates. Together with a heuristic template selection strategy, we have implemented this
approach within HHpred and could significantly improve model quality. Furthermore,
we took part in CASP, a community wide competition for structure prediction, where
we were ranked first in template based modeling and, at the same time, were more than
450 times faster than all other top servers.
Homology modeling heavily relies on detecting and correctly aligning templates to the
query sequence (steps (1) and (3) from above). But remote homologies are difficult to
detect and hard to align on a pure sequence level. Hence, modern tools are based on
profiles instead of sequences. A profile summarizes the evolutionary history of a given
sequence and consists of position specific amino acid probabilities for each residue. In
addition to the similarity score between profile columns, most methods use extra terms
that compare 1D structural properties such as secondary structure or solvent accessibility. These can be predicted from local profile windows.
In the second part of this thesis, we develop a new score that is independent of any predefined structural property. For this purpose, we learn a library of 32 profile patterns that
are most conserved in alignments of remotely homologous, structurally aligned proteins.
Each so-called “context state” in the library consists of a 13-residue sequence profile.
We integrate the new context score into our HMM-HMM alignment tool HHsearch and
significantly improve sensitivity and precision, especially for difficult pairwise alignments.
Taken together, we introduced probabilistic methods to improve all four main steps in
homology based structure prediction.
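The restraint model described in this abstract can be illustrated with a minimal sketch. It assumes that each template contributes a two-component Gaussian mixture over one query distance and that templates are combined as a redundancy-weighted sum of negative log densities. The function names, the parameter layout, and the combination rule are illustrative assumptions, not the formulation used in HHpred or Modeller.

```python
import math

def gaussian_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def mixture_pdf(d, components):
    """Mixture density at distance d; components = [(weight, mean, sd), ...]."""
    return sum(w * gaussian_pdf(d, mean, sd) for w, mean, sd in components)

def restraint_energy(d, components):
    """Negative log density; the tiny floor avoids log(0) far from the mean."""
    return -math.log(max(mixture_pdf(d, components), 1e-300))

def combined_energy(d, templates):
    """Assumed combination rule: redundancy-weighted sum over templates.

    templates = [(redundancy_weight, components), ...]
    """
    return sum(rw * restraint_energy(d, comps) for rw, comps in templates)

# Hypothetical numbers: a narrow and a wide component around 10 Angstrom,
# plus a second, down-weighted template suggesting 12 Angstrom.
template_a = [(0.7, 10.0, 0.5), (0.3, 10.0, 2.0)]
template_b = [(1.0, 12.0, 1.0)]
templates = [(1.0, template_a), (0.5, template_b)]
for d in (9.0, 10.0, 11.0, 12.0):
    print(d, round(combined_energy(d, templates), 3))
```

Minimizing `combined_energy` over the distance corresponds to finding the most probable value under this simplified model. A real modeling run optimizes many such restraints over all atomic coordinates at once.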
Recognition of short functional motifs in protein sequences
The main goal of this study was to develop a method for computational de novo prediction of short linear motifs (SLiMs) in protein sequences that would offer users advantages over existing solutions. The users are typically biological laboratory researchers who want to elucidate a protein function that may be mediated by a short motif. Such functions include subcellular localization, secretion, post-translational modification, and degradation of proteins. Conducting such studies with experimental techniques alone is often associated with high costs and uncertain outcomes. Preliminary prediction of putative motifs with computational methods, which are fast and much less expensive, makes it possible to generate hypotheses and therefore to plan experiments in a more directed and efficient way. To meet this goal, I have developed HH-MOTiF, a web-based tool for de novo discovery of SLiMs in a set of protein sequences.
While working on the project, I have also detected patterns in sequence properties of certain SLiMs that make their de novo prediction easier. As some of these patterns are not yet described in the literature, I am sharing them in this thesis.
While evaluating and comparing motif prediction results, I identified conceptual gaps in both theoretical studies and existing practical solutions for comparing two sets of positional data annotating the same set of biological sequences. To close these gaps and to be able to carry out in-depth performance analyses of HH-MOTiF in comparison to other predictors, I have developed a corresponding statistical method, SLALOM (for StatisticaL Analysis of Locus Overlap Method). It is currently available as a standalone command line tool.
High Resolution Models of Transcription Factor-DNA Affinities Improve In Vitro and In Vivo Binding Predictions
Accurately modeling the DNA sequence preferences of transcription factors (TFs), and using these models to predict in vivo genomic binding sites for TFs, are key pieces in deciphering the regulatory code. These efforts have been frustrated by the limited availability and accuracy of TF binding site motifs, usually represented as position-specific scoring matrices (PSSMs), which may match large numbers of sites and produce an unreliable list of target genes. Recently, protein binding microarray (PBM) experiments have emerged as a new source of high resolution data on in vitro TF binding specificities. PBM data has been analyzed either by estimating PSSMs or via rank statistics on probe intensities, so that individual sequence patterns are assigned enrichment scores (E-scores). This representation is informative but unwieldy because every TF is assigned a list of thousands of scored sequence patterns. Meanwhile, high-resolution in vivo TF occupancy data from ChIP-seq experiments is also increasingly available. We have developed a flexible discriminative framework for learning TF binding preferences from high resolution in vitro and in vivo data. We first trained support vector regression (SVR) models on PBM data to learn the mapping from probe sequences to binding intensities. We used a novel k-mer based string kernel called the di-mismatch kernel to represent probe sequence similarities. The SVR models are more compact than E-scores, more expressive than PSSMs, and can be readily used to scan genomic regions to predict in vivo occupancy. Using a large data set of yeast and mouse TFs, we found that our SVR models can better predict probe intensity than the E-score method or PBM-derived PSSMs. Moreover, by using SVRs to score yeast, mouse, and human genomic regions, we were better able to predict genomic occupancy as measured by ChIP-chip and ChIP-seq experiments.
Finally, we found that by training kernel-based models directly on ChIP-seq data, we greatly improved in vivo occupancy prediction, and by comparing a TF's in vitro and in vivo models, we could identify cofactors and disambiguate direct and indirect binding.
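To illustrate how a string kernel compares probe sequences, here is a minimal sketch of the plain k-mer spectrum kernel. This is a deliberately simpler relative of the di-mismatch kernel named in the abstract, which is not reproduced here. The function names are hypothetical, and the SVR training step is omitted.

```python
from collections import Counter

def kmer_counts(seq, k):
    """Count every overlapping substring of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def spectrum_kernel(a, b, k):
    """Inner product of the k-mer count vectors of sequences a and b."""
    counts_a, counts_b = kmer_counts(a, k), kmer_counts(b, k)
    return sum(n * counts_b[kmer] for kmer, n in counts_a.items())

# Two hypothetical probes sharing the 3-mers "ACG" and "CGT".
print(spectrum_kernel("TTACGTAA", "GGACGTCC", 3))
```

A kernel value like this could be supplied to any kernel-based regressor as a precomputed similarity. Mismatch-tolerant kernels extend the idea by also crediting k-mers that differ in a few positions.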
Accelerated Profile HMM Searches
Profile hidden Markov models (profile HMMs) and probabilistic inference methods have made important contributions to the theory of sequence database homology search. However, practical use of profile HMM methods has been hindered by the computational expense of existing software implementations. Here I describe an acceleration heuristic for profile HMMs, the “multiple segment Viterbi” (MSV) algorithm. The MSV algorithm computes an optimal sum of multiple ungapped local alignment segments using a striped vector-parallel approach previously described for fast Smith/Waterman alignment. MSV scores follow the same statistical distribution as gapped optimal local alignment scores, allowing rapid evaluation of significance of an MSV score and thus facilitating its use as a heuristic filter. I also describe a 20-fold acceleration of the standard profile HMM Forward/Backward algorithms using a method I call “sparse rescaling”. These methods are assembled in a pipeline in which high-scoring MSV hits are passed on for reanalysis with the full HMM Forward/Backward algorithm. This accelerated pipeline is implemented in the freely available HMMER3 software package. Performance benchmarks show that the use of the heuristic MSV filter sacrifices negligible sensitivity compared to unaccelerated profile HMM searches. HMMER3 is substantially more sensitive and 100- to 1000-fold faster than HMMER2. HMMER3 is now about as fast as BLAST for protein searches.
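The core idea of scoring ungapped local segments against a profile can be sketched as follows. This is a simplified single-segment variant on plain additive scores. It is not the multi-segment, striped vector-parallel MSV implementation in HMMER3, and the profile layout and the default mismatch score of -1.0 are assumptions made for illustration.

```python
def best_ungapped_segment(profile, seq, default=-1.0):
    """Best-scoring ungapped local segment between a profile and a sequence.

    profile: list of dicts, one per model position, mapping residue -> score.
    Residues missing from a column receive the assumed default score.
    """
    best = 0.0
    prev = [0.0] * (len(seq) + 1)  # scores of segments ending in the previous column
    for column in profile:
        cur = [0.0] * (len(seq) + 1)
        for j, residue in enumerate(seq, start=1):
            # Extend the diagonal segment, or restart if its score fell below zero.
            cur[j] = max(0.0, prev[j - 1]) + column.get(residue, default)
            best = max(best, cur[j])
        prev = cur
    return best

# Hypothetical three-position profile that prefers A, then C, then G.
profile = [{"A": 2.0}, {"C": 2.0}, {"G": 2.0}]
print(best_ungapped_segment(profile, "TTACGTT"))
```

Because no gaps are allowed, each cell depends only on its diagonal predecessor. That dependency pattern is what makes this family of filters well suited to vector-parallel evaluation.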
Comparative analyses of aryl hydrocarbon receptor structure and function in marine mammals
Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy at the Massachusetts Institute of Technology and the Woods Hole Oceanographic Institution February 2007
Marine mammals possess high body burdens of persistent organic pollutants,
including PCBs and dioxin-like compounds (DLC). Chronic environmental or
dietary exposure to these chemicals can disrupt the function of reproductive and
immune systems, as well as cause developmental defects in laboratory animals.
The aryl hydrocarbon receptor (AHR) is a ligand-activated transcription factor,
mediating the expression of a suite of genes in response to exposure to DLC and
structurally related chemicals. Species-specific differences in AHR structure can
affect an organism’s susceptibility to the effects of DLC. The structures and
functions of several cetacean AHRs were investigated using in vitro molecular
cloning and biochemical techniques. Using a novel combination of remote
biopsy and molecular cloning methods, RNA was extracted from small
integument samples from living North Atlantic right whales to identify the cDNA
sequence for AHR and other genes of physiological importance. Biopsy-derived
RNA was found to be of higher quality than RNA extracted from stranded
cetaceans, and proved a good source for identifying cDNA sequences for
expressed genes. The molecular sequences, binding constants, and
transcriptional activities for North Atlantic right whale and humpback whale AHR
cDNAs were determined using in vitro and cell culture methods. Whale AHRs
are capable of specifically binding dioxin and initiating transcription of reporter
genes. The properties of these AHRs were compared with those from other
mammalian species, including human, mouse, hamster, and guinea pig, and
other novel marine mammal AHRs, using biochemical, phylogenetic, and
homology modeling analyses. The relative binding affinities for some marine
mammal AHRs fall between those for the high-affinity mouse AHRb-1 and the
lower affinity human AHR. Species-specific variability in two regions of the AHR
ligand binding domain was identified as having the greatest potential impact on
AHR tertiary structure, yet it does not sufficiently explain differences observed in
ligand binding assays. Additional studies are necessary to link exposure to
environmental contaminants with potential reproductive effects in marine
mammals, especially via interactions with steroid hormone receptor pathways.
NOAA National Sea Grant College Program, Grant No. NA16RG2273, Grant No. NA86RG0075, NOAA Right Whale Grants Program, Grant No. NA03NMF4720475, American Association of University Women, American Dissertation Fellowshi
Computational Approaches to Drug Profiling and Drug-Protein Interactions
Despite substantial increases in R&D spending within the pharmaceutical industry, de novo drug design has become a time-consuming endeavour. High attrition rates led to a
long period of stagnation in drug approvals. Due to the extreme costs associated with
introducing a drug to the market, locating and understanding the reasons for clinical failure
is key to future productivity. As part of this PhD, three main contributions were made in
this respect. First, the web platform LigNFam enables users to interactively explore
similarity relationships between ‘drug like’ molecules and the proteins they bind. Secondly,
two deep-learning-based binding site comparison tools were developed, competing with
the state-of-the-art over benchmark datasets. The models have the ability to predict off-target interactions and potential candidates for target-based drug repurposing. Finally, the
open-source ScaffoldGraph software was presented for the analysis of hierarchical scaffold
relationships and has already been used in multiple projects, including integration into a
virtual screening pipeline to increase the tractability of ultra-large screening experiments.
Together, and with existing tools, the contributions made will aid in the understanding of
drug-protein relationships, particularly in the fields of off-target prediction and drug
repurposing, helping to design better drugs faster.
Isolation and Genomic Analysis of the Cetacean Y-chromosome
The male-specific mammalian Y-chromosome represents a powerful tool for studying male-mediated
gene flow and genome evolution. Here it was possible to identify 7 polymorphic
microsatellites for the first time in an odontocete species, using a combination of cell culture,
cytogenetics and molecular approaches. Initially, the development of an efficient and
repeatable methodology for obtaining a growing lymphocyte culture that facilitated the
isolation of individual chromosomes is described. Flow karyotypic characterization and
isolation of individual chromosomes via flow sorting or microdissection is reported for the killer
whale (Orcinus orca). Microdissected Y-chromosomes from the killer whale and bottlenose
dolphin (Tursiops truncatus) were screened for sequences containing microsatellite motifs. 15
and 10 male-specific microsatellites were identified from the killer whale and bottlenose
dolphin, respectively. Additional microsatellite loci were identified from previously published
fin whale Y-chromosome sequence. Six markers designed from heterologous sequences
amplified from sperm whales (Physeter macrocephalus) were also screened for variation. All
31 markers were monomorphic in the bottlenose dolphin; only 2 loci showed 2 variants in the
killer whale, and 7 were polymorphic in the sperm whale. In addition, 162 anonymous regions of
the Y-chromosome, isolated from the delphinid species, were used to characterize the
comparative composition of the ‘Y’ relative to the autosomes in these species. Characteristics
are discussed in the context of the genome as a whole, species-specific history and with
reference to the expected patterns of mammalian Y-chromosome evolution.
A MOLECULAR APPROACH TO CALANUS (COPEPODA:CALANOIDA) DEVELOPMENT AND SYSTEMATICS
Production and recruitment measurements in marine copepods of the genus
Calanus have been addressed via the study of genes involved in early embryogenesis. The
first sequence from a Calanus helgolandicus (C. helgolandicus) developmental gene (Cal-Antp)
has been cloned by screening a C. helgolandicus genomic library with a homologous
Calanus homeobox probe. Sequencing of an isolated and sub-cloned fragment of this
gene, plus further analysis by Inverse Polymerase Chain Reaction (IVPCR), has shown it
to be homologous with other Antennapedia homeobox genes. The temporal expression of
Cal-Antp was analysed through its messenger RNA (mRNA) complement by Reverse
Transcription Polymerase Chain Reaction (RT-PCR). The gene was expressed in tissue
taken from eggs over 18 hours old, and in nauplii and copepodite stages, but no expression
was detected in eggs less than 18 hours old or adult tissue. Three further homeobox-containing
genes have been identified and analysed through their expression in C.
helgolandicus eggs. Two of these are caudal homologues, and the third is homologous to
the Antennapedia class of genes. The C. helgolandicus developmental gene sequence data
provides a means of developing probes to monitor the temporal expression of such genes
and their responses to environmental influence. The applicability of such probes to the
investigation of key production and recruitment processes, including egg viability
measurement, is discussed.
A relatively simple and cost effective method has been developed to identify the
four Calanus species common to the North Atlantic. This system involves the PCR
amplification of a region of the mitochondrial rRNA gene without prior purification of the
DNA, followed by Restriction Fragment Length Polymorphism (RFLP) analysis of the
amplified product. The versatility of the method is demonstrated by the unambiguous
identification to species of any life stage, from egg to adult, and of any individual body
parts. The molecular identification technique has for the first time shown the unexpected
presence of three different Calanus species in Lurefjorden, Norway and has proved to be
consistently accurate for all individuals tested including geographically distinct conspecific
populations.
Plymouth Marine Laborator
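The PCR-RFLP identification step described above rests on a simple computation: cutting an amplified product at each restriction site and comparing the resulting fragment lengths. A minimal sketch follows. It assumes a linear amplicon and a palindromic recognition site searched on one strand only. The sequence, the enzyme site, and the cut offset below are hypothetical and are not taken from the study.

```python
def digest(seq, site, cut_offset):
    """Fragment lengths after cutting seq at every occurrence of site.

    cut_offset is the number of bases into the site at which the cut falls.
    """
    cuts = []
    start = seq.find(site)
    while start != -1:
        cuts.append(start + cut_offset)
        start = seq.find(site, start + 1)
    bounds = [0] + cuts + [len(seq)]
    # Skip zero-length pieces that arise when a cut falls at either end.
    return [b - a for a, b in zip(bounds, bounds[1:]) if b > a]

# Hypothetical amplicon with two GAATTC sites, each cut after the first base.
amplicon = "AAGAATTCTTGAATTCAA"
print(sorted(digest(amplicon, "GAATTC", 1), reverse=True))
```

Species that differ in the number or position of sites give different sorted fragment patterns. On a gel, that difference is what allows individuals to be assigned to species.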