942 research outputs found

    Probabilistic protein homology modeling

    Searching sequence databases and building 3D models for proteins are important tasks for biologists. When the structure of a query protein is known, its function can be inferred. However, experimental methods for structure determination are both expensive and time consuming. Fully automatic homology modeling refers to the computational building of a 3D model for a query sequence from an alignment to homologous proteins of known structure (templates). Current prediction servers can provide accurate models within a few hours to days. Our group has developed HHpred, which is one of the top performing structure prediction servers in the field. In general, homology based structure modeling consists of four steps: (1) finding homologous templates in a database, (2) selecting templates, (3) aligning them to the query, and (4) building a 3D model based on the alignment. In part one of this thesis, we present improvements to steps (2) and (4). Specifically, homology modeling has been shown to work best when multiple templates are selected instead of only a single one. Yet current servers use rather ad-hoc approaches to combine information from multiple templates. We provide a rigorous statistical framework for multi-template homology modeling. Given an alignment, we employ Modeller to calculate the most probable structure for a query. The 3D model is obtained by optimally satisfying spatial restraints derived from the alignment and expressed as probability density functions. We find that the query’s atomic distance restraints can be accurately described by two-component Gaussian mixtures. Moreover, we derive statistical weights to quantify the redundancy among related templates. This allows us to apply the standard rules of probability theory to combine restraints from several templates. Together with a heuristic template selection strategy, we have implemented this approach within HHpred and significantly improved model quality.
Furthermore, we took part in CASP, a community wide competition for structure prediction, where we were ranked first in template based modeling and, at the same time, were more than 450 times faster than all other top servers. Homology modeling relies heavily on detecting templates and correctly aligning them to the query sequence (steps (1) and (3) above). But remote homologies are difficult to detect and hard to align at the pure sequence level. Hence, modern tools are based on profiles instead of sequences. A profile summarizes the evolutionary history of a given sequence and consists of position specific amino acid probabilities for each residue. In addition to the similarity score between profile columns, most methods use extra terms that compare 1D structural properties such as secondary structure or solvent accessibility. These can be predicted from local profile windows. In the second part of this thesis, we develop a new score that is independent of any predefined structural property. For this purpose, we learn a library of 32 profile patterns that are most conserved in alignments of remotely homologous, structurally aligned proteins. Each so-called “context state” in the library consists of a 13-residue sequence profile. We integrate the new context score into our HMM-HMM alignment tool HHsearch and significantly improve sensitivity and precision, especially for difficult pairwise alignments. Taken together, we introduced probabilistic methods to improve all four main steps in homology based structure prediction.
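
The idea of expressing distance restraints as two-component Gaussian mixtures and combining them across weighted templates can be illustrated with a minimal sketch. This is not the thesis's actual formulation: the mixture parameters (`narrow_sd`, `wide_sd`, `narrow_weight`), the weighted sum of log densities used to combine templates, and the grid search are all assumptions chosen for illustration.

```python
import math


def gaussian(x, mean, sd):
    """Density of a normal distribution at x."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))


def mixture_density(d, template_distance, narrow_sd=0.5, wide_sd=3.0, narrow_weight=0.8):
    """Two-component Gaussian mixture centred on the distance seen in one template.

    All parameter values are hypothetical; the narrow component models a
    well-conserved distance, the wide one a background of larger deviations.
    """
    return (narrow_weight * gaussian(d, template_distance, narrow_sd)
            + (1.0 - narrow_weight) * gaussian(d, template_distance, wide_sd))


def combined_log_density(d, template_distances, template_weights):
    """Assumed combination rule: weighted sum of per-template log densities."""
    return sum(w * math.log(mixture_density(d, t))
               for t, w in zip(template_distances, template_weights))


def most_probable_distance(template_distances, template_weights,
                           lo=0.0, hi=30.0, step=0.01):
    """Grid search for the distance that maximises the combined log density."""
    n = int(round((hi - lo) / step))
    grid = [lo + i * step for i in range(n + 1)]
    return max(grid, key=lambda d: combined_log_density(d, template_distances,
                                                        template_weights))


if __name__ == "__main__":
    # Two templates agree on about 8 Angstrom, a third suggests 12 Angstrom.
    print(most_probable_distance([8.0, 8.0, 12.0], [1.0, 1.0, 1.0]))
```

Lowering the weights of the two agreeing templates would mimic down-weighting redundant templates, so that near-identical structures do not count as independent evidence.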

    Recognition of short functional motifs in protein sequences

    The main goal of this study was to develop a method for computational de novo prediction of short linear motifs (SLiMs) in protein sequences that offers users advantages over existing solutions. The users are typically laboratory biologists who want to elucidate a protein function that may be mediated by a short motif. Such functions include subcellular localization, secretion, post-translational modification and degradation of proteins. Conducting such studies with experimental techniques alone is often associated with high costs and a risk of inconclusive results. Preliminary prediction of putative motifs with computational methods, which are fast and much less expensive, makes it possible to generate hypotheses and therefore to plan experiments in a more directed and efficient way. To meet this goal, I have developed HH-MOTiF, a web-based tool for de novo discovery of SLiMs in a set of protein sequences. While working on the project, I have also detected patterns in the sequence properties of certain SLiMs that make their de novo prediction easier. As some of these patterns are not yet described in the literature, I am sharing them in this thesis. While evaluating and comparing motif prediction results, I identified conceptual gaps in theoretical studies, as well as in existing practical solutions, for comparing two sets of positional data annotating the same set of biological sequences. To close these gaps and to be able to carry out in-depth performance analyses of HH-MOTiF in comparison to other predictors, I have developed a corresponding statistical method, SLALOM (for StatisticaL Analysis of Locus Overlap Method). It is currently available as a standalone command line tool.

    High Resolution Models of Transcription Factor-DNA Affinities Improve In Vitro and In Vivo Binding Predictions

    Accurately modeling the DNA sequence preferences of transcription factors (TFs), and using these models to predict in vivo genomic binding sites for TFs, are key pieces in deciphering the regulatory code. These efforts have been frustrated by the limited availability and accuracy of TF binding site motifs, usually represented as position-specific scoring matrices (PSSMs), which may match large numbers of sites and produce an unreliable list of target genes. Recently, protein binding microarray (PBM) experiments have emerged as a new source of high resolution data on in vitro TF binding specificities. PBM data has been analyzed either by estimating PSSMs or via rank statistics on probe intensities, so that individual sequence patterns are assigned enrichment scores (E-scores). This representation is informative but unwieldy because every TF is assigned a list of thousands of scored sequence patterns. Meanwhile, high-resolution in vivo TF occupancy data from ChIP-seq experiments is also increasingly available. We have developed a flexible discriminative framework for learning TF binding preferences from high resolution in vitro and in vivo data. We first trained support vector regression (SVR) models on PBM data to learn the mapping from probe sequences to binding intensities. We used a novel k-mer based string kernel called the di-mismatch kernel to represent probe sequence similarities. The SVR models are more compact than E-scores, more expressive than PSSMs, and can be readily used to scan genomic regions to predict in vivo occupancy. Using a large data set of yeast and mouse TFs, we found that our SVR models can better predict probe intensity than the E-score method or PBM-derived PSSMs. Moreover, by using SVRs to score yeast, mouse, and human genomic regions, we were better able to predict genomic occupancy as measured by ChIP-chip and ChIP-seq experiments.
Finally, we found that by training kernel-based models directly on ChIP-seq data, we greatly improved in vivo occupancy prediction, and by comparing a TF's in vitro and in vivo models, we could identify cofactors and disambiguate direct and indirect binding.
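
To make the notion of a k-mer based string kernel concrete, the sketch below implements the plain spectrum kernel, which counts shared k-mers between two sequences. It is a simpler stand-in and not the di-mismatch kernel described in the abstract; the function names and the default `k=3` are assumptions for illustration only.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every overlapping k-mer in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def spectrum_kernel(a, b, k=3):
    """Inner product of the k-mer count vectors of two sequences.

    This is the standard spectrum kernel; it rewards exact shared k-mers only
    and does not tolerate mismatches.
    """
    counts_a = kmer_counts(a, k)
    counts_b = kmer_counts(b, k)
    return sum(n * counts_b[kmer] for kmer, n in counts_a.items())


if __name__ == "__main__":
    print(spectrum_kernel("ACGTACGT", "ACGTACGT"))
    print(spectrum_kernel("ACGT", "CGTA"))
```

A kernel of this kind can be passed to a support vector regression implementation as a precomputed similarity matrix between probe sequences.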

    Accelerated Profile HMM Searches

    Profile hidden Markov models (profile HMMs) and probabilistic inference methods have made important contributions to the theory of sequence database homology search. However, practical use of profile HMM methods has been hindered by the computational expense of existing software implementations. Here I describe an acceleration heuristic for profile HMMs, the “multiple segment Viterbi” (MSV) algorithm. The MSV algorithm computes an optimal sum of multiple ungapped local alignment segments using a striped vector-parallel approach previously described for fast Smith/Waterman alignment. MSV scores follow the same statistical distribution as gapped optimal local alignment scores, allowing rapid evaluation of significance of an MSV score and thus facilitating its use as a heuristic filter. I also describe a 20-fold acceleration of the standard profile HMM Forward/Backward algorithms using a method I call “sparse rescaling”. These methods are assembled in a pipeline in which high-scoring MSV hits are passed on for reanalysis with the full HMM Forward/Backward algorithm. This accelerated pipeline is implemented in the freely available HMMER3 software package. Performance benchmarks show that the use of the heuristic MSV filter sacrifices negligible sensitivity compared to unaccelerated profile HMM searches. HMMER3 is substantially more sensitive and 100- to 1000-fold faster than HMMER2. HMMER3 is now about as fast as BLAST for protein searches.
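
The building block behind MSV, an optimal ungapped local alignment segment, can be sketched for two plain sequences. This is not the MSV algorithm itself, which scores a sequence against a profile HMM, sums multiple segments and uses striped SIMD vectors; the match and mismatch scores below are arbitrary assumptions.

```python
def best_ungapped_segment(query, target, match=2, mismatch=-1):
    """Score of the best single ungapped local alignment segment.

    Each cell extends the segment ending on the previous diagonal cell, or
    restarts at zero if the running score would drop below zero. No gaps are
    allowed, so only the diagonal predecessor is considered.
    """
    best = 0
    prev = [0] * (len(target) + 1)
    for i in range(1, len(query) + 1):
        cur = [0] * (len(target) + 1)
        for j in range(1, len(target) + 1):
            step = match if query[i - 1] == target[j - 1] else mismatch
            cur[j] = max(0, prev[j - 1] + step)
            best = max(best, cur[j])
        prev = cur
    return best


if __name__ == "__main__":
    print(best_ungapped_segment("ACGTT", "ACCTT"))
```

Because each cell depends only on its diagonal predecessor, the inner loop has no dependency along a row, which is the property that makes this kind of recurrence attractive for vector-parallel evaluation.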

    Comparative analyses of aryl hydrocarbon receptor structure and function in marine mammals

    Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy at the Massachusetts Institute of Technology and the Woods Hole Oceanographic Institution, February 2007. Marine mammals possess high body burdens of persistent organic pollutants, including PCBs and dioxin-like compounds (DLC). Chronic environmental or dietary exposure to these chemicals can disrupt the function of reproductive and immune systems, as well as cause developmental defects in laboratory animals. The aryl hydrocarbon receptor (AHR) is a ligand-activated transcription factor, mediating the expression of a suite of genes in response to exposure to DLC and structurally related chemicals. Species-specific differences in AHR structure can affect an organism’s susceptibility to the effects of DLC. The structures and functions of several cetacean AHRs were investigated using in vitro molecular cloning and biochemical techniques. Using a novel combination of remote biopsy and molecular cloning methods, RNA was extracted from small integument samples from living North Atlantic right whales to identify the cDNA sequence for AHR and other genes of physiological importance. Biopsy-derived RNA was found to be of higher quality than RNA extracted from stranded cetaceans, and proved a good source for identifying cDNA sequences for expressed genes. The molecular sequences, binding constants, and transcriptional activities for North Atlantic right whale and humpback whale AHR cDNAs were determined using in vitro and cell culture methods. Whale AHRs are capable of specifically binding dioxin and initiating transcription of reporter genes. The properties of these AHRs were compared with those from other mammalian species, including human, mouse, hamster, and guinea pig, and other novel marine mammal AHRs, using biochemical, phylogenetic, and homology modeling analyses.
The relative binding affinities for some marine mammal AHRs fall between those for the high-affinity mouse AHRb-1 and the lower affinity human AHR. Species-specific variability in two regions of the AHR ligand binding domain was identified as having the greatest potential impact on AHR tertiary structure, yet it does not sufficiently explain the differences observed in ligand binding assays. Additional studies are necessary to link exposure to environmental contaminants with potential reproductive effects in marine mammals, especially via interactions with steroid hormone receptor pathways. NOAA National Sea Grant College Program, Grant No. NA16RG2273, Grant No. NA86RG0075, NOAA Right Whale Grants Program, Grant No. NA03NMF4720475, American Association of University Women, American Dissertation Fellowshi

    Computational Approaches to Drug Profiling and Drug-Protein Interactions

    Despite substantial increases in R&D spending within the pharmaceutical industry, de novo drug design has become a time-consuming endeavour. High attrition rates led to a long period of stagnation in drug approvals. Due to the extreme costs associated with introducing a drug to the market, locating and understanding the reasons for clinical failure is key to future productivity. As part of this PhD, three main contributions were made in this respect. First, the web platform LigNFam enables users to interactively explore similarity relationships between ‘drug like’ molecules and the proteins they bind. Second, two deep-learning-based binding site comparison tools were developed that compete with the state of the art on benchmark datasets. The models can predict off-target interactions and potential candidates for target-based drug repurposing. Finally, the open-source ScaffoldGraph software was presented for the analysis of hierarchical scaffold relationships and has already been used in multiple projects, including integration into a virtual screening pipeline to increase the tractability of ultra-large screening experiments. Together with existing tools, these contributions will aid in the understanding of drug-protein relationships, particularly in the fields of off-target prediction and drug repurposing, helping to design better drugs faster.

    Isolation and Genomic Analysis of the Cetacean Y-chromosome

    The male-specific mammalian Y-chromosome represents a powerful tool for studying male-mediated gene flow and genome evolution. Here, 7 polymorphic microsatellites were identified for the first time in an odontocete species, using a combination of cell culture, cytogenetics and molecular approaches. Initially, the development of an efficient and repeatable methodology for obtaining a growing lymphocyte culture that facilitated the isolation of individual chromosomes is described. Flow karyotypic characterization and isolation of individual chromosomes via flow sorting or microdissection is reported for the killer whale (Orcinus orca). Microdissected Y-chromosomes from the killer whale and bottlenose dolphin (Tursiops truncatus) were screened for sequences containing microsatellite motifs. Fifteen and ten male-specific microsatellites were identified from the killer whale and bottlenose dolphin, respectively. Additional microsatellite loci were identified from previously published fin whale Y-chromosome sequence. Six markers designed from heterologous sequences amplified from sperm whales (Physeter macrocephalus) were also screened for variation. All 31 markers were monomorphic in the bottlenose dolphin, only 2 loci showed 2 variants in the killer whale, and 7 were polymorphic in the sperm whale. In addition, 162 anonymous regions of the Y-chromosome isolated from the delphinid species were used to characterize the comparative composition of the ‘Y’ relative to the autosomes in these species. Characteristics are discussed in the context of the genome as a whole, species-specific history and with reference to the expected patterns of mammalian Y-chromosome evolution.

    A MOLECULAR APPROACH TO CALANUS (COPEPODA:CALANOIDA) DEVELOPMENT AND SYSTEMATICS

    Production and recruitment measurements in marine copepods of the genus Calanus have been addressed via the study of genes involved in early embryogenesis. The first sequence from a Calanus helgolandicus (C. helgolandicus) developmental gene (Cal-Antp) has been cloned by screening a C. helgolandicus genomic library with a homologous Calanus homeobox probe. Sequencing of an isolated and sub-cloned fragment of this gene, plus further analysis by Inverse Polymerase Chain Reaction (IVPCR), has shown it to be homologous with other Antennapedia homeobox genes. The temporal expression of Cal-Antp was analysed through its messenger RNA (mRNA) complement by Reverse Transcription Polymerase Chain Reaction (RT-PCR). The gene was expressed in tissue taken from eggs over 18 hours old, and in nauplii and copepodite stages, but no expression was detected in eggs less than 18 hours old or in adult tissue. Three further homeobox-containing genes have been identified and analysed through their expression in C. helgolandicus eggs. Two of these are caudal homologues, and the third is homologous to the Antennapedia class of genes. The C. helgolandicus developmental gene sequence data provides a means of developing probes to monitor the temporal expression of such genes and their responses to environmental influence. The applicability of such probes to the investigation of key production and recruitment processes, including egg viability measurement, is discussed. A relatively simple and cost-effective method has been developed to identify the four Calanus species common to the North Atlantic. This system involves the PCR amplification of a region of the mitochondrial rRNA gene without prior purification of the DNA, followed by Restriction Fragment Length Polymorphism (RFLP) analysis of the amplified product. The versatility of the method is demonstrated by the unambiguous identification to species of any life stage, from egg to adult, and of any individual body parts.
The molecular identification technique has for the first time shown the unexpected presence of three different Calanus species in Lurefjorden, Norway, and has proved to be consistently accurate for all individuals tested, including geographically distinct conspecific populations. Plymouth Marine Laborator
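
The principle behind RFLP-based identification, that a sequence difference at a restriction site changes the pattern of fragment lengths, can be shown with a toy digest. The abstract does not name the enzymes or give the amplicon sequences, so the EcoRI site (GAATTC, cut after the G) and both example sequences are assumptions used purely for illustration.

```python
def fragment_lengths(sequence, site, cut_offset):
    """Lengths of the fragments left after cutting at every occurrence of site.

    cut_offset is the position of the cut within the recognition site,
    counted from the start of the site.
    """
    cuts = []
    start = sequence.find(site)
    while start != -1:
        cuts.append(start + cut_offset)
        start = sequence.find(site, start + 1)
    bounds = [0] + cuts + [len(sequence)]
    return [end - begin for begin, end in zip(bounds, bounds[1:])]


if __name__ == "__main__":
    # Hypothetical amplicons from two species differing by one base in the site.
    species_a = "AAAGAATTCAAAA"
    species_b = "AAAGACTTCAAAA"
    print(fragment_lengths(species_a, "GAATTC", 1))  # site present: two fragments
    print(fragment_lengths(species_b, "GAATTC", 1))  # site lost: one fragment
```

On a gel, the two hypothetical species would therefore show different banding patterns, which is what allows assignment to species without sequencing.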