734 research outputs found

    CodonTest: Modeling Amino Acid Substitution Preferences in Coding Sequences

    Codon models of evolution have facilitated the interpretation of selective forces operating on genomes. These models, however, assume a single rate of non-synonymous substitution irrespective of the nature of the amino acids being exchanged. Recent developments have shown that models which allow amino acid pairs to have independent rates of substitution offer improved fit over single-rate models. However, these approaches have been limited by the need for large alignments in their estimation. An alternative approach is to assume that substitution rates between amino acid pairs can be subdivided into a number of rate classes that depends on the information content of the alignment. However, given the combinatorially large number of such models, an efficient model search strategy is needed. Here we develop a Genetic Algorithm (GA) method for the estimation of such models. A GA is used to assign amino acid substitution pairs to a series of rate classes, where the number of classes is estimated from the alignment. Other parameters of the phylogenetic Markov model, including substitution rates, character frequencies and branch lengths, are estimated using standard maximum likelihood optimization procedures. We apply the GA to empirical alignments and show improved model fit over existing models of codon evolution. Our results suggest that current models are poor approximations of protein evolution and thus gene- and organism-specific multi-rate models that incorporate amino acid substitution biases are preferred. We further anticipate that the clustering of amino acid substitution rates into classes will be biologically informative, such that genes with similar functions exhibit similar clustering, and hence this clustering will be useful for the evolutionary fingerprinting of genes.

    Predicting DNA-Binding Specificities of Eukaryotic Transcription Factors

    Today, annotated amino acid sequences of more and more transcription factors (TFs) are readily available. Quantitative information about their DNA-binding specificities, however, is hard to obtain. Position frequency matrices (PFMs), the most widely used models to represent binding specificities, are experimentally characterized only for a small fraction of all TFs. Even for some of the most intensively studied eukaryotic organisms (i.e., human, rat and mouse), only roughly one-sixth of all proteins with an annotated DNA-binding domain have been characterized experimentally. Here, we present a new method based on support vector regression for predicting quantitative DNA-binding specificities of TFs in different eukaryotic species. This approach estimates a quantitative measure for the PFM similarity of two proteins, based on various features derived from their protein sequences. The method is trained and tested on a dataset containing 1,239 TFs with known DNA-binding specificity, and used to predict specific DNA target motifs for 645 TFs with high accuracy.

    First insights into the microbiology of three Antarctic briny systems of the Northern Victoria Land

    Different polar environments (lakes and glaciers), including those in Antarctica, encapsulate brine pools characterized by a unique combination of extreme conditions, mainly high salinity and low temperature. Since 2014, we have been focusing our attention on the microbiology of brine pockets from three lakes in the Northern Victoria Land (NVL), lying in the Tarn Flat (TF) and Boulder Clay (BC) areas. The microbial communities have been analyzed for community structure by next generation sequencing, extracellular enzyme activities, metabolic potentials, and microbial abundances. In this study, we aim to reconsider all available data to analyze the influence exerted by environmental parameters on the community composition and activities. Additionally, the prediction of metabolic functions was attempted using the phylogenetic investigation of communities by reconstruction of unobserved states (PICRUSt2) tool, highlighting that prokaryotic communities were presumably involved in methane metabolism, aromatic compound biodegradation, and organic compound (proteins, polysaccharides, and phosphates) decomposition. The analyzed cryoenvironments differed in terms of prokaryotic diversity, abundance, and retrieved metabolic pathways. By the analysis of DNA sequences, common operational taxonomic units ranged from 2.2% to 22.0%. The bacterial community was dominated by Bacteroidetes. In both BC and TF brines, sequences of the most thermally tolerant and methanogenic Archaea were detected, some of them related to hyperthermophiles.

    Computationally Comparing Biological Networks and Reconstructing Their Evolution

    Biological networks, such as protein-protein interaction, regulatory, or metabolic networks, provide information about biological function beyond what can be gleaned from sequence alone. Unfortunately, most computational problems associated with these networks are NP-hard. In this dissertation, we develop algorithms to tackle numerous fundamental problems in the study of biological networks. First, we present a system for classifying the binding affinity of peptides to a diverse array of immunoglobulin antibodies. Computational approaches to this problem are integral to virtual screening and modern drug discovery. Our system is based on an ensemble of support vector machines and exhibits state-of-the-art performance. It placed 1st in the 2010 DREAM5 competition. Second, we investigate the problem of biological network alignment. Aligning the biological networks of different species allows for the discovery of shared structures and conserved pathways. We introduce an original procedure for network alignment based on a novel topological node signature. The pairwise global alignments of biological networks produced by our procedure, when evaluated under multiple metrics, are both more accurate and more robust to noise than those of previous work. Next, we explore the problem of ancestral network reconstruction. Knowing the state of ancestral networks allows us to examine how biological pathways have evolved, and how pathways in extant species have diverged from those of their common ancestor. We describe a novel framework for representing the evolutionary histories of biological networks and present efficient algorithms for reconstructing either a single parsimonious evolutionary history, or an ensemble of near-optimal histories. Under multiple models of network evolution, our approaches are effective at inferring the ancestral network interactions. Additionally, the ensemble approach is robust to noisy input, and can be used to impute missing interactions in experimental data. Finally, we introduce a framework, GrowCode, for learning network growth models. While previous work focuses on developing growth models manually, or on procedures for learning parameters for existing models, GrowCode learns fundamentally new growth models that match target networks in a flexible and user-defined way. We show that models learned by GrowCode produce networks whose target properties match those of real-world networks more closely than existing models.

    ProtASR2: Ancestral reconstruction of protein sequences accounting for folding stability

    Ancestral sequence reconstruction (ASR) is a molecular evolution technique with applications in a variety of fields such as biotechnology and biomedicine. To infer ancestral sequences with realistic biological properties, the accuracy of ASR methods is crucial. We previously developed an ASR framework for proteins, called ProtASR, which is based on our site-specific stability-constrained substitution (SCS) model with selection on protein folding stability against both unfolding and misfolding. This model improved on the empirical substitution models traditionally applied in ASR without increasing the computational complexity. However, it adopted a global exchangeability matrix, an approximation that we overcome here by considering site-specific exchangeability matrices based on the Halpern–Bruno approach. Here we present ProtASR2, a new version of our ASR framework that implements novel SCS models of protein evolution, namely mean-field (MF) and wild-type (WT). ProtASR2 under MF and WT SCS models outperforms empirical models and previous SCS models in terms of goodness of fit and site-specific distributions of amino acids. Importantly, the framework infers ancestral sequences with more realistic predicted folding stability with respect to simulated sequences, while empirical, CAT and other SCS models tend to overestimate the folding stability. We applied ProtASR2 to explore the evolution of two protein families present in diverse Prokaryota and found fluctuations of protein stability over time in both families. ProtASR2 is available from https://github.com/miguelarenas/protasr and the new SCS models are also available from https://github.com/ugobas/protevol. Use of ProtASR2 will allow more realistic inferences of ancestral proteins in terms of folding stability with respect to those based on traditional empirical and CAT substitution models of protein evolution.
    Funding: Agencia Estatal de Investigación (Ref. RYC-2015-18241); Agencia Estatal de Investigación (Ref. BIO2016-79043-P); Xunta de Galicia (Ref. ED431F 2018/08); Fundación Ramón Arece

    Computational Approaches to Understanding the Structure, Dynamics, Functions, and Mechanisms of Various Bacterial Proteins

    The 3D structure of a protein can be fundamentally useful for understanding protein function. In the absence of an experimentally determined structure, the most common way to obtain protein structures is to use homology modeling, or the mapping of the target sequence onto a closely related homolog with an available structure. However, despite recent efforts in structural biology, the 3D structures of many proteins remain unknown. Recent advances in genomic and metagenomic sequencing coupled with coevolution analysis and protein structure prediction have allowed for highly accurate models of proteins that were previously considered intractable to model due to the lack of suitable templates. Structural models obtained from homology modeling, coevolution-based modeling, or crystallography can then be used with other computational tools such as small molecule docking or molecular dynamics (MD) simulations to help understand protein function, dynamics, and mechanism. Here coevolution-based modeling was used to build a structural model of the HgcAB complex involved in mercury methylation (Chapter I). Based on the model it was proposed that conserved cysteines in HgcB are involved in shuttling mercury, methylmercury, or both. MD simulations and docking to a homology model of E. coli inosine monophosphate dehydrogenase (IMPDH) provided insights into how a single amino acid mutation could relieve inhibition by altering protein structure and dynamics (Chapter II). Coevolution-based structure prediction was also combined with docking and experimental activity data to generate machine learning models that predict enzyme substrate scope for a series of bacterial nitrilases (Chapter III). Machine learning was also used to identify physicochemical properties that describe outer membrane permeability and efflux in E. coli and P. aeruginosa, and new efflux pump inhibitors for the E. coli AcrAB-TolC efflux pump were identified using existing physicochemical guidelines in combination with small molecule docking to a homology model of AcrA (Chapter IV). Lastly, quantum mechanical/molecular mechanical simulations were used to study the mechanism of a key proton transfer step in Toho-1 beta-lactamase using experimentally determined structures of both the apo and cefotaxime-bound forms. These simulations revealed that substrate binding promotes catalysis by enhancing the favorability of this initial proton transfer step (Chapter V).

    The investigation of type-specific features of the copper coordinating AA9 proteins and their effect on the interaction with crystalline cellulose using molecular dynamics studies

    AA9 proteins are metallo-enzymes which are crucial for the early stages of cellulose degradation. AA9 proteins have been suggested to cleave the glycosidic bonds linking cellulose through the use of their Cu2+ coordinating active site. AA9 proteins possess different regioselectivities depending on the cleavage they produce and, as a result, are grouped accordingly. Type 1 AA9 proteins cleave the C1 carbon of cellulose while Type 2 AA9 proteins cleave the C4 carbon and Type 3 AA9 proteins cleave either C1 or C4 carbons. The steric congestion of the AA9 active site has been proposed to be a contributor to the observed regioselectivity. As such, a bioinformatics characterisation of type-specific sequence and structural features was performed. Initially, AA9 protein sequences were obtained from the Pfam database and multiple sequence alignment was performed. The sequences were phylogenetically characterised and grouped into their respective types, and sub-groups were identified. A selection analysis was performed on AA9 LPMO types to determine the selective pressure acting on AA9 protein residues. Motif discovery was then performed to identify conserved sequence motifs in AA9 proteins. Once type-specific sequence features were identified, structural mapping was performed to assess possible effects on substrate interaction. Physicochemical property analysis was also performed to assess biochemical differences between AA9 LPMO types. Molecular dynamics (MD) simulations were then employed to dynamically assess the consequences of the discovered type-specific features on AA9-cellulose interaction. Due to the absence of AA9-specific force field parameters, MD simulations were not readily applicable. As a result, Potential Energy Surface (PES) scans were performed to evaluate the force field parameters for the AA9 active site using the PM6 semi-empirical approach and least squares fitting. A Type 1 AA9 active site was constructed from the crystal structure 4B5Q, encompassing only the Cu2+ coordinating residues, the Cu2+ ion and two water residues. Due to the similarity in AA9 active sites, the Type 1 force field parameters were validated on all three AA9 LPMO types. Two MD simulations for each AA9 LPMO type were conducted using two separate Lennard-Jones parameter sets. Once completed, the MD trajectories were analysed for various features including the RMSD, RMSF, radius of gyration, coordination during simulation, hydrogen bonding, secondary structure conservation and overall protein movement. Force field parameters were successfully evaluated and validated for AA9 proteins. MD simulations of AA9 proteins were able to reveal the presence of unique type-specific binding modes of AA9 active sites to cellulose. These binding modes were characterised by the presence of unique type-specific loops which were present in Type 2 and 3 AA9 proteins but not in Type 1 AA9 proteins. The loops were found to result in steric congestion that affects how the Cu2+ ion interacts with cellulose. As a result, Cu2+ binding to cellulose was observed for Type 1 and not for Type 2 and 3 AA9 proteins. In this study, force field parameters have been evaluated for the Type 1 active site of AA9 proteins, and these parameters were evaluated on all three types and their binding. Future work will focus on identifying the nature of the reactive oxygen species and performing QM/MM calculations to elucidate the reactive mechanism of all three AA9 LPMO types.
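
    Two of the trajectory metrics named in this abstract, RMSD and radius of gyration, can be illustrated with a minimal sketch. This is not the authors' analysis pipeline: the function names `rmsd` and `radius_of_gyration` are hypothetical, coordinates are assumed to be plain (x, y, z) tuples in a common unit, and the two frames are assumed to be already superimposed (real MD analyses normally fit the structures first).

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equally sized coordinate sets.

    Assumes the two frames are already superimposed; no fitting is done here.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate sets must be non-empty and equally sized")
    squared = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return math.sqrt(squared / len(coords_a))


def radius_of_gyration(coords, masses=None):
    """Mass-weighted radius of gyration; equal masses if none are given."""
    if not coords:
        raise ValueError("coords must be non-empty")
    if masses is None:
        masses = [1.0] * len(coords)
    total = sum(masses)
    # Centre of mass of the atom set.
    centre = [
        sum(m * c[i] for m, c in zip(masses, coords)) / total for i in range(3)
    ]
    # Mass-weighted mean squared distance from the centre of mass.
    weighted = sum(
        m * sum((c[i] - centre[i]) ** 2 for i in range(3))
        for m, c in zip(masses, coords)
    )
    return math.sqrt(weighted / total)
```

    Applied per frame against a reference structure, `rmsd` gives a rough measure of structural drift over a trajectory, while `radius_of_gyration` tracks overall compactness.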

    Beyond structural genomics: computational approaches for the identification of ligand binding sites in protein structures

    Structural genomics projects have revealed structures for a large number of proteins of unknown function. Understanding the interactions between these proteins and their ligands would provide an initial step in their functional characterization. Binding site identification methods are a fast and cost-effective way to facilitate the characterization of functionally important protein regions. In this review we describe our recently developed methods for binding site identification in the context of existing methods. The advantage of energy-based approaches is emphasized, since they provide flexibility in the identification and characterization of different types of binding site.

    Bioinformatic methods for detecting protein coevolution

    The term coevolution describes the situation when two or more species or biomolecules reciprocally affect each other's evolution. On the protein level, it is thought to be the main mechanism ensuring correct folding, interactions and function of a protein, and it can be observed both on the level of interacting protein families and on the level of individual amino acid residues. Coevolution studies have proven to be a powerful tool for the prediction of protein structure, function, interaction partners, etc. In this thesis, different algorithms used for the detection of protein coevolution are described, as well as their applications and limitations.
    Keywords: coevolution, protein family, protein structure prediction, interaction partners, correlated mutations, mirrortree, mutual information, direct coupling analysis
    Department of Cell Biology, Faculty of Science
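
    Mutual information, one of the keywords listed for this thesis, is a common score for correlated mutations between two alignment columns. The sketch below is only an illustration and is not taken from the thesis: the function name `mutual_information` is hypothetical, columns are assumed to be equal-length strings with one residue per sequence, and no gap handling or background correction is applied.

```python
import math
from collections import Counter


def mutual_information(col_a, col_b):
    """Mutual information (in bits) between two alignment columns.

    Each column holds one residue per sequence, in the same sequence order.
    """
    if len(col_a) != len(col_b) or len(col_a) == 0:
        raise ValueError("columns must be non-empty and of equal length")
    n = len(col_a)
    count_a = Counter(col_a)
    count_b = Counter(col_b)
    count_ab = Counter(zip(col_a, col_b))
    mi = 0.0
    for (a, b), n_ab in count_ab.items():
        p_ab = n_ab / n
        # p_ab / (p_a * p_b), written with raw counts to limit rounding error.
        ratio = n_ab * n / (count_a[a] * count_b[b])
        mi += p_ab * math.log2(ratio)
    return mi
```

    Two columns that always change together score high (for example `mutual_information("AACC", "GGTT")`), while columns that vary independently score close to zero.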