135 research outputs found

    Contacts prediction of linear peptides from genomic data

    The rise of metagenomics and technological improvements in bioinformatics and computational biology have led to an exponential increase in the amount of biological data available for study. However, biological data are analyzed much more slowly than they are stored. This issue has pushed the development of programs capable of extracting significant information from newly sourced data without the need for human intervention. More specifically, some of these programs have been developed to infer structural information from protein sequences. Since the structure of a protein is strictly bound to its function, the importance of this task is easy to understand. Contact maps are among the structural information that can be inferred from a protein sequence. Contact maps define whether two residues are functionally linked within the same protein chain or across two different ones. Although much work has been carried out on intra-chain contact map prediction using sequence information, less can be found about inter-chain contact maps. Moreover, methods are usually presented and tested on benchmark datasets generated for that purpose. In this work, a whole pipeline for both intra-chain and inter-chain contact prediction is presented. Instead of using a generic benchmark set of protein sequences as input, the pipeline starts from residue-level predictions of linear interacting peptides. Linear interacting peptides are regions in a protein sequence that are thought not to have a fixed fold, but to adapt their structure to the functional needs of the protein itself. Needless to say, fewer studies on this specific issue can be found in the literature. Finally, an analysis of the results is carried out. The analysis focuses on evaluating the methods employed for contact prediction on the given dataset. 
Particular attention is paid to comparing the performance on inter-chain alignments with that achieved on intra-chain alignments. Furthermore, the effect of linear interacting peptides is taken into account.
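
As a rough illustration of what a contact map encodes, the sketch below builds a binary map from 3D residue coordinates. The distance-based contact definition, the 8 Å cutoff, the toy coordinates, and the function name `contact_map` are all assumptions made for illustration; the abstract does not state how contacts are defined in the pipeline.

```python
import math

def contact_map(coords, cutoff=8.0):
    """Return a symmetric 0/1 matrix; entry [i][j] is 1 when residues i and j
    lie within `cutoff` angstroms of each other (assumed contact definition)."""
    n = len(coords)
    cmap = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(coords[i], coords[j]) <= cutoff:
                cmap[i][j] = cmap[j][i] = 1
    return cmap

# Hypothetical toy coordinates, one representative point per residue.
toy = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (20.0, 0.0, 0.0)]
cmap = contact_map(toy)
for row in cmap:
    print(row)
```

An inter-chain map could be sketched the same way by comparing the coordinates of one chain against those of another, which gives a rectangular rather than a square matrix.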

    Bioinformatic approaches for understanding chromatin regulation

    No full text

    Deep Learning for Genomics: A Concise Overview

    Advancements in genomic research such as high-throughput sequencing techniques have driven modern genomic studies into "big data" disciplines. This data explosion is constantly challenging conventional methods used in genomics. In parallel with the urgent demand for robust algorithms, deep learning has succeeded in a variety of fields such as vision, speech, and text processing. Yet genomics entails unique challenges to deep learning since we are expecting from deep learning a superhuman intelligence that explores beyond our knowledge to interpret the genome. A powerful deep learning model should rely on insightful utilization of task-specific knowledge. In this paper, we briefly discuss the strengths of different deep learning models from a genomic perspective so as to fit each particular task with a proper deep architecture, and remark on practical considerations of developing modern deep learning architectures for genomics. We also provide a concise review of deep learning applications in various aspects of genomic research, as well as pointing out potential opportunities and obstacles for future genomics applications. Comment: Invited chapter for Springer Book: Handbook of Deep Learning Application

    Bayesian statistical approach for protein residue-residue contact prediction

    Despite continuous efforts in automating experimental structure determination and systematic target selection in structural genomics projects, the gap between the number of known amino acid sequences and solved 3D structures for proteins is constantly widening. While DNA sequencing technologies are advancing at an extraordinary pace, thereby constantly increasing throughput while at the same time reducing costs, protein structure determination is still labour intensive, time-consuming and expensive. This trend illustrates the essential importance of complementary computational approaches in order to bridge the so-called sequence-structure gap. About half of the protein families lack structural annotation and therefore are not amenable to techniques that infer protein structure from homologs. These protein families can be addressed by de novo structure prediction approaches that in practice are often limited by the immense computational costs required to search the conformational space for the lowest-energy conformation. Improved predictions of contacts between amino acid residues have been demonstrated to sufficiently constrain the overall protein fold and thereby extend the applicability of de novo methods to larger proteins. Residue-residue contact prediction is based on the idea that selection pressure on protein structure and function can lead to compensatory mutations between spatially close residues. This leaves an echo of correlation signatures that can be traced down from the evolutionary record. Despite the success of contact prediction methods, there are several challenges. The most evident limitation lies in the requirement of deep alignments, which excludes the majority of protein families without associated structural information that are the focus for contact guided de novo structure prediction. The heuristics applied by current contact prediction methods pose another challenge, since they omit available coevolutionary information. 
This work presents two different approaches for addressing the limitations of contact prediction methods. Instead of inferring evolutionary couplings by maximizing the pseudo-likelihood, I maximize the full likelihood of the statistical model for protein sequence families. This approach achieved comparable precision, with minor improvements over the pseudo-likelihood methods for protein families with few homologous sequences. A Bayesian statistical approach has been developed that provides posterior probability estimates for residue-residue contacts and eliminates the use of heuristics. The full information of coevolutionary signatures is exploited by explicitly modelling the distribution of statistical couplings that reflects the nature of residue-residue interactions. Surprisingly, the posterior probabilities do not directly translate into more precise predictions than those obtained by pseudo-likelihood methods combined with prior knowledge. However, the Bayesian framework offers a statistically clean and theoretically solid treatment of the contact prediction problem. This flexible and transparent framework provides a convenient starting point for further developments, such as integrating more complex prior knowledge. The model can also easily be extended towards the derivation of probability estimates for residue-residue distances to enhance the precision of predicted structures.
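
For orientation, coevolution-based contact predictors are commonly built on a pairwise (Potts-type) model over aligned sequences. The abstract does not give the exact form of its statistical model, so the notation below ($v_i$, $w_{ij}$, $Z$) is an assumption based on that standard formulation. The full likelihood requires the global partition function $Z$, whereas the pseudo-likelihood replaces it with a product of per-position conditionals that only need local normalization:

```latex
% Assumed standard pairwise model for an aligned sequence x = (x_1, ..., x_L),
% with the symmetry convention w_{ji}(b, a) = w_{ij}(a, b)
p(x \mid v, w) = \frac{1}{Z(v, w)}
  \exp\Big( \sum_{i=1}^{L} v_i(x_i) + \sum_{1 \le i < j \le L} w_{ij}(x_i, x_j) \Big)

% Pseudo-likelihood approximation: product of conditionals, no global Z
p(x \mid v, w) \approx \prod_{i=1}^{L} p(x_i \mid x_{\setminus i}, v, w)
  = \prod_{i=1}^{L}
    \frac{\exp\big( v_i(x_i) + \sum_{j \ne i} w_{ij}(x_i, x_j) \big)}
         {\sum_{a} \exp\big( v_i(a) + \sum_{j \ne i} w_{ij}(a, x_j) \big)}
```

Here $a$ runs over the amino acid alphabet, and large inferred couplings $w_{ij}$ are what such methods typically read as evidence for a contact between positions $i$ and $j$.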

    Using Machine Learning to Better Predict the Structure of RNA and RNA Containing Complexes

    Determining the structure of RNA in the presence of drug-like molecules is a crucial step in any drug development campaign. Standard experimental approaches are expensive and time-consuming, and current state-of-the-art computational methods are too inaccurate to be useful. In principle, computer docking can be used to predict the 3D structure of RNA-ligand complexes. However, the scoring functions that accompany the available docking programs for pose ranking of RNA-ligand complexes misclassify native-like poses among a set of decoy poses. As such, there is a need for the development of fast, easy, and precise prediction methods for determining the 3D structure of RNAs. In theory, nuclear magnetic resonance (NMR) spectroscopy derived chemical shifts contain information about the local chemical environment at each site in a molecule and so can be a source of rich structural information. In this work, the goal is to predict the structure of RNA-ligand complexes using NMR chemical shifts. To that end, we explore the effect of different machine learning algorithms and ring current models on the accuracy of chemical shift prediction for standard RNA-ligand complexes. Extra-Randomized trees machine learning algorithms and the Pople ring current model were found to be the most accurate at predicting the chemical shifts of RNA-ligand complexes. Next we explored the use of chemical shifts to guide the 3D structure prediction of RNA-ligand complexes starting from RNA sequence. We applied CS-Fold, an in-house method which utilizes chemical shifts to guide the secondary structure prediction of RNAs. From the best predicted secondary structures using CS-Fold, we generated de novo 3D models of RNAs using the Fragment Assembly of RNA with Full Atom Refinement (FARFAR) approach. We used chemical shifts predicted by LarmorD to refine those 3D structures. 
We found that CS-Fold (the CS-guided secondary structure prediction approach) combined with the Rosetta de novo protocol for 3D motif prediction significantly enhanced the recovery rates to 50% compared to 20% obtained by the RNAStructure and Rosetta combination. Next we used rDock to dock the ligand from the 10 best predicted 3D structures of the RNA and filtered the poses based on the chemical shift errors. This study motivated us to build machine learning models based on a molecular fingerprinting approach that can recover native-like RNA-ligand structures from non-native ones in a decoy set as described below. Next, we describe RNAPoser, a computational tool that estimates the relative “nativeness” of a set of RNA-ligand poses using machine learning pose classifiers. We trained our pose classifiers on molecular “fingerprints” that were a fusion of atomic fingerprints. These fingerprints encode the local “RNA environment” around ligand atoms. Using the classification scores from our RNAPoser classifiers and ranking the poses based on those scores, we found that the recovery of native-like poses is significantly better than that obtained from just using the raw rDock docking scores. We also performed a leave-one-out validation and found that RNAPoser could recover ~80% of the poses that were within 2.5 Å of the native poses, in the 88 RNA-ligand complexes we explored. Likewise, on a validation set of 17 complexes, we could recover poses in ~70% of the complexes. RNAPosers could be used as a tool to help in RNA-ligand pose prediction and hence we make it available to the academic community via https://github.com/atfrank/RNAPosers. PhD, Chemistry, University of Michigan, Horace H. Rackham School of Graduate Studies. https://deepblue.lib.umich.edu/bitstream/2027.42/155127/1/itssahil_1.pd

    Support Vector Machine-based Fuzzy Systems for Quantitative Prediction of Peptide Binding Affinity

    Reliable prediction of binding affinity of peptides is one of the most challenging but important complex modelling problems in the post-genome era due to the diversity and functionality of the peptides discovered. Generally, peptide binding prediction models are commonly used to find out whether a binding exists between a certain peptide(s) and a major histocompatibility complex (MHC) molecule(s). Recent research efforts have been focused on quantifying the binding predictions. The objective of this thesis is to develop reliable real-value predictive models through the use of fuzzy systems. A non-linear system is proposed with the aid of support vector-based regression to improve the fuzzy system and applied to the real value prediction of degree of peptide binding. This research study introduced two novel methods to improve structure and parameter identification of fuzzy systems. First, the support-vector based regression is used to identify initial parameter values of the consequent part of type-1 and interval type-2 fuzzy systems. Second, an overlapping clustering concept is used to derive interval valued parameters of the premise part of the type-2 fuzzy system. Publicly available peptide binding affinity data sets obtained from the literature are used in the experimental studies of this thesis. First, the proposed models are blind validated using the peptide binding affinity data sets obtained from a modelling competition. In that competition, almost an equal number of peptide sequences in the training and testing data sets (89, 76, 133 and 133 peptides for the training and 88, 76, 133 and 47 peptides for the testing) are provided to the participants. Each peptide in the data sets was represented by 643 bio-chemical descriptors assigned to each amino acid. Second, the proposed models are cross validated using mouse class I MHC alleles (H2-Db, H2-Kb and H2-Kk). 
H2-Db, H2-Kb, and H2-Kk consist of 65 nona-peptides, 62 octa-peptides, and 154 octa-peptides, respectively. Compared to the previously published results in the literature, the support vector-based type-1 and support vector-based interval type-2 fuzzy models yield an improvement in the prediction accuracy. The quantitative predictive performance has been improved by as much as 33.6% for the first group of data sets and 1.32% for the second group of data sets. Not only did the proposed models improve the performance of the fuzzy system (which used support vector-based regression), but the support vector-based regression also benefited from the fuzzy concept. The results obtained here set the stage for the presented models to be considered for other application domains in computational and/or systems biology. Apart from improving the prediction accuracy, this research study has also identified specific features which play key roles in making reliable peptide binding affinity predictions. The amino acid features "Polarity", "Positive charge", "Hydrophobicity coefficient", and "Zimm-Bragg parameter" are considered highly discriminating features in the peptide binding affinity data sets. This information can be valuable in the design of peptides with strong binding affinity to an MHC I molecule(s). This information may also be useful when designing drugs and vaccines.

    Expression data analysis and regulatory network inference by means of correlation patterns

    With the advance of high-throughput techniques, the amount of available data in the bio-molecular field is rapidly growing. It is now possible to measure genome-wide aspects of an entire biological system as a whole. Correlations that emerge due to internal dependency structures of these systems entail the formation of characteristic patterns in the corresponding data. The extraction of these patterns has become an integral part of computational biology. By triggering perturbations and interventions it is possible to induce an alteration of patterns, which may help to derive the dependency structures present in the system. In particular, differential expression experiments may yield alternate patterns that we can use to approximate the actual interplay of regulatory proteins and genetic elements, namely, the regulatory network of a cell. In this work, we examine the detection of correlation patterns from bio-molecular data and we evaluate their applicability in terms of protein contact prediction, experimental artifact removal, the discovery of unexpected expression patterns and genome-scale inference of regulatory networks. Correlation patterns are not limited to expression data. Their analysis in the context of conserved interfaces among proteins is useful to estimate whether these may have co-evolved. Patterns that hint on correlated mutations would then occur in the associated protein sequences as well. We employ a conceptually simple sampling strategy to decide whether or not two pathway elements share a conserved interface and are thus likely to be in physical contact. We successfully apply our method to a system of ABC-transporters and two-component systems from the phylum of Firmicute bacteria. For spatially resolved gene expression data like microarrays, the detection of artifacts, as opposed to noise, corresponds to the extraction of localized patterns that resemble outliers in a given region. 
We develop a method to detect and remove such artifacts using a sliding-window approach. Our method is very accurate and it is shown to adapt to other platforms like custom arrays as well. Further, we developed Padesco as a way to reveal unexpected expression patterns. We extract frequent and recurring patterns that are conserved across many experiments. For a specific experiment, we predict whether a gene deviates from its expected behaviour. We show that Padesco is an effective approach for selecting promising candidates from differential expression experiments. In Chapter 5, we then focus on the inference of genome-scale regulatory networks from expression data. Here, correlation patterns have proven useful for the data-driven estimation of regulatory interactions. We show that, for reliable eukaryotic network inference, the integration of prior networks is essential. We reveal that this integration leads to an over-estimate of network-wide quality estimates and suggest a corrective procedure, CoRe, to counterbalance this effect. CoRe drastically improves the false discovery rate of the originally predicted networks. We further suggest a consensus approach in combination with an extended set of topological features to obtain a more accurate estimate of the eukaryotic regulatory network for yeast. In the course of this work we show how correlation patterns can be detected and how they can be applied for various problem settings in computational molecular biology. We develop and discuss competitive approaches for the prediction of protein contacts, artifact repair, differential expression analysis, and network inference and show their applicability in practical setups.

    The Influence of Allostery Governing the Changes in Protein Dynamics Upon Substitution

    The focus of this research is to investigate the effects of allostery on the function/activity of an enzyme, human immunodeficiency virus type 1 (HIV-1) protease, using well-defined statistical analyses of the dynamic changes of the protein and variants with unique single point substitutions [1]. The experimental data [1] evaluated here only characterized HIV-1 protease with one of its potential target substrates. Probing the dynamic interactions of the residues of an enzyme and its variants can offer insight into the developmental importance of allosteric signaling and its connection to a protein’s function. The realignment of the secondary structure elements can modulate the mobility along with the frequency of residue contacts, as well as which residues make contact with each other [2-5]. We postulate that if more contacts occur within a structure, the mobility is constrained, and therefore gaining novel contacts can negatively influence the function of a protein. The evolutionary importance of protein dynamics is probed by analyzing the residue positions possessing significant correlations and their relationship to experimental information [1] (variant activities). We propose that the correlated dynamics of residues observed to have considerable correlations, if disrupted, can be used to infer the function of HIV-1 protease and its variants. Given the robustness of HIV-1 protease, the identification of any significant constraint imposed on the dynamics from a potential allosteric site found to disrupt the catalytic activity of the variant is not plainly evident. We also develop machine learning (ML) algorithms to predict the protein function/activity change caused by a single point substitution by using the DCC of each residue pair. Recognition of any substantial association between the dynamics of specific residues and allosteric communication or mechanism requires detailed examination of the dynamics of HIV-1 protease and its variants. 
We also explore the non-linear dependency between each pair of residues using Mutual Information (MI) and how it can influence the dynamics of HIV-1 protease and its variants. We suggest that if the residues of a protein receive more or less information than that of the WT it will adversely impact the function of the protein and can be used to support the classification of a variant structure. Furthermore, using the MI of residues obtained from the MD simulations for the HIV-1 protease structure, we build a ML model to predict a protein’s change in function caused by a single point substitution. Effectively the mobility, dynamics, and non-linear features tested in these studies are found to be useful towards the prediction of potentially drug resistant substitutions related to the catalytic efficiency of HIV-1 protease and the variants
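
The abstract uses "DCC" without expanding it. Assuming it refers to the dynamic cross-correlation commonly computed from MD trajectories, a minimal sketch of that quantity for one residue pair might look as follows. The function name `dcc` and the toy trajectories are hypothetical and are not taken from the thesis.

```python
import math

def dcc(traj_i, traj_j):
    """Dynamic cross-correlation of two residues (assumed definition):
    <dr_i . dr_j> / sqrt(<|dr_i|^2> <|dr_j|^2>), where dr is the
    displacement from the mean position over all frames.

    traj_i, traj_j: equal-length lists of (x, y, z) positions, one per frame.
    Undefined (division by zero) if either residue never moves.
    """
    n = len(traj_i)
    mean_i = [sum(p[k] for p in traj_i) / n for k in range(3)]
    mean_j = [sum(p[k] for p in traj_j) / n for k in range(3)]
    dot = sq_i = sq_j = 0.0
    for p, q in zip(traj_i, traj_j):
        di = [p[k] - mean_i[k] for k in range(3)]
        dj = [q[k] - mean_j[k] for k in range(3)]
        dot += sum(a * b for a, b in zip(di, dj))
        sq_i += sum(a * a for a in di)
        sq_j += sum(b * b for b in dj)
    return dot / math.sqrt(sq_i * sq_j)

# Hypothetical three-frame trajectories along the x axis.
res_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
res_b = [(5.0, 0.0, 0.0), (6.0, 0.0, 0.0), (7.0, 0.0, 0.0)]  # moves with res_a
res_c = [(2.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 0.0, 0.0)]  # moves against res_a

same_direction = dcc(res_a, res_b)
opposite_direction = dcc(res_a, res_c)
```

Values near 1 indicate residues that move together and values near -1 residues that move in opposite directions. Because this measure only captures linear coupling, a quantity such as mutual information is the usual complement for non-linear dependencies.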

    Bayesian model-based approaches with MCMC computation to some bioinformatics problems

    Bioinformatics applications can address the transfer of information at several stages of the central dogma of molecular biology, including transcription and translation. This dissertation focuses on using Bayesian models to interpret biological data in bioinformatics, with Markov chain Monte Carlo (MCMC) as the inference method. First, we use our approach to interpret data at the transcription level. We propose a two-level hierarchical Bayesian model for variable selection on cDNA microarray data. A cDNA microarray quantifies the mRNA levels of thousands of genes simultaneously in one sample. By observing the expression patterns of genes under various treatment conditions, important clues about gene function can be obtained. We consider a multivariate Bayesian regression model and assign priors that favor sparseness in terms of the number of variables (genes) used. We introduce the use of different priors to promote different degrees of sparseness using a unified two-level hierarchical Bayesian model. Second, we apply our method to a problem related to the translation level. We develop hidden Markov models to model linker/non-linker sequence regions in a protein sequence. We use a linker index to exploit differences in amino acid composition between regions from sequence information alone. A goal of protein structure prediction is to take an amino acid sequence (represented as a sequence of letters) and predict its tertiary structure. The identification of linker regions in a protein sequence is valuable in predicting the three-dimensional structure. Because of the complexity of both models, we employ the Markov chain Monte Carlo (MCMC) method, particularly Gibbs sampling (Gelfand and Smith, 1990), for parameter estimation.
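
To illustrate the Gibbs sampling idea in general terms, the following minimal sketch samples a standard bivariate normal with correlation `rho` by alternately drawing each coordinate from its conditional distribution given the other. The target distribution, the function name `gibbs_bivariate_normal`, and all parameter values are illustrative assumptions; they are not the hierarchical regression or hidden Markov models of the dissertation.

```python
import random

def gibbs_bivariate_normal(rho, n_samples, burn_in=500, seed=0):
    """Gibbs sampler for a zero-mean, unit-variance bivariate normal.

    Uses the exact conditionals x | y ~ N(rho * y, 1 - rho^2) and
    y | x ~ N(rho * x, 1 - rho^2). Early draws are discarded as burn-in.
    """
    rng = random.Random(seed)
    cond_sd = (1.0 - rho * rho) ** 0.5
    x = y = 0.0
    samples = []
    for step in range(burn_in + n_samples):
        x = rng.gauss(rho * y, cond_sd)  # update x given the current y
        y = rng.gauss(rho * x, cond_sd)  # update y given the new x
        if step >= burn_in:
            samples.append((x, y))
    return samples

samples = gibbs_bivariate_normal(rho=0.8, n_samples=20000)

# Empirical summaries should approach the target's mean 0 and correlation rho.
n = len(samples)
mean_x = sum(s[0] for s in samples) / n
mean_y = sum(s[1] for s in samples) / n
cov = sum((s[0] - mean_x) * (s[1] - mean_y) for s in samples) / n
var_x = sum((s[0] - mean_x) ** 2 for s in samples) / n
var_y = sum((s[1] - mean_y) ** 2 for s in samples) / n
corr = cov / (var_x * var_y) ** 0.5
```

The same alternating scheme carries over to models whose joint posterior is hard to sample directly, provided each parameter block has a tractable conditional distribution.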