
    Performance of regression models as a function of experiment noise

    A challenge in developing machine learning regression models is that it is difficult to know whether maximal performance has been reached on a particular dataset, or whether further model improvement is possible. In biology this problem is particularly pronounced, as sample labels (response variables) are typically obtained through experiments and therefore carry experiment noise. Such label noise sets a fundamental limit on the performance attainable by regression models. We address this challenge by deriving a theoretical upper bound for the coefficient of determination (R2) of regression models. This upper bound depends only on the noise associated with the response variable in a dataset and on its variance. The upper bound estimate was validated via Monte Carlo simulations and then used as a tool to benchmark the performance of regression models trained on biological datasets, including protein sequence data, transcriptomic data, and genomic data. Although we study biological datasets in this work, the new upper bound estimates hold for regression models from any research field or application area where response variables have associated noise.
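The noise ceiling described above can be checked with a short Monte Carlo simulation: if a model recovered the true (noise-free) response exactly, its R2 against the noisy labels would still fall short of 1, landing near one common form of such a bound, 1 - σ²/Var(y). A minimal sketch (the signal variance and noise level are illustrative choices, not values from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

n = 100_000
sigma = 0.5                      # assumed experiment noise (std. dev.) on the labels

# True response and the noisy measured labels derived from it
y_true = rng.normal(0.0, 2.0, n)            # "signal" with variance 4
y_obs = y_true + rng.normal(0.0, sigma, n)  # what an experiment would report

# Even a perfect model that recovers y_true exactly cannot explain the noise
ss_res = np.sum((y_obs - y_true) ** 2)
ss_tot = np.sum((y_obs - y_obs.mean()) ** 2)
r2_perfect = 1.0 - ss_res / ss_tot

# Theoretical ceiling computed only from the noise level and label variance
r2_bound = 1.0 - sigma ** 2 / np.var(y_obs)

print(round(r2_perfect, 3), round(r2_bound, 3))
```

The two printed values agree closely, illustrating that no amount of model improvement can push R2 past the noise-determined ceiling.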

    Probabilistic Random Forest improves bioactivity predictions close to the classification threshold by taking into account experimental uncertainty

    Measurements of protein-ligand interactions have reproducibility limits due to experimental errors. Any model based on such assays will consequently have these unavoidable errors influencing its performance; ideally, they should be factored into modelling and output predictions, for example the actual standard deviation of experimental measurements (σ), or the comparability of activity values when heterogeneous activity units (i.e., Ki versus IC50 values) are aggregated during dataset assimilation. However, experimental errors are usually a neglected aspect of model generation. To improve upon the current state of the art, we herein present a novel approach toward predicting protein-ligand interactions using a Probabilistic Random Forest (PRF) classifier. The PRF algorithm was applied to in silico protein target prediction across ~550 tasks from ChEMBL and PubChem. Predictions were evaluated by taking into account various scenarios of experimental standard deviations in both training and test sets, and performance was assessed using fivefold stratified shuffled splits for validation. The largest benefit of incorporating the experimental deviation in PRF was observed for data points close to the binary threshold boundary, where such information is not considered in any way by the original RF algorithm. For example, in cases where σ ranged between 0.4 and 0.6 log units and ideal probability estimates lay between 0.4 and 0.6, the PRF outperformed RF with a median absolute error margin of ~17%. In comparison, the baseline RF outperformed PRF for cases with high confidence of belonging to the active class (far from the binary decision threshold), although the RF models gave errors smaller than the experimental uncertainty, which could indicate that they were overtrained and/or over-confident. Finally, PRF models trained with putative inactives performed worse than PRF models trained without them, likely because putative inactives were not assigned an experimental pXC50 value and were therefore treated as inactives with low uncertainty (which in practice might not be true). In conclusion, PRF can be useful for target prediction models, in particular for data where class boundaries overlap with the measurement uncertainty and where a substantial part of the training data is located close to the classification threshold.
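The abstract does not give the PRF implementation, but its core ingredient is turning a measurement and its experimental σ into a probabilistic class label rather than a hard active/inactive call. A minimal sketch of that step, assuming a normally distributed measurement error and an illustrative pXC50 activity threshold of 6.5 (both the threshold and the helper are hypothetical, not the paper's code):

```python
import math

def p_active(pxc50: float, sigma: float, threshold: float = 6.5) -> float:
    """Probability that the true activity exceeds the class threshold,
    treating the measurement as Normal(pxc50, sigma).
    Threshold value and function name are illustrative assumptions."""
    if sigma == 0:
        return 1.0 if pxc50 > threshold else 0.0
    z = (pxc50 - threshold) / (sigma * math.sqrt(2.0))
    return 0.5 * (1.0 + math.erf(z))  # normal CDF via the error function

# A measurement sitting exactly on the threshold is a coin flip ...
print(p_active(6.5, 0.5))   # 0.5
# ... while one three sigma above it is almost certainly active
print(p_active(8.0, 0.5))
```

Points far from the threshold get probabilities near 0 or 1 and behave like hard labels; the benefit appears precisely in the 0.4-0.6 probability band the abstract highlights.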

    Deep learning suggests that gene expression is encoded in all parts of a co-evolving interacting gene regulatory structure

    Understanding the genetic regulatory code governing gene expression is an important challenge in molecular biology. However, how individual coding and non-coding regions of the gene regulatory structure interact and contribute to mRNA expression levels remains unclear. Here we apply deep learning to over 20,000 mRNA datasets to examine the genetic regulatory code controlling mRNA abundance in 7 model organisms ranging from bacteria to human. In all organisms, we can predict mRNA abundance directly from DNA sequence, with up to 82% of the variation in transcript levels encoded in the gene regulatory structure. By searching for DNA regulatory motifs across the gene regulatory structure, we discover that motif interactions could explain the whole dynamic range of mRNA levels. Co-evolution across coding and non-coding regions suggests that it is not single motifs or regions, but the entire gene regulatory structure and the specific combination of regulatory elements, that defines gene expression levels.
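Models of this kind consume raw DNA sequence, which is typically one-hot encoded before being fed to a neural network. The abstract does not describe the architecture, so the sketch below shows only this standard encoding step:

```python
import numpy as np

BASES = "ACGT"

def one_hot(seq: str) -> np.ndarray:
    """Encode a DNA string as a (length, 4) one-hot matrix;
    ambiguous bases such as N map to an all-zero row."""
    idx = {b: i for i, b in enumerate(BASES)}
    mat = np.zeros((len(seq), 4), dtype=np.float32)
    for pos, base in enumerate(seq.upper()):
        if base in idx:
            mat[pos, idx[base]] = 1.0
    return mat

x = one_hot("ACGTN")
print(x.shape)   # (5, 4)
print(x.sum())   # 4.0 -- the N row stays zero
```

Stacking such matrices over promoter, UTR, and coding regions yields the multi-part sequence input that gene-regulatory-structure models train on.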

    Application of machine learning in systems biology

    Biological systems are composed of a large number of molecular components. Understanding their behavior as a result of the interactions between the individual components is one of the aims of systems biology. Computational modelling is a powerful tool commonly used in systems biology, which relies on mathematical models that capture the properties of and interactions between molecular components to simulate the behavior of the whole system. However, in many biological systems it is challenging to build reliable mathematical models due to the complexity and poor understanding of the underlying mechanisms. With the breakthrough in big data technologies in biology, data-driven machine learning (ML) approaches offer a promising complement to traditional theory-based models in systems biology. Firstly, ML can be used to model systems in which the relationships between the components and the system are too complex to be modelled with theory-based models. Two such examples of using ML to resolve genotype-phenotype relationships are presented in this thesis: (i) predicting yeast phenotypes using genomic features and (ii) predicting the thermal niche of microorganisms based on proteome features. Secondly, ML naturally complements theory-based models. By applying ML, I improved the performance of a genome-scale metabolic model in describing yeast thermotolerance. In this application, ML was used to estimate the thermal parameters via a Bayesian statistical learning approach that trains regression models and performs uncertainty quantification and reduction. The predicted bottleneck genes were further validated by experiments to improve yeast thermotolerance. In such applications, regression models are frequently used, and their performance relies on many factors, including but not limited to feature engineering and the quality of response values.
Manually engineering sufficient relevant features is particularly challenging in biology due to the lack of knowledge in certain areas. With the increasing volume of big data, deep transfer learning enables us to learn a statistical summary of the samples in a large dataset, which can then be used as input to train other ML models. In the present thesis, I applied this approach to first learn a deep representation of enzyme thermal adaptation and then use it to develop regression models for predicting enzyme optimal temperatures and protein melting temperatures. The transfer learning-based regression models outperformed classical ones trained on rationally engineered features in both cases. On the other hand, noisy response values are very common in biological datasets due to variation in experimental measurements, and they fundamentally restrict the performance attainable with regression models. I thereby addressed this challenge by deriving a theoretical upper bound for the coefficient of determination (R2) for regression models. This upper bound depends on the noise associated with the response variable and on its variance for a given dataset. It can thus be used to test whether maximal performance has been reached on a particular dataset, or whether further model improvement is possible.
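The transfer pattern described here, a frozen learned representation feeding a lightweight regression head, can be sketched as follows. A fixed random projection stands in for the pretrained encoder, and all names, sizes, and the synthetic task are illustrative assumptions, not the thesis's actual models:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical stand-in for a pretrained encoder: a fixed (frozen) projection
# from 20 raw features into an 8-dimensional "learned representation".
W_pre = rng.normal(size=(20, 8))

def encode(raw: np.ndarray) -> np.ndarray:
    """Frozen deep representation; only the head below is trained."""
    return np.tanh(raw @ W_pre)

# Small labelled downstream set (e.g. melting temperatures), synthetic here
raw = rng.normal(size=(200, 20))
y = encode(raw) @ rng.normal(size=8) + rng.normal(0.0, 0.1, 200)

# Train only the linear regression head on top of the frozen encoder
X = encode(raw)
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
pred = X @ coef
r2 = 1.0 - np.sum((y - pred) ** 2) / np.sum((y - y.mean()) ** 2)
print(round(r2, 3))
```

Because the encoder is frozen, the small labelled set only has to fit a handful of head parameters, which is what makes transfer learning attractive when labels are scarce.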

    Learning the Regulatory Code of Gene Expression

    Data-driven machine learning is the method of choice for predicting molecular phenotypes from nucleotide sequence, modeling gene expression events including protein-DNA binding and chromatin states, as well as mRNA and protein levels. Deep neural networks automatically learn informative sequence representations, and interpreting them enables us to improve our understanding of the regulatory code governing gene expression. Here, we review the latest developments that apply shallow or deep learning to quantify molecular phenotypes and decode the cis-regulatory grammar from prokaryotic and eukaryotic sequencing data. Our approach is to build from the ground up, first focusing on the initiating protein-DNA interactions, then on specific coding and non-coding regions, and finally on advances that combine multiple parts of the gene and mRNA regulatory structures, achieving unprecedented performance. We thus provide a quantitative view of gene expression regulation from nucleotide sequence, concluding with an information-centric overview of the central dogma of molecular biology.

    MACHINE LEARNING AND BIOINFORMATIC INSIGHTS INTO KEY ENZYMES FOR A BIO-BASED CIRCULAR ECONOMY

    The world is presently faced with a sustainability crisis; it is becoming increasingly difficult to meet the energy and material needs of a growing global population without depleting and polluting our planet. Greenhouse gases released from the continuous combustion of fossil fuels engender accelerated climate change, and plastic waste accumulates in the environment. There is a need for a circular economy, where energy and materials are renewably derived from waste items, rather than by consuming limited resources. Deconstruction of the recalcitrant linkages in natural and synthetic polymers is crucial for a circular economy, as deconstructed monomers can be used to manufacture new products. In Nature, organisms utilize enzymes for the efficient depolymerization and conversion of macromolecules. Consequently, by employing enzymes industrially, biotechnology holds great promise for energy- and cost-efficient conversion of materials for a circular economy. However, there is a need for enhanced molecular-level understanding of enzymes to enable economically viable technologies that can be applied on a global scale. This work is a computational study of key enzymes that catalyze important reactions that can be utilized for a bio-based circular economy. Specifically, bioinformatics and data-mining approaches were employed to study family 7 glycoside hydrolases (GH7s), which are the principal enzymes in Nature for deconstructing cellulose to simple sugars; a cytochrome P450 enzyme (GcoA) that catalyzes the demethylation of lignin subunits; and MHETase, a tannase-family enzyme utilized by the bacterium Ideonella sakaiensis in the degradation and assimilation of polyethylene terephthalate (PET).
Since enzyme function is fundamentally dependent on the primary amino-acid sequence, we hypothesize that machine-learning algorithms can be trained on an ensemble of functionally related enzymes to reveal functional patterns in the enzyme family, and to map the primary sequence to enzyme function such that functional properties can be predicted for a new enzyme sequence with significant accuracy. We find that supervised machine learning identifies important residues for processivity and accurately predicts functional subtypes and domain architectures in GH7s. Bioinformatic analyses revealed conserved active-site residues in GcoA and informed protein engineering that enabled expanded enzyme specificity and improved activity. Similarly, bioinformatic studies and phylogenetic analysis provided evolutionary context and identified residues crucial for MHET-hydrolase activity in MHETase. Lastly, we developed machine-learning models to predict enzyme thermostability, allowing for high-throughput screening of enzymes that can catalyze reactions at elevated temperatures. Altogether, this work provides a solid basis for a computational data-driven approach to understanding, identifying, and engineering enzymes for biotechnological applications towards a more sustainable world.
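The abstract does not specify the features behind the thermostability models; a classic baseline for sequence-based enzyme property prediction is the amino-acid composition vector, sketched here purely as an illustration of mapping primary sequence to ML features:

```python
from collections import Counter

# The 20 standard amino acids, in a fixed order so feature positions are stable
AA = "ACDEFGHIKLMNPQRSTVWY"

def aa_composition(seq: str) -> list[float]:
    """Fraction of each standard amino acid in a protein sequence --
    a simple baseline feature vector for property regression."""
    counts = Counter(seq.upper())
    total = sum(counts[a] for a in AA) or 1  # ignore non-standard symbols
    return [counts[a] / total for a in AA]

feats = aa_composition("MKVVLAG")  # toy sequence, not a real enzyme
print(len(feats))            # 20
print(round(sum(feats), 6))  # 1.0
```

Each enzyme then becomes one fixed-length row in a feature matrix, which is what permits high-throughput screening of candidate sequences with a trained regressor.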
