2,862 research outputs found

    Qualitative System Identification from Imperfect Data

    Experience in the physical sciences suggests that the only realistic means of understanding complex systems is through the use of mathematical models. Typically, this has come to mean the identification of quantitative models expressed as differential equations. Quantitative modelling works best when the structure of the model (i.e., the form of the equations) is known and the primary concern is one of estimating the values of the parameters in the model. For complex biological systems, the model structure is rarely known and the modeller has to deal with both model identification and parameter estimation. In this paper we are concerned with providing automated assistance with the first of these problems. Specifically, we examine the identification by machine of the structural relationships between experimentally observed variables. These relationships will be expressed in the form of qualitative abstractions of a quantitative model. Such qualitative models may not only provide clues to the precise quantitative model, but also assist in understanding the essence of that model. Our position in this paper is that background knowledge incorporating system modelling principles can be used to constrain effectively the set of good qualitative models. Utilising the model-identification framework provided by Inductive Logic Programming (ILP), we present empirical support for this position using a series of increasingly complex artificial datasets. The results are obtained with qualitative and quantitative data subject to varying amounts of noise and different degrees of sparsity. The results also point to the presence of a set of qualitative states, which we term kernel subsets, that may be necessary for a qualitative model-learner to learn correct models. We demonstrate scalability of the method to biological system modelling by identification of the glycolysis metabolic pathway from data.

    An evaluation of machine-learning for predicting phenotype: studies in yeast, rice, and wheat

    Abstract: In phenotype prediction the physical characteristics of an organism are predicted from knowledge of its genotype and environment. Such studies, often called genome-wide association studies, are of the highest societal importance, as they are central to medicine, crop breeding, etc. We investigated three phenotype prediction problems: one simple and clean (yeast), and the other two complex and real-world (rice and wheat). We compared standard machine learning methods (elastic net, ridge regression, lasso regression, random forest, gradient boosting machines (GBM), and support vector machines (SVM)) with two state-of-the-art classical statistical genetics methods (genomic BLUP and a two-step sequential method based on linear regression). Additionally, using the clean yeast data, we investigated how performance varied with the complexity of the biological mechanism, the amount of observational noise, the number of examples, the amount of missing data, and the use of different data representations. We found that for almost all the phenotypes considered, standard machine learning methods outperformed the methods from classical statistical genetics. On the yeast problem, the most successful method was GBM, followed by lasso regression and the two statistical genetics methods; with greater mechanistic complexity GBM was best, while in simpler cases lasso was superior. In the wheat and rice studies the best two methods were SVM and BLUP. The most robust method in the presence of noise, missing data, etc. was random forests. The classical statistical genetics method of genomic BLUP was found to perform well on problems where there was population structure. This suggests that standard machine learning methods need to be refined to include population structure information when this is present.
    We conclude that the application of machine learning methods to phenotype prediction problems holds great promise, but that determining which methods are likely to perform well on any given problem is elusive and non-trivial.

    Predicting rice phenotypes with meta and multi-target learning

    Abstract: The features in some machine learning datasets can naturally be divided into groups. This is the case with genomic data, where features can be grouped by chromosome. In many applications it is common for these groupings to be ignored, as interactions may exist between features belonging to different groups. However, including a group that does not influence a response introduces noise when fitting a model, leading to suboptimal predictive accuracy. Here we present two general frameworks for the generation and combination of meta-features when feature groupings are present. Furthermore, we make comparisons to multi-target learning, given that one is typically interested in predicting multiple phenotypes. We evaluated the frameworks and multi-target learning approaches on a genomic rice dataset where the regression task is to predict plant phenotype. Our results demonstrate that there are use cases for both the meta and multi-target approaches, given that overall, they significantly outperform the base case

    Transformative Machine Learning

    The key to success in machine learning (ML) is the use of effective data representations. Traditionally, data representations were hand-crafted. Recently it has been demonstrated that, given sufficient data, deep neural networks can learn effective implicit representations from simple input representations. However, for most scientific problems, the use of deep learning is not appropriate as the amount of available data is limited, and/or the output models must be explainable. Nevertheless, many scientific problems do have significant amounts of data available on related tasks, which makes them amenable to multi-task learning, i.e. learning many related problems simultaneously. Here we propose a novel and general representation learning approach for multi-task learning that works successfully with small amounts of data. The fundamental new idea is to transform an input intrinsic data representation (i.e., handcrafted features), to an extrinsic representation based on what a pre-trained set of models predict about the examples. This transformation has the dual advantages of producing significantly more accurate predictions, and providing explainable models. To demonstrate the utility of this transformative learning approach, we have applied it to three real-world scientific problems: drug-design (quantitative structure activity relationship learning), predicting human gene expression (across different tissue types and drug treatments), and meta-learning for machine learning (predicting which machine learning methods work best for a given problem). In all three problems, transformative machine learning significantly outperforms the best intrinsic representation
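The intrinsic-to-extrinsic transformation described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `to_extrinsic`, the three toy "pre-trained models" and the sample data are all hypothetical, chosen only to show how each example is re-described by what a set of existing models predicts about it.

```python
# Minimal sketch of an extrinsic representation (all names are hypothetical).
# Each "pre-trained model" is assumed to be a callable that maps an intrinsic
# feature vector to a single numeric prediction for some related task.

def to_extrinsic(example, pretrained_models):
    """Return the predictions of every pre-trained model for one example."""
    return [model(example) for model in pretrained_models]


def transform_dataset(examples, pretrained_models):
    """Apply the intrinsic-to-extrinsic transformation to a list of examples."""
    return [to_extrinsic(example, pretrained_models) for example in examples]


# Toy stand-ins for models trained on related tasks (assumptions, not real models).
toy_models = [
    lambda x: x[0] + x[1],   # related task 1
    lambda x: x[0] - x[1],   # related task 2
    lambda x: x[0] * x[1],   # related task 3
]

if __name__ == "__main__":
    intrinsic = [[1.0, 2.0], [3.0, 0.5]]
    for row in transform_dataset(intrinsic, toy_models):
        print(row)
```

A learner for a new task would then be trained on the transformed rows rather than on the original handcrafted features.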

    The use of small angle neutron scattering with contrast matching and variable adsorbate partial pressures in the study of porosity in activated carbons

    The porosity of a typical activated carbon is investigated with small angle neutron scattering (SANS), using the contrast matching technique, by changing the hydrogen/deuterium content of the adsorbed liquid (toluene) to extract the carbon density at different scattering vector (Q) values and by measuring the p/p0 dependence of the SANS, using fully deuterated toluene. The contrast matching data show that the apparent density is Q-dependent, either because of pores opening near the carbon surface during the activation process or because of changes in D-toluene density in nanoscale pores. For each p/p0 value, evaluation of the Porod Invariant yields the fraction of empty pores. Hence, comparison with the adsorption isotherm shows that the fully dry powder undergoes densification when liquid is added. An algebraic function is developed to fit the SANS signal at each p/p0 value, hence yielding the effective Kelvin radii of the liquid surfaces as a function of p/p0. These values, when compared with the Kelvin Equation, show that the resultant surface tension value is accurate for the larger pores but tends to increase for small (nanoscale) pores. The resultant pore size distribution is less model-dependent than for the traditional methods of analyzing the adsorption isotherms.

    Beating the Best: Improving on AlphaFold2 at Protein Structure Prediction

    The goal of the Protein Structure Prediction (PSP) problem is to predict a protein's 3D structure (conformation) from its amino acid sequence. The problem has been a 'holy grail' of science since the Nobel Prize-winning work of Anfinsen demonstrated that protein conformation was determined by sequence. A recent and important step towards this goal was the development of AlphaFold2, currently the best PSP method. AlphaFold2 is probably the highest profile application of AI to science. Both AlphaFold2 and RoseTTAFold (another impressive PSP method) have been published and placed in the public domain (code & models). Stacking is a form of ensemble machine learning (ML) in which multiple base models are first learnt, then a meta-model is learnt using the outputs of the base models to form a model that outperforms the base models. Stacking has been successful in many applications. We developed the ARStack PSP method by stacking AlphaFold2 and RoseTTAFold. ARStack significantly outperforms AlphaFold2. We rigorously demonstrate this using two sets of non-homologous proteins, and a test set of protein structures published after that of AlphaFold2 and RoseTTAFold. As more high quality prediction methods are published it is likely that ensemble methods will increasingly outperform any single method. Comment: 12 pages
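The general stacking idea in this abstract can be sketched in a few lines. This is a minimal illustration, not the ARStack method: the two base predictors are stand-in lists of numeric predictions, the meta-model is assumed to be a plain least-squares weighting of the two, and the names `fit_meta_weights`, `stack` and `mse` are hypothetical.

```python
# Minimal stacking sketch (hypothetical names; not the ARStack implementation).
# Two base models have already produced predictions; the meta-model learns
# weights (w_a, w_b) that minimise sum((w_a * a + w_b * b - y) ** 2).

def fit_meta_weights(preds_a, preds_b, targets):
    """Solve the 2x2 normal equations for the least-squares weights."""
    saa = sum(a * a for a in preds_a)
    sbb = sum(b * b for b in preds_b)
    sab = sum(a * b for a, b in zip(preds_a, preds_b))
    say = sum(a * y for a, y in zip(preds_a, targets))
    sby = sum(b * y for b, y in zip(preds_b, targets))
    det = saa * sbb - sab * sab
    if abs(det) < 1e-12:
        raise ValueError("base predictions are collinear; weights are not unique")
    w_a = (say * sbb - sby * sab) / det
    w_b = (sby * saa - say * sab) / det
    return w_a, w_b


def stack(preds_a, preds_b, weights):
    """Combine the two base predictions with the learnt meta-model weights."""
    w_a, w_b = weights
    return [w_a * a + w_b * b for a, b in zip(preds_a, preds_b)]


def mse(preds, targets):
    """Mean squared error between predictions and targets."""
    return sum((p - y) ** 2 for p, y in zip(preds, targets)) / len(targets)


# Toy data: the two base models err in opposite directions.
targets = [1.0, 2.0, 3.0, 4.0]
base_a = [1.5, 1.5, 3.5, 3.5]
base_b = [0.5, 2.5, 2.5, 4.5]

if __name__ == "__main__":
    weights = fit_meta_weights(base_a, base_b, targets)
    stacked = stack(base_a, base_b, weights)
    print("weights:", weights)
    print("base MSEs:", mse(base_a, targets), mse(base_b, targets))
    print("stacked MSE:", mse(stacked, targets))
```

In practice the meta-model would be fitted on base-model predictions for held-out examples, so that it does not simply reward whichever base model overfits the training data; the toy data here skip that step for brevity.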

    LGEM+: a first-order logic framework for automated improvement of metabolic network models through abduction

    Scientific discovery in biology is difficult due to the complexity of the systems involved and the expense of obtaining high quality experimental data. Automated techniques are a promising way to make scientific discoveries at the scale and pace required to model large biological systems. A key problem for 21st century biology is to build a computational model of the eukaryotic cell. The yeast Saccharomyces cerevisiae is the best understood eukaryote, and genome-scale metabolic models (GEMs) are rich sources of background knowledge that we can use as a basis for automated inference and investigation. We present LGEM+, a system for automated abductive improvement of GEMs consisting of: a compartmentalised first-order logic framework for describing biochemical pathways (using curated GEMs as the expert knowledge source); and a two-stage hypothesis abduction procedure. We demonstrate that deductive inference on logical theories created using LGEM+, using the automated theorem prover iProver, can predict growth/no-growth of S. cerevisiae strains in minimal media. LGEM+ proposed 2094 unique candidate hypotheses for model improvement. We assess the value of the generated hypotheses using two criteria: (a) genome-wide single-gene essentiality prediction, and (b) constraint of flux-balance analysis (FBA) simulations. For (b) we developed an algorithm to integrate FBA with the logic model. We rank and filter the hypotheses using these assessments. We intend to test these hypotheses using the robot scientist Genesis, which is based around chemostat cultivation and high-throughput metabolomics. Comment: 15 pages, one figure, two tables, two algorithms

    Functional Expression of Parasite Drug Targets and Their Human Orthologs in Yeast

    BACKGROUND: The exacting nutritional requirements and complicated life cycles of parasites mean that they are not always amenable to high-throughput drug screening using automated procedures. Therefore, we have engineered the yeast Saccharomyces cerevisiae to act as a surrogate for expressing anti-parasitic targets from a range of biomedically important pathogens, to facilitate the rapid identification of new therapeutic agents. METHODOLOGY/PRINCIPAL FINDINGS: Using pyrimethamine/dihydrofolate reductase (DHFR) as a model parasite drug/drug target system, we explore the potential of engineered yeast strains (expressing DHFR enzymes from Plasmodium falciparum, P. vivax, Homo sapiens, Schistosoma mansoni, Leishmania major, Trypanosoma brucei and T. cruzi) to exhibit appropriate differential sensitivity to pyrimethamine. Here, we demonstrate that yeast strains (lacking the major drug efflux pump, Pdr5p) expressing yeast ((Sc)DFR1), human ((Hs)DHFR), Schistosoma ((Sm)DHFR), and Trypanosoma ((Tb)DHFR and (Tc)DHFR) DHFRs are insensitive to pyrimethamine treatment, whereas yeast strains producing Plasmodium ((Pf)DHFR and (Pv)DHFR) DHFRs are hypersensitive. Reassuringly, yeast strains expressing field-verified, drug-resistant mutants of P. falciparum DHFR ((Pf)dhfr (51I,59R,108N)) are completely insensitive to pyrimethamine, further validating our approach to drug screening. We further show the versatility of the approach by replacing yeast essential genes with other potential drug targets, namely phosphoglycerate kinases (PGKs) and N-myristoyl transferases (NMTs). CONCLUSIONS/SIGNIFICANCE: We have generated a number of yeast strains that can be successfully harnessed for the rapid and selective identification of urgently needed anti-parasitic agents