Qualitative System Identification from Imperfect Data
Experience in the physical sciences suggests that the only realistic means of
understanding complex systems is through the use of mathematical models.
Typically, this has come to mean the identification of quantitative models
expressed as differential equations. Quantitative modelling works best when the
structure of the model (i.e., the form of the equations) is known; and the
primary concern is one of estimating the values of the parameters in the model.
For complex biological systems, the model-structure is rarely known and the
modeler has to deal with both model-identification and parameter-estimation. In
this paper we are concerned with providing automated assistance to the first of
these problems. Specifically, we examine the identification by machine of the
structural relationships between experimentally observed variables. These
relationships will be expressed in the form of qualitative abstractions of a
quantitative model. Such qualitative models may not only provide clues to the
precise quantitative model, but also assist in understanding the essence of
that model. Our position in this paper is that background knowledge
incorporating system modelling principles can be used to constrain effectively
the set of good qualitative models. Utilising the model-identification
framework provided by Inductive Logic Programming (ILP) we present empirical
support for this position using a series of increasingly complex artificial
datasets. The results are obtained with qualitative and quantitative data
subject to varying amounts of noise and different degrees of sparsity. The
results also point to the presence of a set of qualitative states, which we
term kernel subsets, that may be necessary for a qualitative model-learner to
learn correct models. We demonstrate scalability of the method to biological
system modelling by identification of the glycolysis metabolic pathway from
data
An evaluation of machine-learning for predicting phenotype: studies in yeast, rice, and wheat
Abstract: In phenotype prediction the physical characteristics of an organism are predicted from knowledge of its genotype and environment. Such studies, often called genome-wide association studies, are of the highest societal importance, as they are central to medicine, crop-breeding, etc. We investigated three phenotype prediction problems: one simple and clean (yeast), and the other two complex and real-world (rice and wheat). We compared standard machine learning methods (elastic net, ridge regression, lasso regression, random forest, gradient boosting machines (GBM), and support vector machines (SVM)) with two state-of-the-art classical statistical genetics methods: genomic BLUP and a two-step sequential method based on linear regression. Additionally, using the clean yeast data, we investigated how performance varied with the complexity of the biological mechanism, the amount of observational noise, the number of examples, the amount of missing data, and the use of different data representations. We found that for almost all the phenotypes considered, standard machine learning methods outperformed the methods from classical statistical genetics. On the yeast problem, the most successful method was GBM, followed by lasso regression, and the two statistical genetics methods; with greater mechanistic complexity GBM was best, while in simpler cases lasso was superior. In the wheat and rice studies the best two methods were SVM and BLUP. The most robust method in the presence of noise, missing data, etc. was random forests. The classical statistical genetics method of genomic BLUP was found to perform well on problems where there was population structure. This suggests that standard machine learning methods need to be refined to include population structure information when this is present.
We conclude that the application of machine learning methods to phenotype prediction problems holds great promise, but that determining which methods are likely to perform well on any given problem is elusive and non-trivial
Predicting rice phenotypes with meta and multi-target learning
Abstract: The features in some machine learning datasets can naturally be divided into groups. This is the case with genomic data, where features can be grouped by chromosome. In many applications it is common for these groupings to be ignored, as interactions may exist between features belonging to different groups. However, including a group that does not influence a response introduces noise when fitting a model, leading to suboptimal predictive accuracy. Here we present two general frameworks for the generation and combination of meta-features when feature groupings are present. Furthermore, we make comparisons to multi-target learning, given that one is typically interested in predicting multiple phenotypes. We evaluated the frameworks and multi-target learning approaches on a genomic rice dataset where the regression task is to predict plant phenotype. Our results demonstrate that there are use cases for both the meta and multi-target approaches, given that overall, they significantly outperform the base case
Transformative Machine Learning
The key to success in machine learning (ML) is the use of effective data
representations. Traditionally, data representations were hand-crafted.
Recently it has been demonstrated that, given sufficient data, deep neural
networks can learn effective implicit representations from simple input
representations. However, for most scientific problems, the use of deep
learning is not appropriate as the amount of available data is limited, and/or
the output models must be explainable. Nevertheless, many scientific problems
do have significant amounts of data available on related tasks, which makes
them amenable to multi-task learning, i.e. learning many related problems
simultaneously. Here we propose a novel and general representation learning
approach for multi-task learning that works successfully with small amounts of
data. The fundamental new idea is to transform an input intrinsic data
representation (i.e., handcrafted features), to an extrinsic representation
based on what a pre-trained set of models predict about the examples. This
transformation has the dual advantages of producing significantly more accurate
predictions, and providing explainable models. To demonstrate the utility of
this transformative learning approach, we have applied it to three real-world
scientific problems: drug-design (quantitative structure activity relationship
learning), predicting human gene expression (across different tissue types and
drug treatments), and meta-learning for machine learning (predicting which
machine learning methods work best for a given problem). In all three problems,
transformative machine learning significantly outperforms the best intrinsic
representation
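The intrinsic-to-extrinsic transformation described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the pre-trained models are hypothetical stand-in functions, and the nearest-neighbour learner on top is an assumption chosen only to keep the example short.

```python
from typing import Callable, List, Sequence

# Assumption: a "pre-trained model" is any callable that maps an example's
# intrinsic (hand-crafted) features to a single numeric prediction.
Model = Callable[[Sequence[float]], float]


def to_extrinsic(x: Sequence[float], pretrained: Sequence[Model]) -> List[float]:
    """Re-describe an example by what each pre-trained model predicts for it."""
    return [model(x) for model in pretrained]


def nearest_neighbour_predict(
    query: Sequence[float],
    train_x: Sequence[Sequence[float]],
    train_y: Sequence[float],
    pretrained: Sequence[Model],
) -> float:
    """Predict a new task's target with 1-nearest-neighbour in extrinsic space."""
    q = to_extrinsic(query, pretrained)
    best_label, best_dist = train_y[0], float("inf")
    for x, y in zip(train_x, train_y):
        e = to_extrinsic(x, pretrained)
        dist = sum((a - b) ** 2 for a, b in zip(q, e))
        if dist < best_dist:
            best_label, best_dist = y, dist
    return best_label


# Hypothetical models standing in for ones trained on related tasks.
pretrained_models: List[Model] = [
    lambda x: x[0] + x[1],
    lambda x: x[0] - x[1],
]
```

In this sketch every example is represented by two numbers, one per pre-trained model, regardless of how many intrinsic features it started with.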
The use of small angle neutron scattering with contrast matching and variable adsorbate partial pressures in the study of porosity in activated carbons
The porosity of a typical activated carbon is investigated with small angle neutron scattering (SANS), using the contrast matching technique, by changing the hydrogen/deuterium content of the adsorbed liquid (toluene) to extract the carbon density at different scattering vector (Q) values and by measuring the p/p0 dependence of the SANS, using fully deuterated toluene. The contrast matching data show that the apparent density is Q-dependent, either because of pores opening near the carbon surface during the activation process or changes in D-toluene density in nanoscale pores. For each p/p0 value, evaluation of the Porod Invariant yields the fraction of empty pores. Hence, comparison with the adsorption isotherm shows that the fully dry powder undergoes densification when liquid is added. An algebraic function is developed to fit the SANS signal at each p/p0 value, hence yielding the effective Kelvin radii of the liquid surfaces as a function of p/p0. These values, when compared with the Kelvin Equation, show that the resultant surface tension value is accurate for the larger pores but tends to increase for small (nanoscale) pores. The resultant pore size distribution is less model-dependent than for the traditional methods of analyzing the adsorption isotherms
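For orientation, the two relations named in this abstract are commonly written in the textbook forms below. The abstract does not give the authors' notation, so every symbol here is an assumption: \gamma is the liquid surface tension, V_m its molar volume, \theta the contact angle, r_K the Kelvin radius, R the gas constant, T the temperature, \Delta\rho the scattering length density contrast, and \phi the volume fraction of one phase in an idealised two-phase system.

```latex
% Kelvin equation for a concave meniscus of radius r_K
\ln\!\left(\frac{p}{p_0}\right) = -\frac{2\,\gamma\,V_m \cos\theta}{r_K\,R\,T}

% Porod invariant for an ideal two-phase system
Q^{*} = \int_0^{\infty} Q^{2}\, I(Q)\, \mathrm{d}Q = 2\pi^{2}\,(\Delta\rho)^{2}\,\phi\,(1-\phi)
```

Under these assumptions, a measured r_K at a given p/p0 can be inverted for an effective \gamma, which is the kind of comparison with the Kelvin Equation the abstract describes.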
Beating the Best: Improving on AlphaFold2 at Protein Structure Prediction
The goal of the Protein Structure Prediction (PSP) problem is to predict a
protein's 3D structure (conformation) from its amino acid sequence. The problem
has been a 'holy grail' of science since the Nobel Prize-winning work of
Anfinsen demonstrated that protein conformation was determined by sequence. A
recent and important step towards this goal was the development of AlphaFold2,
currently the best PSP method. AlphaFold2 is probably the highest profile
application of AI to science. Both AlphaFold2 and RoseTTAFold (another
impressive PSP method) have been published and placed in the public domain
(code & models). Stacking is a form of ensemble machine learning (ML) in which
multiple baseline models are first learnt, then a meta-model is learnt using
the outputs of the baseline models to form a model that outperforms the
base models. Stacking has been successful in many applications. We developed
the ARStack PSP method by stacking AlphaFold2 and RoseTTAFold. ARStack
significantly outperforms AlphaFold2. We rigorously demonstrate this using two
sets of non-homologous proteins, and a test set of protein structures published
after that of AlphaFold2 and RoseTTAFold. As more high quality prediction
methods are published it is likely that ensemble methods will increasingly
outperform any single method. Comment: 12 pages
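The general idea of stacking described in this abstract can be sketched as follows. This is a generic illustration, not the ARStack method: the abstract does not specify the meta-model, so the least-squares linear combiner and the function names are assumptions, and the base-model outputs are plain numbers rather than protein structures.

```python
from typing import Sequence, Tuple


def fit_stacking_weights(
    base_a: Sequence[float],
    base_b: Sequence[float],
    targets: Sequence[float],
) -> Tuple[float, float]:
    """Fit a linear meta-model t ~ w_a * a + w_b * b by least squares.

    base_a and base_b are the two base models' predictions on examples that
    were held out from base-model training; targets are the true values.
    """
    saa = sum(a * a for a in base_a)
    sbb = sum(b * b for b in base_b)
    sab = sum(a * b for a, b in zip(base_a, base_b))
    sat = sum(a * t for a, t in zip(base_a, targets))
    sbt = sum(b * t for b, t in zip(base_b, targets))
    det = saa * sbb - sab * sab
    if abs(det) < 1e-12:
        raise ValueError("base model outputs are collinear; weights are not unique")
    # Closed-form solution of the 2x2 normal equations.
    w_a = (sat * sbb - sbt * sab) / det
    w_b = (sbt * saa - sat * sab) / det
    return w_a, w_b


def stacked_predict(weights: Tuple[float, float], a: float, b: float) -> float:
    """Combine two base predictions with the learnt meta-model weights."""
    return weights[0] * a + weights[1] * b


def mean_squared_error(pred: Sequence[float], targets: Sequence[float]) -> float:
    return sum((p - t) ** 2 for p, t in zip(pred, targets)) / len(targets)
```

When the two base models make errors in different places, the fitted combination can have lower error than either model alone, which is the effect the abstract reports for ensembles.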
LGEM+: a first-order logic framework for automated improvement of metabolic network models through abduction
Scientific discovery in biology is difficult due to the complexity of the
systems involved and the expense of obtaining high quality experimental data.
Automated techniques are a promising way to make scientific discoveries at the
scale and pace required to model large biological systems. A key problem for
21st century biology is to build a computational model of the eukaryotic cell.
The yeast Saccharomyces cerevisiae is the best understood eukaryote, and
genome-scale metabolic models (GEMs) are rich sources of background knowledge
that we can use as a basis for automated inference and investigation.
We present LGEM+, a system for automated abductive improvement of GEMs
consisting of: a compartmentalised first-order logic framework for describing
biochemical pathways (using curated GEMs as the expert knowledge source); and a
two-stage hypothesis abduction procedure.
We demonstrate that deductive inference on logical theories created using
LGEM+, using the automated theorem prover iProver, can predict growth/no-growth
of S. cerevisiae strains in minimal media. LGEM+ proposed 2094 unique candidate
hypotheses for model improvement. We assess the value of the generated
hypotheses using two criteria: (a) genome-wide single-gene essentiality
prediction, and (b) constraint of flux-balance analysis (FBA) simulations. For
(b) we developed an algorithm to integrate FBA with the logic model. We rank
and filter the hypotheses using these assessments. We intend to test these
hypotheses using the robot scientist Genesis, which is based around chemostat
cultivation and high-throughput metabolomics. Comment: 15 pages, one figure, two tables, two algorithms
Functional Expression of Parasite Drug Targets and Their Human Orthologs in Yeast
BACKGROUND: The exacting nutritional requirements and complicated life cycles of parasites mean that they are not always amenable to high-throughput drug screening using automated procedures. Therefore, we have engineered the yeast Saccharomyces cerevisiae to act as a surrogate for expressing anti-parasitic targets from a range of biomedically important pathogens, to facilitate the rapid identification of new therapeutic agents. METHODOLOGY/PRINCIPAL FINDINGS: Using pyrimethamine/dihydrofolate reductase (DHFR) as a model parasite drug/drug target system, we explore the potential of engineered yeast strains (expressing DHFR enzymes from Plasmodium falciparum, P. vivax, Homo sapiens, Schistosoma mansoni, Leishmania major, Trypanosoma brucei and T. cruzi) to exhibit appropriate differential sensitivity to pyrimethamine. Here, we demonstrate that yeast strains (lacking the major drug efflux pump, Pdr5p) expressing yeast ((Sc)DFR1), human ((Hs)DHFR), Schistosoma ((Sm)DHFR), and Trypanosoma ((Tb)DHFR and (Tc)DHFR) DHFRs are insensitive to pyrimethamine treatment, whereas yeast strains producing Plasmodium ((Pf)DHFR and (Pv)DHFR) DHFRs are hypersensitive. Reassuringly, yeast strains expressing field-verified, drug-resistant mutants of P. falciparum DHFR ((Pf)dhfr (51I,59R,108N)) are completely insensitive to pyrimethamine, further validating our approach to drug screening. We further show the versatility of the approach by replacing yeast essential genes with other potential drug targets, namely phosphoglycerate kinases (PGKs) and N-myristoyl transferases (NMTs). CONCLUSIONS/SIGNIFICANCE: We have generated a number of yeast strains that can be successfully harnessed for the rapid and selective identification of urgently needed anti-parasitic agents