Constructing grids for molecular quantum dynamics using an autoencoder
A challenge for molecular quantum dynamics (QD) calculations is the curse of
dimensionality with respect to the nuclear degrees of freedom. A common
approach that works especially well for fast reactive processes is to reduce
the dimensionality of the system to a few of the most relevant coordinates.
Identifying these coordinates can be very difficult, since they are often highly
unintuitive. We present a machine learning approach that utilizes an
autoencoder that is trained to find a low-dimensional representation of a set
of molecular configurations. These configurations are generated by trajectory
calculations performed on the reactive molecular systems of interest. The
resulting low-dimensional representation can be used to generate a potential
energy surface grid in the desired subspace. Using the G-matrix formalism to
calculate the kinetic energy operator, QD calculations can be carried out on
this grid. In addition to step-by-step instructions for the grid construction,
we present an application to a test system.
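The core idea of the abstract above, training an autoencoder to find a low-dimensional representation of a set of configurations, can be sketched in a few lines. This is an illustrative toy, not the paper's actual setup: the "configurations" are synthetic points near a 2-D subspace of a 10-D space, the network is linear, and all names are ours.

```python
import numpy as np

# Toy data standing in for molecular configurations: 500 points in 10
# "degrees of freedom" that actually lie near a 2-D subspace, plus noise.
rng = np.random.default_rng(0)
n_conf, n_dof, n_latent = 500, 10, 2
basis = rng.normal(size=(n_latent, n_dof))
X = rng.normal(size=(n_conf, n_latent)) @ basis \
    + 0.01 * rng.normal(size=(n_conf, n_dof))

W_enc = 0.1 * rng.normal(size=(n_dof, n_latent))   # encoder weights
W_dec = 0.1 * rng.normal(size=(n_latent, n_dof))   # decoder weights
lr = 1e-3

def loss(We, Wd):
    """Mean squared reconstruction error of the linear autoencoder."""
    R = X - (X @ We) @ Wd
    return np.mean(R ** 2)

loss0 = loss(W_enc, W_dec)
for _ in range(2000):                 # plain gradient descent
    Z = X @ W_enc                     # latent (reduced) coordinates
    R = X - Z @ W_dec                 # reconstruction residual
    g_dec = -2.0 * Z.T @ R / n_conf
    g_enc = -2.0 * X.T @ (R @ W_dec.T) / n_conf
    W_dec -= lr * g_dec
    W_enc -= lr * g_enc
```

After training, `X @ W_enc` gives the low-dimensional coordinates on which a potential energy surface grid could then be built; a real application would use a nonlinear network and actual trajectory data.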
Data-driven discovery of coordinates and governing equations
The discovery of governing equations from scientific data has the potential
to transform data-rich fields that lack well-characterized quantitative
descriptions. Advances in sparse regression are currently enabling the
tractable identification of both the structure and parameters of a nonlinear
dynamical system from data. The resulting models have the fewest terms
necessary to describe the dynamics, balancing model complexity with descriptive
ability, and thus promoting interpretability and generalizability. This
provides an algorithmic approach to Occam's razor for model discovery. However,
this approach fundamentally relies on an effective coordinate system in which
the dynamics have a simple representation. In this work, we design a custom
autoencoder to discover a coordinate transformation into a reduced space where
the dynamics may be sparsely represented. Thus, we simultaneously learn the
governing equations and the associated coordinate system. We demonstrate this
approach on several example high-dimensional dynamical systems with
low-dimensional behavior. The resulting modeling framework combines the
strengths of deep neural networks for flexible representation and sparse
identification of nonlinear dynamics (SINDy) for parsimonious models. It is the
first method of its kind to place the discovery of coordinates and models on an
equal footing.
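The sparse-regression half of the framework described above (SINDy) can be illustrated with its core step: sequentially thresholded least squares over a library of candidate terms. This minimal sketch uses a known toy system, dx/dt = -2x, so the recovered model should keep only the linear term; the library and thresholds are our choices, not the paper's.

```python
import numpy as np

t = np.linspace(0.0, 2.0, 200)
x = np.exp(-2.0 * t)            # trajectory of dx/dt = -2x
dx = -2.0 * x                   # exact derivative (in practice: estimated)

# Candidate library Theta(x) = [1, x, x^2].
Theta = np.column_stack([np.ones_like(x), x, x ** 2])

# Sequentially thresholded least squares: fit, zero small coefficients,
# refit on the surviving terms, repeat.
xi, *_ = np.linalg.lstsq(Theta, dx, rcond=None)
for _ in range(10):
    small = np.abs(xi) < 0.1
    xi[small] = 0.0
    big = ~small
    if big.any():
        xi[big], *_ = np.linalg.lstsq(Theta[:, big], dx, rcond=None)

print(np.round(xi, 3))          # expect approximately [0, -2, 0]
```

The autoencoder contribution of the paper is to learn the coordinates in which such a sparse fit is possible; here the coordinate is already the right one.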
Time-lagged autoencoders: Deep learning of slow collective variables for molecular kinetics
Inspired by the success of deep learning techniques in the physical and
chemical sciences, we apply a modification of an autoencoder type deep neural
network to the task of dimension reduction of molecular dynamics data. We can
show that our time-lagged autoencoder reliably finds low-dimensional embeddings
for high-dimensional feature spaces which capture the slow dynamics of the
underlying stochastic processes, beyond the capabilities of linear dimension
reduction techniques.
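A linear stand-in for the time-lagged autoencoder makes the idea above concrete: instead of reconstructing x(t), the bottleneck model predicts x(t + tau), so it latches onto the slow coordinate rather than the high-variance one. With approximately whitened inputs this reduces to an SVD truncation of the ordinary regression matrix; the data and all names below are illustrative, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
n, tau = 20000, 1

# Slow coordinate: AR(1) with unit stationary variance; fast: white noise.
slow = np.zeros(n)
for i in range(1, n):
    slow[i] = 0.99 * slow[i - 1] + np.sqrt(1 - 0.99 ** 2) * rng.normal()
fast = rng.normal(size=n)

X = np.column_stack([slow, fast])
X = (X - X.mean(0)) / X.std(0)               # roughly whitened features
X0, X1 = X[:-tau], X[tau:]                   # pairs (x(t), x(t + tau))

B, *_ = np.linalg.lstsq(X0, X1, rcond=None)  # full linear time-lagged map
U, s, Vt = np.linalg.svd(B)
encoder = U[:, 0]                            # rank-1 bottleneck direction
print(np.abs(encoder))                       # dominated by the slow coordinate
```

Note that a plain (non-lagged) autoencoder would have no reason to prefer the slow coordinate here, since both features have unit variance; the lag is what injects kinetic information.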
Nonlinear proper orthogonal decomposition for convection-dominated flows
Autoencoder techniques find increasingly common use in reduced order modeling as a means to create a latent space. This reduced order representation offers a modular data-driven modeling approach for nonlinear dynamical systems when integrated with a time series predictive model. In this Letter, we put forth a nonlinear proper orthogonal decomposition (POD) framework, which is an end-to-end Galerkin-free model combining autoencoders with long short-term memory networks for dynamics. By eliminating the projection error due to the truncation of Galerkin models, a key enabler of the proposed nonintrusive approach is the kinematic construction of a nonlinear mapping between the full-rank expansion of the POD coefficients and the latent space where the dynamics evolve. We test our framework for model reduction of a convection-dominated system, which is generally challenging for reduced order models. Our approach not only improves the accuracy, but also significantly reduces the computational cost of training and testing.
This material is based upon work supported by the U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research under Award Number DE-SC0019290. O.S. gratefully acknowledges the Early Career Research Program (ECRP) support of the U.S. Department of Energy. O.S. also gratefully acknowledges the financial support of the National Science Foundation under Award No. DMS-2012255. T.I. acknowledges support through National Science Foundation Grant No. DMS-2012253.
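The motivation for the nonlinear POD framework above can be seen in the linear baseline it improves on: standard POD via SVD of a snapshot matrix. For a convection-dominated field (here a translating Gaussian pulse, our own toy setup, not the Letter's test case) the truncation error decays slowly with the number of modes, which is exactly the regime where an autoencoder latent space helps.

```python
import numpy as np

# Snapshot matrix: a Gaussian pulse translating across the domain,
# columns indexed by time, rows by space.
x = np.linspace(0.0, 1.0, 256)
times = np.linspace(0.0, 0.6, 80)
snapshots = np.array(
    [np.exp(-((x - 0.2 - t) / 0.05) ** 2) for t in times]
).T

# POD modes are the left singular vectors of the snapshot matrix.
U, s, Vt = np.linalg.svd(snapshots, full_matrices=False)

def pod_error(r):
    """Relative Frobenius error of the rank-r POD reconstruction."""
    approx = U[:, :r] @ np.diag(s[:r]) @ Vt[:r]
    return np.linalg.norm(snapshots - approx) / np.linalg.norm(snapshots)

# Error decays only slowly with r for a traveling wave:
print([round(pod_error(r), 3) for r in (2, 10, 30)])
```

The nonlinear POD of the Letter replaces the rank-r linear reconstruction with an autoencoder mapping between the full set of POD coefficients and a much smaller latent space, where an LSTM then advances the dynamics.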
ML4Chem: A Machine Learning Package for Chemistry and Materials Science
ML4Chem is an open-source machine learning library for chemistry and
materials science. It provides an extendable platform to develop and deploy
machine learning models and pipelines, and is targeted at both non-expert and
expert users. ML4Chem follows user-experience design principles and offers the
tools needed to go from data preparation to inference. Here we introduce its atomistic
module for the implementation, deployment, and reproducibility of atom-centered
models. This module is composed of six core building blocks: data,
featurization, models, model optimization, inference, and visualization. We
present their functionality and ease of use with demonstrations utilizing
neural networks and kernel ridge regression algorithms.
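One of the two algorithm families the abstract demonstrates, kernel ridge regression, is simple enough to sketch in full. The following is the textbook method on a toy 1-D problem; it is NOT the ML4Chem API, and all names, data, and hyperparameters are our own illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(2)
X_train = rng.uniform(-3, 3, size=(60, 1))
y_train = np.sin(X_train[:, 0])            # noiseless toy target

def rbf(A, B, gamma=1.0):
    """Gaussian (RBF) kernel matrix between row sets A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

# Dual KRR solution: alpha = (K + lambda I)^{-1} y.
lam = 1e-4
K = rbf(X_train, X_train)
alpha = np.linalg.solve(K + lam * np.eye(len(K)), y_train)

# Predict at new points via the cross-kernel.
X_test = np.linspace(-2.5, 2.5, 50)[:, None]
y_pred = rbf(X_test, X_train) @ alpha
print(float(np.max(np.abs(y_pred - np.sin(X_test[:, 0])))))  # small
```

In an atomistic setting the inputs would be atom-centered feature vectors rather than raw coordinates, which is what the library's featurization block provides.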
Interpreting deep learning for cell differentiation: supervised and unsupervised models viewed through the lens of information and perturbation theory
"Predicting the future isn't magic, it's artificial intelligence" Dave Waters.
In recent decades there has been unprecedented growth in the field of machine learning, and particularly in deep learning models. The combination of big data and computational power has nurtured the evolution of a variety of new methods to predict and interpret future scenarios. These data-centric models can achieve exceptional performance on specific tasks, with their prediction boundaries continuously expanding towards new and more complex challenges.
However, model complexity often translates into a lack of interpretability from a scientific perspective: it is not trivial to identify the factors involved in the final outcomes.
Explainability may not always be a requirement for machine learning tasks, especially when it comes at the expense of predictive performance. But for some applications, such as biological discovery or medical diagnostics, understanding the output and determining the factors that influence decisions is essential.
In this thesis we develop both a supervised and an unsupervised approach to map from genotype to phenotype. We emphasise the importance of interpretability and feature extraction from the models by identifying relevant genes for cell differentiation. We then explore the rules and mechanisms behind the models from a theoretical perspective, using information theory to explain the learning process and applying perturbation theory to transform the results into a generalisable representation.
We start by building a supervised approach to mapping cell profiles from genotype to phenotype using single-cell RNA-Seq data. We leverage non-linearities among gene expressions to identify cellular levels of differentiation. The ambiguity, and often absence, of labels in most biological studies motivated the development of novel unsupervised techniques, leading to a new, general, and biologically interpretable framework based on Variational Autoencoders.
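The regulariser at the heart of the Variational Autoencoder framework mentioned above has a simple closed form worth recording. For a Gaussian encoder posterior N(mu, sigma^2) and a standard-normal prior, the KL term of the ELBO is KL = 0.5 * sum(mu^2 + sigma^2 - log(sigma^2) - 1), summed over latent dimensions. This is the standard textbook expression, not thesis-specific code, and the function name is ours.

```python
import numpy as np

def kl_gaussian(mu, logvar):
    """KL( N(mu, exp(logvar)) || N(0, 1) ), summed over latent dims."""
    return 0.5 * np.sum(mu ** 2 + np.exp(logvar) - logvar - 1.0)

print(kl_gaussian(np.zeros(8), np.zeros(8)))   # posterior == prior -> 0.0
print(kl_gaussian(np.ones(2), np.zeros(2)))    # shifted means -> 1.0
```

During training this term is added to the reconstruction loss; it is also the quantity that shapes the embedding geometry that the later information-theoretic and perturbative analyses of the thesis operate on.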
The application and validation of these methods proved successful, but questions regarding the learning process and the generative nature of the results remained unanswered. We use information theory to define a new approach to interpreting training and the converged solutions of our models.
The variational and generative nature of Autoencoders provides a platform to develop general models. Their results should extrapolate and allow generalisation beyond the boundaries of the observed data. To this end, we introduce for the first time a new interpretation of the embedded generative functions through perturbation theory. The embedding multiplicity is addressed by transforming the distributions into a new set of generalisable functions, while characterising their energy spectrum under a particular energy landscape.
We outline the combination of theoretical and machine-learning-based methods for moving towards interpretable and generalisable models. Developing a theoretical framework to map from genotype to phenotype, we provide both supervised and unsupervised tools that operate on single-cell RNA-Seq data. We have built a pipeline to identify relevant genes and cell types through Variational Autoencoders (VAEs), validating reconstructed gene expressions to demonstrate the generative performance of the embeddings. The new interpretation of the information learned and extracted by the models defines a label-independent evaluation, particularly useful for unsupervised learning. Lastly, we introduce a novel transformation of the generative embeddings based on quantum and perturbation theory.
Our contributions can be, and have been, extended to new datasets, according to the nature of the tasks being explored. For instance, the combination of unsupervised learning and information theory can be applied to a variety of biological or medical data. We have trained several VAE models with additional cancer and metabolic data, showing that they extract meaningful representations of the data. The perturbation-theory transformation of the embedding can also lead to future research on the generative potential of Variational Autoencoders from a physics perspective, combining statistical and quantum mechanics.
We believe that machine learning will only continue its fast expansion and growth through the development of more generalisable and more interpretable models.
"Prediction is very difficult, especially if it's about the future" Niels Boh