On the role of pre and post-processing in environmental data mining
The quality of discovered knowledge depends heavily on data quality. Unfortunately, real data often contain noise, uncertainty, errors, redundancies or even irrelevant information. The more complex the reality being analyzed, the higher the risk of obtaining low-quality data. Knowledge Discovery from Databases (KDD) offers a global framework for preparing data in the right form to perform correct analyses. On the other hand, the quality of decisions based on KDD results depends not only on the quality of the results themselves, but also on the capacity of the system to communicate those results in an understandable form. Environmental systems are particularly complex, and environmental users particularly require clarity in their results. This paper provides some details on how this can be achieved and discusses the role of pre- and post-processing in the whole process of Knowledge Discovery in environmental systems.
Unified Representation of Molecules and Crystals for Machine Learning
Accurate simulations of atomistic systems from first principles are limited
by computational cost. In high-throughput settings, machine learning can
potentially reduce these costs significantly by accurately interpolating
between reference calculations. For this, kernel learning approaches crucially
require a single Hilbert space accommodating arbitrary atomistic systems. We
introduce a many-body tensor representation that is invariant to translations,
rotations and nuclear permutations of same elements, unique, differentiable,
can represent molecules and crystals, and is fast to compute. Empirical
evidence is presented for energy prediction errors below 1 kcal/mol for 7k
organic molecules and 5 meV/atom for 11k elpasolite crystals. Applicability is
demonstrated for phase diagrams of Pt-group/transition-metal binary systems. Comment: Revised version, minor changes throughout
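The invariance requirements listed in this abstract can be illustrated with a much simpler descriptor than the many-body tensor representation itself. The sketch below is not the authors' method: it uses a sorted list of element-labelled pairwise distances, which is invariant to translations, rotations and permutations of same-element atoms but is not guaranteed to be unique. The function names and the water-like geometry are illustrative assumptions.

```python
import math
from itertools import combinations

def pair_distance_descriptor(atoms, ndigits=6):
    """Sorted (element_a, element_b, distance) triples for all atom pairs.

    `atoms` is a list of (element, (x, y, z)) tuples. Distances do not change
    under translation or rotation, and sorting removes any dependence on the
    order in which atoms are listed.
    """
    entries = []
    for (el_a, pos_a), (el_b, pos_b) in combinations(atoms, 2):
        first, second = sorted((el_a, el_b))
        entries.append((first, second, round(math.dist(pos_a, pos_b), ndigits)))
    return sorted(entries)

def rotate_z_90_and_shift(position, shift=(1.0, 2.0, 3.0)):
    """Rotate a point by 90 degrees about the z axis, then translate it."""
    x, y, z = position
    return (-y + shift[0], x + shift[1], z + shift[2])

# Illustrative water-like geometry (coordinates are made up for the demo).
water = [("O", (0.0, 0.0, 0.0)), ("H", (0.96, 0.0, 0.0)), ("H", (-0.24, 0.93, 0.0))]

# Same molecule: rotated, translated, and with the atom order reversed.
moved = [(el, rotate_z_90_and_shift(pos)) for el, pos in reversed(water)]

same = pair_distance_descriptor(water) == pair_distance_descriptor(moved)
print(same)
```

Rounding to a fixed number of digits absorbs floating-point noise from the coordinate transformation; a descriptor intended for kernel learning would use a smooth, differentiable encoding instead.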
From isomorphism to polymorphism: connecting interzeolite transformations to structural and graph similarity
Zeolites are nanoporous crystalline materials with abundant industrial
applications. Despite sustained research, only 235 different zeolite frameworks
have been realized out of millions of hypothetical ones predicted by
computational enumeration. Structure-property relationships in zeolite
synthesis are very complex and only marginally understood. Here, we apply
structure and graph-based unsupervised machine learning to gain insight on
zeolite frameworks and how they relate to experimentally observed polymorphism
and phase transformations. We begin by describing zeolite structures using the
Smooth Overlap of Atomic Positions method, which clusters crystals with similar
cages and density in a way consistent with traditional hand-selected composite
building units. To also account for topological differences, zeolite crystals
are represented as multigraphs and compared by isomorphism tests. We find that
fourteen different pairs and one trio of known frameworks are graph isomorphic.
Based on experimental interzeolite conversions and occurrence of competing
phases, we propose that the availability of kinetic-controlled transformations
between metastable zeolite frameworks is related to their similarity in the
graph space. When this description is applied to enumerated structures, over
3,400 hypothetical structures are found to be isomorphic to known frameworks,
and thus might be realized from their experimental counterparts. Using a
continuous similarity metric, the space of known zeolites shows additional
overlaps with experimentally observed phase transformations. Hence, graph-based
similarity approaches suggest an avenue for realizing novel zeolites from
existing ones by providing a relationship between pairwise structure similarity
and experimental transformations. Comment: 11 pages, 6 figures
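The idea of comparing multigraphs by isomorphism tests can be sketched on a toy example. The code below is a brute-force check that tries every node relabelling, so it is only practical for a handful of nodes. It is an illustration under that assumption, not the procedure used in the paper, and it ignores the periodicity of real crystal graphs. All names and example graphs are hypothetical.

```python
from collections import Counter
from itertools import permutations

def edge_multiset(edges, mapping=None):
    """Count undirected edges, keeping multiplicity so parallel edges matter."""
    counts = Counter()
    for u, v in edges:
        if mapping is not None:
            u, v = mapping[u], mapping[v]
        counts[frozenset((u, v))] += 1
    return counts

def are_isomorphic(nodes_a, edges_a, nodes_b, edges_b):
    """Brute-force multigraph isomorphism test (factorial cost, tiny graphs only)."""
    if len(nodes_a) != len(nodes_b) or len(edges_a) != len(edges_b):
        return False
    target = edge_multiset(edges_b)
    for perm in permutations(nodes_b):
        mapping = dict(zip(nodes_a, perm))
        if edge_multiset(edges_a, mapping) == target:
            return True
    return False

# A 4-ring in which one bond is doubled.
nodes_a = [0, 1, 2, 3]
edges_a = [(0, 1), (0, 1), (1, 2), (2, 3), (3, 0)]

# The same multigraph with different node labels and edge order.
nodes_b = ["a", "b", "c", "d"]
edges_b = [("c", "d"), ("d", "c"), ("d", "a"), ("a", "b"), ("b", "c")]

# Same node and edge counts, but a diagonal replaces the doubled bond.
edges_c = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]

print(are_isomorphic(nodes_a, edges_a, nodes_b, edges_b))
print(are_isomorphic(nodes_a, edges_a, nodes_a, edges_c))
```

For graphs of realistic size, a dedicated graph library with a pruned search would replace the permutation loop.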
Data mining as a tool for environmental scientists
Over recent years a huge library of data mining algorithms has been developed to tackle a variety of problems in fields such as medical imaging and network traffic analysis. Many of these techniques are far more flexible than more classical modelling approaches and could be usefully applied to data-rich environmental problems. Certain techniques such as Artificial Neural Networks, Clustering, Case-Based Reasoning and more recently Bayesian Decision Networks have found application in environmental modelling while other methods, for example classification and association rule extraction, have not yet been taken up on any wide scale. We propose that these and other data mining techniques could be usefully applied to difficult problems in the field. This paper introduces several data mining concepts and briefly discusses their application to environmental modelling, where data may be sparse, incomplete, or heterogeneous.
Data-driven discovery of coordinates and governing equations
The discovery of governing equations from scientific data has the potential
to transform data-rich fields that lack well-characterized quantitative
descriptions. Advances in sparse regression are currently enabling the
tractable identification of both the structure and parameters of a nonlinear
dynamical system from data. The resulting models have the fewest terms
necessary to describe the dynamics, balancing model complexity with descriptive
ability, and thus promoting interpretability and generalizability. This
provides an algorithmic approach to Occam's razor for model discovery. However,
this approach fundamentally relies on an effective coordinate system in which
the dynamics have a simple representation. In this work, we design a custom
autoencoder to discover a coordinate transformation into a reduced space where
the dynamics may be sparsely represented. Thus, we simultaneously learn the
governing equations and the associated coordinate system. We demonstrate this
approach on several example high-dimensional dynamical systems with
low-dimensional behavior. The resulting modeling framework combines the
strengths of deep neural networks for flexible representation and sparse
identification of nonlinear dynamics (SINDy) for parsimonious models. It is the
first method of its kind to place the discovery of coordinates and models on an
equal footing. Comment: 25 pages, 6 figures; added acknowledgment
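The sparse-regression step mentioned in this abstract can be sketched with sequentially thresholded least squares, one common way of fitting SINDy-style models. The example below is a minimal sketch on toy data and omits the autoencoder entirely. It is not the authors' implementation. The candidate library, the threshold, the perturbation values and all function names are assumptions chosen for illustration.

```python
def solve_linear(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def least_squares(columns, y):
    """Least-squares fit via the normal equations (adequate for a tiny demo)."""
    gram = [[sum(p * q for p, q in zip(ci, cj)) for cj in columns] for ci in columns]
    rhs = [sum(p * q for p, q in zip(ci, y)) for ci in columns]
    return solve_linear(gram, rhs)

def stlsq(columns, y, threshold=0.1, max_iter=10):
    """Sequentially thresholded least squares: fit, drop small terms, refit."""
    active = list(range(len(columns)))
    coeffs = [0.0] * len(columns)
    for _ in range(max_iter):
        if not active:
            break
        fit = least_squares([columns[i] for i in active], y)
        coeffs = [0.0] * len(columns)
        for i, c in zip(active, fit):
            coeffs[i] = c
        kept = [i for i in active if abs(coeffs[i]) >= threshold]
        if kept == active:
            break
        active = kept
    return [c if i in active else 0.0 for i, c in enumerate(coeffs)]

# Toy system dx/dt = -2 x, sampled at five states with a small perturbation.
xs = [0.5, 1.0, 1.5, 2.0, 2.5]
noise = [0.01, -0.02, 0.015, -0.01, 0.005]
dxdt = [-2.0 * x + n for x, n in zip(xs, noise)]

# Candidate library of terms: 1, x, x^2.
library = [[1.0 for _ in xs], [x for x in xs], [x * x for x in xs]]

coeffs = stlsq(library, dxdt)
print(coeffs)
```

The threshold controls the trade-off between model complexity and fit: terms whose coefficients fall below it are removed from the library before the next refit, which is what produces a model with few terms.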
Hierarchical Visualization of Materials Space with Graph Convolutional Neural Networks
The combination of high throughput computation and machine learning has led
to a new paradigm in materials design by allowing for the direct screening of
vast portions of structural, chemical, and property space. The use of these
powerful techniques leads to the generation of enormous amounts of data, which
in turn calls for new techniques to efficiently explore and visualize the
materials space to help identify underlying patterns. In this work, we develop
a unified framework to hierarchically visualize the compositional and
structural similarities between materials in an arbitrary material space with
representations learned from different layers of graph convolutional neural
networks. We demonstrate the potential for such a visualization approach by
showing that patterns emerge automatically that reflect similarities at
different scales in three representative classes of materials: perovskites,
elemental boron, and general inorganic crystals, covering material spaces of
different compositions, structures, and both. For perovskites, elemental
similarities are learned that reflect multiple aspects of atom properties. For
elemental boron, structural motifs emerge automatically showing characteristic
boron local environments. For inorganic crystals, the similarity and stability
of local coordination environments are shown combining different center and
neighbor atoms. The method could help transition to a data-centered exploration
of materials space in automated materials design. Comment: 22 + 7 pages, 6 + 5 figures