Big Data of Materials Science - Critical Role of the Descriptor
Statistical learning of materials properties or functions so far starts with
a largely silent, unchallenged step: the choice of the set of descriptive
parameters (termed the descriptor). However, when the scientific connection
between the descriptor and the actuating mechanisms is unclear, the causality
of the learned descriptor-property relation is uncertain. Trustworthy
prediction of new promising materials, identification of anomalies, and
scientific advancement are then doubtful. We analyse this issue and define
requirements for a suitable descriptor. For a classic example, the energy
difference between the zincblende/wurtzite and rocksalt structures of
semiconductors, we demonstrate how a meaningful descriptor can be found
systematically. Comment: Accepted to Phys. Rev. Lett.
Learning physical descriptors for materials science by compressed sensing
The availability of big data in materials science offers new routes for
analyzing materials properties and functions and achieving scientific
understanding. Finding structure in these data that is not directly visible
with standard tools, and exploiting the scientific information they contain,
requires new and dedicated methodology based on approaches from statistical
learning, compressed sensing, and other recent methods from applied
mathematics, computer science, statistics, signal processing, and information
science. In this paper, we explain and demonstrate a compressed-sensing-based
methodology for feature selection, specifically for discovering physical
descriptors, i.e., physical parameters that describe the material and its
properties of interest, and associated equations that explicitly and
quantitatively describe those relevant properties. As a showcase application
and proof of concept, we describe how to build a physical model for the
quantitative prediction of the crystal structure of binary compound
semiconductors.
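The abstract above describes descriptor discovery as a sparse-recovery problem: out of many candidate physical parameters, only a few should suffice to model the property of interest. As a minimal sketch of that idea, the following pure-NumPy example uses orthogonal matching pursuit, a standard greedy sparse-recovery algorithm, to pick the few informative columns out of a large candidate-feature matrix. All data, dimensions, and the choice of algorithm here are illustrative assumptions, not the paper's actual method or dataset.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: 200 candidate features (columns), of which only 3
# actually determine the target property -- the sparse-recovery premise.
n_samples, n_features, sparsity = 100, 200, 3
X = rng.standard_normal((n_samples, n_features))
true_support = [5, 40, 120]
coef = np.zeros(n_features)
coef[true_support] = [2.0, -1.5, 1.0]
y = X @ coef + 0.01 * rng.standard_normal(n_samples)


def omp(X, y, k):
    """Orthogonal matching pursuit: greedily select k columns of X."""
    residual = y.copy()
    support = []
    for _ in range(k):
        # Pick the remaining feature most correlated with the residual.
        corr = np.abs(X.T @ residual)
        corr[support] = 0.0
        support.append(int(np.argmax(corr)))
        # Re-fit least squares on the current support and update residual.
        sol, *_ = np.linalg.lstsq(X[:, support], y, rcond=None)
        residual = y - X[:, support] @ sol
    support = sorted(support)
    weights, *_ = np.linalg.lstsq(X[:, support], y, rcond=None)
    return support, weights


support, weights = omp(X, y, sparsity)
print(support, weights)  # should typically recover the informative columns
```

With a low noise level and incoherent random features, the greedy selection recovers the true support; in the materials setting, the columns would be candidate descriptors built from atomic properties rather than random numbers.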
Function spaces with dominating mixed smoothness
We study several techniques which are well known in the case of Besov and Triebel-Lizorkin spaces and extend them to spaces with dominating mixed smoothness. We use the ideas of Triebel to prove three important decomposition theorems. We deal with so-called atomic, subatomic and wavelet decompositions. All these theorems have much in common. Roughly speaking, they say that a function belongs to some function space if, and only if, it can be decomposed into a sum of products of coefficients and corresponding building blocks, where the coefficients belong to an appropriate sequence space. These decomposition theorems establish a very useful connection between function and sequence spaces. We use them in the study of the decay of entropy numbers of compact embeddings between two function spaces of dominating mixed smoothness, reducing this problem to the same question on the sequence-space level. The considered scales cover many important specific spaces (Sobolev, Zygmund, Besov), and we obtain generalisations of respective assertions of Belinsky, Dinh Dung and Temlyakov.
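In schematic notation (the specific function spaces, index sets, and building blocks are those of the paper's decomposition theorems; the symbols below are generic placeholders, not the paper's exact statement), such decomposition theorems take the form:

```latex
f \in A \quad\Longleftrightarrow\quad
f \;=\; \sum_{\nu,\, m} \lambda_{\nu m}\, a_{\nu m}
\quad\text{with}\quad
\bigl\| (\lambda_{\nu m}) \mid a \bigr\| \;<\; \infty,
```

where the $a_{\nu m}$ are the building blocks (atoms, subatomic quarks, or wavelets), $A$ is the function space, and $a$ is the associated sequence space; the equivalence of norms is what allows entropy-number questions to be transferred to the sequence-space level.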
Sparse Proteomics Analysis - A compressed sensing-based approach for feature selection and classification of high-dimensional proteomics mass spectrometry data
Background: High-throughput proteomics techniques, such as mass spectrometry
(MS)-based approaches, produce very high-dimensional data-sets. In a clinical
setting one is often interested in how mass spectra differ between patients of
different classes, for example spectra from healthy patients vs. spectra from
patients having a particular disease. Machine learning algorithms are needed to
(a) identify these discriminating features and (b) classify unknown spectra
based on this feature set. Since the acquired data is usually noisy, the
algorithms should be robust against noise and outliers, while the identified
feature set should be as small as possible.
Results: We present a new algorithm, Sparse Proteomics Analysis (SPA), based
on the theory of compressed sensing that allows us to identify a minimal
discriminating set of features from mass spectrometry data-sets. We show (1)
how our method performs on artificial and real-world data-sets, (2) that its
performance is competitive with standard (and widely used) algorithms for
analyzing proteomics data, and (3) that it is robust against random and
systematic noise. We further demonstrate the applicability of our algorithm to
two previously published clinical data-sets.
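SPA itself is not reproduced here; as a simplified sketch of the pipeline the abstract describes (select a small discriminating feature set from high-dimensional, noisy spectra, then classify on those features), the following example screens synthetic "spectra" for a handful of informative bins and classifies with a nearest-centroid rule. The data, the bin indices, and the screening statistic (a standardized mean difference rather than SPA's compressed-sensing step) are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical stand-in for MS spectra: 4000 intensity bins per sample,
# of which only 5 carry the class difference; the rest is noise.
n_per_class, n_bins = 40, 4000
informative = [100, 900, 1500, 2600, 3900]
shift = np.zeros(n_bins)
shift[informative] = 3.0

healthy = rng.standard_normal((n_per_class, n_bins))
disease = rng.standard_normal((n_per_class, n_bins)) + shift
X = np.vstack([healthy, disease])
y = np.array([0] * n_per_class + [1] * n_per_class)

# Sparse screening: keep the k bins with the largest standardized mean
# difference between classes (a simplified proxy for SPA's selection step).
k = 5
diff = X[y == 1].mean(axis=0) - X[y == 0].mean(axis=0)
pooled_sd = np.sqrt(0.5 * (X[y == 1].var(axis=0) + X[y == 0].var(axis=0)))
score = np.abs(diff) / (pooled_sd + 1e-12)
selected = np.sort(np.argsort(score)[-k:])

# Nearest-centroid classifier restricted to the selected bins.
c0 = X[y == 0][:, selected].mean(axis=0)
c1 = X[y == 1][:, selected].mean(axis=0)


def classify(spectrum):
    s = spectrum[selected]
    return int(np.sum((s - c1) ** 2) < np.sum((s - c0) ** 2))


preds = np.array([classify(x) for x in X])
accuracy = (preds == y).mean()
print(selected, accuracy)
```

Because only a few bins are kept, the classifier is both interpretable (the selected bins are candidate biomarkers) and robust to the noise spread across the thousands of uninformative bins, which is the design goal the abstract states.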
