553 research outputs found
Ensemble estimation of multivariate f-divergence
f-divergence estimation is an important problem in the fields of information
theory, machine learning, and statistics. While several divergence estimators
exist, relatively few of their convergence rates are known. We derive the MSE
convergence rate for a density plug-in estimator of f-divergence. Then by
applying the theory of optimally weighted ensemble estimation, we derive a
divergence estimator with a convergence rate of O(1/T) that is simple to
implement and performs well in high dimensions. We validate our theoretical
results with experiments.Comment: 14 pages, 6 figures, a condensed version of this paper was accepted
to ISIT 2014, Version 2: Moved the proofs of the theorems from the main body
to appendices at the en
Meta learning of bounds on the Bayes classifier error
Meta learning uses information from base learners (e.g. classifiers or
estimators) as well as information about the learning problem to improve upon
the performance of a single base learner. For example, the Bayes error rate of
a given feature space, if known, can be used to aid in choosing a classifier,
as well as in feature selection and model selection for the base classifiers
and the meta classifier. Recent work in the field of f-divergence functional
estimation has led to the development of simple and rapidly converging
estimators that can be used to estimate various bounds on the Bayes error. We
estimate multiple bounds on the Bayes error using an estimator that applies
meta learning to slowly converging plug-in estimators to obtain the parametric
convergence rate. We compare the estimated bounds empirically on simulated data
and then estimate the tighter bounds on features extracted from an image patch
analysis of sunspot continuum and magnetogram images.Comment: 6 pages, 3 figures, to appear in proceedings of 2015 IEEE Signal
Processing and SP Education Worksho
Direct Estimation of Information Divergence Using Nearest Neighbor Ratios
We propose a direct estimation method for R\'{e}nyi and f-divergence measures
based on a new graph theoretical interpretation. Suppose that we are given two
sample sets and , respectively with and samples, where
is a constant value. Considering the -nearest neighbor (-NN)
graph of in the joint data set , we show that the average powered
ratio of the number of points to the number of points among all -NN
points is proportional to R\'{e}nyi divergence of and densities. A
similar method can also be used to estimate f-divergence measures. We derive
bias and variance rates, and show that for the class of -H\"{o}lder
smooth functions, the estimator achieves the MSE rate of
. Furthermore, by using a weighted ensemble
estimation technique, for density functions with continuous and bounded
derivatives of up to the order , and some extra conditions at the support
set boundary, we derive an ensemble estimator that achieves the parametric MSE
rate of . Our estimators are more computationally tractable than other
competing estimators, which makes them appealing in many practical
applications.Comment: 2017 IEEE International Symposium on Information Theory (ISIT
Information Theoretic Structure Learning with Confidence
Information theoretic measures (e.g. the Kullback Liebler divergence and
Shannon mutual information) have been used for exploring possibly nonlinear
multivariate dependencies in high dimension. If these dependencies are assumed
to follow a Markov factor graph model, this exploration process is called
structure discovery. For discrete-valued samples, estimates of the information
divergence over the parametric class of multinomial models lead to structure
discovery methods whose mean squared error achieves parametric convergence
rates as the sample size grows. However, a naive application of this method to
continuous nonparametric multivariate models converges much more slowly. In
this paper we introduce a new method for nonparametric structure discovery that
uses weighted ensemble divergence estimators that achieve parametric
convergence rates and obey an asymptotic central limit theorem that facilitates
hypothesis testing and other types of statistical validation.Comment: 10 pages, 3 figure
The intrinsic value of HFO features as a biomarker of epileptic activity
High frequency oscillations (HFOs) are a promising biomarker of epileptic
brain tissue and activity. HFOs additionally serve as a prototypical example of
challenges in the analysis of discrete events in high-temporal resolution,
intracranial EEG data. Two primary challenges are 1) dimensionality reduction,
and 2) assessing feasibility of classification. Dimensionality reduction
assumes that the data lie on a manifold with dimension less than that of the
feature space. However, previous HFO analyses have assumed a linear manifold,
global across time, space (i.e. recording electrode/channel), and individual
patients. Instead, we assess both a) whether linear methods are appropriate and
b) the consistency of the manifold across time, space, and patients. We also
estimate bounds on the Bayes classification error to quantify the distinction
between two classes of HFOs (those occurring during seizures and those
occurring due to other processes). This analysis provides the foundation for
future clinical use of HFO features and buides the analysis for other discrete
events, such as individual action potentials or multi-unit activity.Comment: 5 pages, 5 figure
Image patch analysis and clustering of sunspots: a dimensionality reduction approach
Sunspots, as seen in white light or continuum images, are associated with
regions of high magnetic activity on the Sun, visible on magnetogram images.
Their complexity is correlated with explosive solar activity and so classifying
these active regions is useful for predicting future solar activity. Current
classification of sunspot groups is visually based and suffers from bias.
Supervised learning methods can reduce human bias but fail to optimally
capitalize on the information present in sunspot images. This paper uses two
image modalities (continuum and magnetogram) to characterize the spatial and
modal interactions of sunspot and magnetic active region images and presents a
new approach to cluster the images. Specifically, in the framework of image
patch analysis, we estimate the number of intrinsic parameters required to
describe the spatial and modal dependencies, the correlation between the two
modalities and the corresponding spatial patterns, and examine the phenomena at
different scales within the images. To do this, we use linear and nonlinear
intrinsic dimension estimators, canonical correlation analysis, and
multiresolution analysis of intrinsic dimension.Comment: 5 pages, 7 figures, accepted to ICIP 201
OBCS: The Ontology of Biological and Clinical Statistics
Statistics play a critical role in biological and clinical research. To promote logically consistent representation and classification of statistical entities, we have developed the Ontology of Biological and Clinical Statistics (OBCS). OBCS extends the Ontology of Biomedical Investigations (OBI), an OBO Foundry ontology supported by some 20 communities. Currently, OBCS contains 686 terms, including 381 classes imported from OBI and 147 classes specific to OBCS. The goal of this paper is to present OBCS for community critique and to describe a number of use cases designed to illustrate its potential applications. The OBCS project and source code are available at http://obcs.googlecode.com
Image patch analysis of sunspots and active regions. II. Clustering via matrix factorization
Separating active regions that are quiet from potentially eruptive ones is a
key issue in Space Weather applications. Traditional classification schemes
such as Mount Wilson and McIntosh have been effective in relating an active
region large scale magnetic configuration to its ability to produce eruptive
events. However, their qualitative nature prevents systematic studies of an
active region's evolution for example. We introduce a new clustering of active
regions that is based on the local geometry observed in Line of Sight
magnetogram and continuum images. We use a reduced-dimension representation of
an active region that is obtained by factoring the corresponding data matrix
comprised of local image patches. Two factorizations can be compared via the
definition of appropriate metrics on the resulting factors. The distances
obtained from these metrics are then used to cluster the active regions. We
find that these metrics result in natural clusterings of active regions. The
clusterings are related to large scale descriptors of an active region such as
its size, its local magnetic field distribution, and its complexity as measured
by the Mount Wilson classification scheme. We also find that including data
focused on the neutral line of an active region can result in an increased
correspondence between our clustering results and other active region
descriptors such as the Mount Wilson classifications and the value. We
provide some recommendations for which metrics, matrix factorization
techniques, and regions of interest to use to study active regions.Comment: Accepted for publication in the Journal of Space Weather and Space
Climate (SWSC). 33 pages, 12 figure
Image patch analysis of sunspots and active regions. I. Intrinsic dimension and correlation analysis
The flare-productivity of an active region is observed to be related to its
spatial complexity. Mount Wilson or McIntosh sunspot classifications measure
such complexity but in a categorical way, and may therefore not use all the
information present in the observations. Moreover, such categorical schemes
hinder a systematic study of an active region's evolution for example. We
propose fine-scale quantitative descriptors for an active region's complexity
and relate them to the Mount Wilson classification. We analyze the local
correlation structure within continuum and magnetogram data, as well as the
cross-correlation between continuum and magnetogram data. We compute the
intrinsic dimension, partial correlation, and canonical correlation analysis
(CCA) of image patches of continuum and magnetogram active region images taken
from the SOHO-MDI instrument. We use masks of sunspots derived from continuum
as well as larger masks of magnetic active regions derived from the magnetogram
to analyze separately the core part of an active region from its surrounding
part. We find the relationship between complexity of an active region as
measured by Mount Wilson and the intrinsic dimension of its image patches.
Partial correlation patterns exhibit approximately a third-order Markov
structure. CCA reveals different patterns of correlation between continuum and
magnetogram within the sunspots and in the region surrounding the sunspots.
These results also pave the way for patch-based dictionary learning with a view
towards automatic clustering of active regions.Comment: Accepted for publication in the Journal of Space Weather and Space
Climate (SWSC). 23 pages, 11 figure
- …