Kernel Multivariate Analysis Framework for Supervised Subspace Learning: A Tutorial on Linear and Kernel Multivariate Methods
Feature extraction and dimensionality reduction are important tasks in many
fields of science dealing with signal processing and analysis. The relevance of
these techniques is increasing as current sensory devices are developed with
ever higher resolution, and problems involving multimodal data sources become
more common. A plethora of feature extraction methods are available in the
literature collectively grouped under the field of Multivariate Analysis (MVA).
This paper provides a uniform treatment of several methods: Principal Component
Analysis (PCA), Partial Least Squares (PLS), Canonical Correlation Analysis
(CCA) and Orthonormalized PLS (OPLS), as well as their non-linear extensions
derived by means of the theory of reproducing kernel Hilbert spaces. We also
review their connections to other methods for classification and statistical
dependence estimation, and introduce some recent developments to deal with the
extreme cases of large-scale and low-sized problems. To illustrate the wide
applicability of these methods in both classification and regression problems,
we analyze their performance in a benchmark of publicly available data sets,
and pay special attention to specific real applications involving audio
processing for music genre prediction and hyperspectral satellite images for
Earth and climate monitoring.
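The kernel extensions surveyed in this abstract all follow the same recipe: replace inner products with a kernel, center in feature space, and solve an eigenproblem. A minimal sketch of that recipe for kernel PCA with an RBF kernel is below; the kernel width, component count, and toy data are arbitrary illustrative choices, not anything from the paper.

```python
# Minimal kernel PCA sketch (RBF kernel), NumPy only.  Illustrative:
# gamma and n_components are arbitrary choices; a real application
# would use a tuned, tested implementation.
import numpy as np

def rbf_kernel(X, gamma=1.0):
    # Pairwise squared Euclidean distances -> Gaussian kernel matrix.
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T
    return np.exp(-gamma * d2)

def kernel_pca(X, n_components=2, gamma=1.0):
    n = X.shape[0]
    K = rbf_kernel(X, gamma)
    # Double-center the kernel matrix (data centered in feature space).
    one = np.full((n, n), 1.0 / n)
    Kc = K - one @ K - K @ one + one @ K @ one
    # Eigendecomposition; np.linalg.eigh returns ascending eigenvalues.
    vals, vecs = np.linalg.eigh(Kc)
    idx = np.argsort(vals)[::-1][:n_components]
    vals, vecs = vals[idx], vecs[:, idx]
    # Projections of the training points onto the leading components.
    return vecs * np.sqrt(np.maximum(vals, 0.0))

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
Z = kernel_pca(X, n_components=2, gamma=0.5)
print(Z.shape)
```

The same centered-kernel eigenproblem underlies the kernel variants of PLS, CCA and OPLS; only the objective solved on top of the kernel matrix changes.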
Projected Estimators for Robust Semi-supervised Classification
For semi-supervised techniques to be applied safely in practice, we at least
want methods to outperform their supervised counterparts. We study this
question for classification using the well-known quadratic surrogate loss
function. Using a projection of the supervised estimate onto a set of
constraints imposed by the unlabeled data, we find we can safely improve over
the supervised solution in terms of this quadratic loss. Unlike other
approaches to semi-supervised learning, the procedure does not rely on
assumptions that are not intrinsic to the classifier at hand. It is
theoretically demonstrated that, measured on the labeled and unlabeled training
data, this semi-supervised procedure never gives a lower quadratic loss than
the supervised alternative. To our knowledge this is the first approach that
offers such strong, albeit conservative, guarantees for improvement over the
supervised solution. The characteristics of our approach are explicated using
benchmark datasets to further understand the similarities and differences
between the quadratic loss criterion used in the theoretical results and the
classification accuracy often considered in practice.
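The projection step described in this abstract can be sketched numerically: take the supervised least-squares solution and project it, in the metric induced by all inputs, onto the set of solutions that are least-squares optimal for some soft labeling of the unlabeled points. The sketch below is our reading of that idea, not the authors' implementation; the toy data, the soft-label parametrization, and the projected-gradient solver are all illustrative assumptions.

```python
# Rough NumPy sketch of projecting a supervised least-squares estimate
# onto the constraint set induced by soft labels q in [0, 1] for the
# unlabeled points.  All names and solver settings are our own choices.
import numpy as np

rng = np.random.default_rng(1)
d, n_lab, n_unl = 3, 20, 80
w_true = rng.normal(size=d)
X_l = rng.normal(size=(n_lab, d))
y_l = (X_l @ w_true > 0).astype(float)        # 0/1 labels
X_u = rng.normal(size=(n_unl, d))             # unlabeled inputs

X = np.vstack([X_l, X_u])
M = X.T @ X                                   # metric from all inputs
A = np.linalg.solve(M, X.T)                   # maps labels -> LS solution
A_l, A_u = A[:, :n_lab], A[:, n_lab:]

w_sup = np.linalg.solve(X_l.T @ X_l, X_l.T @ y_l)   # supervised estimate

# Minimize (w(q) - w_sup)^T M (w(q) - w_sup) over q in [0, 1]^U,
# where w(q) = A_l y_l + A_u q, by projected gradient descent.
c = A_l @ y_l - w_sup
H = A_u.T @ M @ A_u
lr = 1.0 / (2.0 * np.linalg.norm(H, 2) + 1e-12)     # step <= 1/Lipschitz
q = np.full(n_unl, 0.5)
for _ in range(500):
    grad = 2.0 * (H @ q + A_u.T @ M @ c)
    q = np.clip(q - lr * grad, 0.0, 1.0)            # projection onto box

w_semi = A_l @ y_l + A_u @ q                        # projected estimate
print(w_sup, w_semi)
```

Because the objective is a convex quadratic in q and the step size is below the inverse Lipschitz constant, each projected-gradient step is non-increasing, mirroring the conservative flavor of the paper's guarantee.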
Flexible Graph-based Learning with Applications to Genetic Data Analysis
With the abundance of increasingly complex and high-dimensional data in many scientific disciplines, graphical models have become an extremely useful statistical tool to explore data structures. In this dissertation, we study graphical models from two perspectives: i) to enhance supervised learning, classification in particular, and ii) graphical model estimation for specific data types. For classification, the optimal classifier is often connected with the feature structure within each class. In the first project, starting from the Gaussian population scenario, we aim to find an approach that utilizes the graphical structure of the features in classification. With respect to graphical models, many existing estimation methods have been proposed based on a homogeneous Gaussian population. Due to the Gaussian assumption, these methods may not be suitable for many typical genetic data. For instance, gene expression data may come from individuals of multiple populations with possibly distinct graphical structures. Another instance is single-cell RNA-sequencing data, which feature substantial sample dependence and zero-inflation. In the second and third projects, we propose graphical model estimation methods for these scenarios respectively. In particular, two dependent count-data graphical models are introduced for the latter case. Both numerical and theoretical studies are performed to demonstrate the effectiveness of these methods.
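The Gaussian graphical models this dissertation starts from rest on one fact: two variables are conditionally independent given the rest exactly when the corresponding entry of the precision (inverse covariance) matrix is zero. A small sanity-check sketch of that fact is below; the chain graph, sample size, and edge weights are toy choices of ours.

```python
# Toy illustration of the Gaussian graphical model identity:
# graph edges <-> non-zero entries of the precision matrix.
import numpy as np

p, n = 4, 20000
# Precision matrix of a chain graph 1-2-3-4: tridiagonal, zero elsewhere.
Theta = np.eye(p) * 2.0
for i in range(p - 1):
    Theta[i, i + 1] = Theta[i + 1, i] = -0.8
Sigma = np.linalg.inv(Theta)

rng = np.random.default_rng(2)
X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
Theta_hat = np.linalg.inv(np.cov(X, rowvar=False))

# Entries for non-adjacent pairs (e.g. variables 0 and 2) are near zero,
# while adjacent pairs recover the -0.8 edge weight.
print(np.round(Theta_hat, 2))
```

Methods such as the graphical lasso estimate a sparse version of `Theta_hat` directly; the dissertation's contribution is relaxing the single homogeneous Gaussian assumption behind this picture.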
Justifying Information-Geometric Causal Inference
Information Geometric Causal Inference (IGCI) is a new approach to
distinguish between cause and effect for two variables. It is based on an
independence assumption between input distribution and causal mechanism that
can be phrased in terms of orthogonality in information space. We describe two
intuitive reinterpretations of this approach that make IGCI more accessible to
a broader audience.
Moreover, we show that the described independence is related to the
hypothesis that unsupervised learning and semi-supervised learning only work
for predicting the cause from the effect and not vice versa.
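A common concrete form of IGCI, which may help make the orthogonality idea tangible, is the slope-based estimator with a uniform reference measure: after normalizing both variables to [0, 1], the score for a direction is the mean log-slope of the sorted scatter, and the direction with the smaller score is inferred as causal. The sketch below uses a toy cubic mechanism of our choosing; the function names and data are illustrative, not from the paper.

```python
# Hedged sketch of the slope-based IGCI estimator (uniform reference
# measure).  The cubic mechanism below is a toy example of ours.
import numpy as np

def igci_score(x, y):
    # Normalize both variables to [0, 1] (uniform reference measure).
    x = (x - x.min()) / (x.max() - x.min())
    y = (y - y.min()) / (y.max() - y.min())
    order = np.argsort(x)
    x, y = x[order], y[order]
    dx, dy = np.diff(x), np.diff(y)
    keep = (dx > 0) & (dy != 0)                 # avoid log(0) and 0-division
    return float(np.mean(np.log(np.abs(dy[keep] / dx[keep]))))

rng = np.random.default_rng(3)
x = rng.uniform(size=2000)
y = x ** 3                      # deterministic monotone mechanism X -> Y
c_xy = igci_score(x, y)         # score for direction X -> Y
c_yx = igci_score(y, x)         # score for direction Y -> X
print("inferred cause:", "X" if c_xy < c_yx else "Y")
```

For an invertible deterministic mechanism the two scores are approximately negatives of each other, so the independence between input distribution and mechanism shows up as an asymmetry in sign.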