
    Kernel Multivariate Analysis Framework for Supervised Subspace Learning: A Tutorial on Linear and Kernel Multivariate Methods

    Feature extraction and dimensionality reduction are important tasks in many fields of science dealing with signal processing and analysis. The relevance of these techniques is increasing as current sensory devices are developed with ever higher resolution, and problems involving multimodal data sources become more common. A plethora of feature extraction methods are available in the literature, collectively grouped under the field of Multivariate Analysis (MVA). This paper provides a uniform treatment of several methods: Principal Component Analysis (PCA), Partial Least Squares (PLS), Canonical Correlation Analysis (CCA) and Orthonormalized PLS (OPLS), as well as their non-linear extensions derived by means of the theory of reproducing kernel Hilbert spaces. We also review their connections to other methods for classification and statistical dependence estimation, and introduce some recent developments to deal with the extreme cases of large-scale and low-sized problems. To illustrate the wide applicability of these methods in both classification and regression problems, we analyze their performance in a benchmark of publicly available data sets, and pay special attention to specific real applications involving audio processing for music genre prediction and hyperspectral satellite images for Earth and climate monitoring.
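To make the linear/kernel connection in this tutorial concrete, below is a minimal NumPy sketch contrasting linear PCA with its kernel counterpart. It is not the authors' code: the toy data, the RBF kernel, and the gamma value are assumptions chosen only for illustration.

```python
# Minimal sketch: linear PCA vs. kernel PCA (RBF kernel).
# Toy data and gamma are assumptions, not the tutorial's settings.
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))          # toy data: 100 samples, 5 features
Xc = X - X.mean(axis=0)                    # center features

# Linear PCA: eigendecomposition of the sample covariance matrix
cov = Xc.T @ Xc / len(Xc)
evals, evecs = np.linalg.eigh(cov)         # ascending eigenvalues
Z_lin = Xc @ evecs[:, ::-1][:, :2]         # project onto top-2 components

# Kernel PCA: eigendecomposition of the doubly centered Gram matrix
gamma = 0.1                                # assumed RBF width
sq = ((Xc[:, None, :] - Xc[None, :, :]) ** 2).sum(-1)
K = np.exp(-gamma * sq)
n = len(K)
H = np.eye(n) - np.ones((n, n)) / n        # centering matrix
Kc = H @ K @ H
kvals, kvecs = np.linalg.eigh(Kc)
alphas = kvecs[:, ::-1][:, :2] / np.sqrt(kvals[::-1][:2])  # unit-norm in feature space
Z_ker = Kc @ alphas                        # nonlinear 2-D embedding
```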

    Projected Estimators for Robust Semi-supervised Classification

    For semi-supervised techniques to be applied safely in practice, we at least want methods to outperform their supervised counterparts. We study this question for classification using the well-known quadratic surrogate loss function. Using a projection of the supervised estimate onto a set of constraints imposed by the unlabeled data, we find we can safely improve over the supervised solution in terms of this quadratic loss. Unlike other approaches to semi-supervised learning, the procedure does not rely on assumptions that are not intrinsic to the classifier at hand. It is theoretically demonstrated that, measured on the labeled and unlabeled training data, this semi-supervised procedure never gives a higher quadratic loss than the supervised alternative. To our knowledge this is the first approach that offers such strong, albeit conservative, guarantees for improvement over the supervised solution. The characteristics of our approach are explicated using benchmark datasets to further understand the similarities and differences between the quadratic loss criterion used in the theoretical results and the classification accuracy often considered in practice. Comment: 13 pages, 2 figures, 1 table
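The abstract's projection idea can be read, in simplified form, as follows: among the least-squares solutions reachable by soft-labeling the unlabeled points, pick the one closest to the supervised estimate. The sketch below implements that simplified reading with a plain Euclidean distance; the paper's actual construction and projection metric may differ, and the toy data are assumptions.

```python
# Hedged sketch of a "projected" semi-supervised least-squares estimator:
# search over soft labels q in [0, 1] for the unlabeled points so that the
# resulting quadratic-loss fit w(q) is closest to the supervised fit w_sup.
# Simplified reading, not the authors' implementation.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
Xl = rng.standard_normal((20, 3))
yl = (Xl[:, 0] > 0).astype(float)                     # toy labeled set
Xu = rng.standard_normal((80, 3))                     # unlabeled pool

def lsq(X, y):
    return np.linalg.lstsq(X, y, rcond=None)[0]       # quadratic-loss fit

w_sup = lsq(Xl, yl)                                   # supervised estimate
Xall = np.vstack([Xl, Xu])

def dist(q):                                          # ||w(q) - w_sup||^2
    w = lsq(Xall, np.concatenate([yl, q]))
    return ((w - w_sup) ** 2).sum()

res = minimize(dist, x0=np.full(len(Xu), 0.5),
               bounds=[(0.0, 1.0)] * len(Xu))         # L-BFGS-B under bounds
w_semi = lsq(Xall, np.concatenate([yl, res.x]))       # projected estimator
```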

    Flexible Graph-based Learning with Applications to Genetic Data Analysis

    With the abundance of increasingly complex and high-dimensional data in many scientific disciplines, graphical models have become an extremely useful statistical tool for exploring data structures. In this dissertation, we study graphical models from two perspectives: i) enhancing supervised learning, classification in particular, and ii) estimating graphical models for specific data types. For classification, the optimal classifier is often connected with the feature structure within each class. In the first project, starting from the Gaussian population scenario, we aim to find an approach that utilizes the graphical structure of the features in classification. With respect to graphical models, many existing estimation methods are based on a homogeneous Gaussian population. Due to the Gaussian assumption, these methods may not be suitable for many typical genetic data types. For instance, gene expression data may come from individuals of multiple populations with possibly distinct graphical structures. Another instance is single-cell RNA-sequencing data, which are characterized by substantial sample dependence and zero inflation. In the second and third projects, we propose graphical model estimation methods for these two scenarios, respectively; in particular, two dependent count-data graphical models are introduced for the latter case. Both numerical and theoretical studies are performed to demonstrate the effectiveness of these methods. Doctor of Philosophy
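As background for the abstract's discussion, the homogeneous-Gaussian baseline it contrasts against is sparse precision-matrix estimation, commonly done via the graphical lasso. A minimal scikit-learn sketch follows; the toy data and the alpha penalty are assumptions, not the dissertation's settings.

```python
# Minimal sketch of the standard Gaussian graphical model baseline:
# sparse inverse-covariance estimation with the graphical lasso.
# Nonzero off-diagonal entries of the precision matrix are graph edges.
import numpy as np
from sklearn.covariance import GraphicalLasso

rng = np.random.default_rng(2)
X = rng.multivariate_normal(mean=np.zeros(4),
                            cov=np.eye(4), size=200)  # toy Gaussian sample

model = GraphicalLasso(alpha=0.1).fit(X)              # assumed penalty alpha
edges = np.abs(model.precision_) > 1e-6               # threshold tiny values
np.fill_diagonal(edges, False)                        # ignore self-loops
print("estimated adjacency:\n", edges.astype(int))
```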

    Justifying Information-Geometric Causal Inference

    Information Geometric Causal Inference (IGCI) is a new approach to distinguish between cause and effect for two variables. It is based on an independence assumption between input distribution and causal mechanism that can be phrased in terms of orthogonality in information space. We describe two intuitive reinterpretations of this approach that make IGCI more accessible to a broader audience. Moreover, we show that the described independence is related to the hypothesis that unsupervised learning and semi-supervised learning only work for predicting the cause from the effect and not vice versa. Comment: 3 figures
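For readers unfamiliar with IGCI, the commonly used slope-based estimator from the IGCI literature compares the mean log-slope of the fitted relation in both directions, after rescaling both variables to [0, 1], and infers the direction with the smaller value. A hedged toy sketch is below; the cubic mechanism is an assumption, not taken from the paper.

```python
# Hedged sketch of the slope-based IGCI estimator (uniform reference measure):
# score each direction by the mean log |slope| over x-sorted consecutive
# pairs and infer the direction with the smaller score.
import numpy as np

def igci_score(x, y):
    """Mean log |dy/dx| over consecutive pairs after sorting by x."""
    order = np.argsort(x)
    xs, ys = x[order], y[order]
    dx, dy = np.diff(xs), np.diff(ys)
    ok = (dx != 0) & (dy != 0)                # skip degenerate pairs
    return np.mean(np.log(np.abs(dy[ok] / dx[ok])))

def rescale(v):                               # map values to [0, 1]
    return (v - v.min()) / (v.max() - v.min())

rng = np.random.default_rng(3)
x = rng.uniform(size=500)
y = x ** 3                                    # assumed toy mechanism, cause -> effect
x, y = rescale(x), rescale(y)
direction = "X->Y" if igci_score(x, y) < igci_score(y, x) else "Y->X"
print(direction)                              # expected: X->Y
```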