    Sparse kernel orthonormalized PLS for feature extraction in large datasets

    In this paper we present a novel multivariate analysis method. Our scheme is based on a novel kernel orthonormalized partial least squares (PLS) variant for feature extraction, imposing sparsity constraints on the solution to improve scalability. The algorithm is tested on a benchmark of UCI data sets and on the analysis of integrated short-time music features for genre prediction. The results show that the method has strong expressive power even with rather few features, clearly outperforms ordinary kernel PLS, and is therefore an appealing method for feature extraction from labelled data.
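
    The core of kernel orthonormalized PLS can be written as a generalized eigenvalue problem on the centered kernel matrix. Below is a minimal sketch of that core, assuming a Gaussian kernel and a one-hot label matrix Y; the sparsity constraints that are the paper's main contribution are omitted here, and the function name kernel_opls is ours.

    import numpy as np
    from scipy.linalg import eigh
    from scipy.spatial.distance import cdist

    def kernel_opls(X, Y, n_components=2, gamma=1.0, reg=1e-6):
        """Kernel OPLS sketch: max tr(A' K Y Y' K A)  s.t.  A' K K A = I."""
        K = np.exp(-gamma * cdist(X, X, "sqeuclidean"))  # Gaussian kernel matrix
        n = K.shape[0]
        H = np.eye(n) - np.ones((n, n)) / n              # centering matrix
        K = H @ K @ H                                    # center in feature space
        A_mat = K @ Y @ Y.T @ K                          # label-covariance term
        B_mat = K @ K + reg * np.eye(n)                  # regularized constraint term
        w, V = eigh(A_mat, B_mat)                        # generalized eigenproblem
        A = V[:, ::-1][:, :n_components]                 # top eigenvectors
        return K @ A                                     # extracted features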

    Kernel Multivariate Analysis Framework for Supervised Subspace Learning: A Tutorial on Linear and Kernel Multivariate Methods

    Feature extraction and dimensionality reduction are important tasks in many fields of science dealing with signal processing and analysis. The relevance of these techniques is increasing as current sensory devices are developed with ever higher resolution, and problems involving multimodal data sources become more common. A plethora of feature extraction methods are available in the literature, collectively grouped under the field of Multivariate Analysis (MVA). This paper provides a uniform treatment of several methods: Principal Component Analysis (PCA), Partial Least Squares (PLS), Canonical Correlation Analysis (CCA) and Orthonormalized PLS (OPLS), as well as their non-linear extensions derived by means of the theory of reproducing kernel Hilbert spaces. We also review their connections to other methods for classification and statistical dependence estimation, and introduce some recent developments to deal with the extreme cases of large-scale and low-sized problems. To illustrate the wide applicability of these methods in both classification and regression problems, we analyze their performance in a benchmark of publicly available data sets, and pay special attention to specific real applications involving audio processing for music genre prediction and hyperspectral satellite images for Earth and climate monitoring.
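
    A compact way to see the uniform treatment is that linear PCA, PLS, CCA and OPLS all reduce to (generalized) eigenvalue problems on sample covariance matrices. The following schematic sketch is our own illustration of that view, not code from the tutorial:

    import numpy as np
    from scipy.linalg import eigh

    def mva_projections(X, Y, method="pca", n_components=2, reg=1e-6):
        """Projection vectors of linear MVA methods as eigenproblems (sketch)."""
        X = X - X.mean(axis=0)
        Y = Y - Y.mean(axis=0)
        n, d = X.shape
        Cxx = X.T @ X / n + reg * np.eye(d)   # input covariance (regularized)
        Cxy = X.T @ Y / n                     # input-output cross-covariance
        if method == "pca":                   # max u'Cxx u          s.t. u'u = 1
            A, B = Cxx, np.eye(d)
        elif method == "pls":                 # max u'Cxy Cyx u      s.t. u'u = 1
            A, B = Cxy @ Cxy.T, np.eye(d)
        elif method == "opls":                # max u'Cxy Cyx u      s.t. u'Cxx u = 1
            A, B = Cxy @ Cxy.T, Cxx
        elif method == "cca":                 # max u'Cxy Cyy^-1 Cyx u  s.t. u'Cxx u = 1
            Cyy = Y.T @ Y / n + reg * np.eye(Y.shape[1])
            A, B = Cxy @ np.linalg.solve(Cyy, Cxy.T), Cxx
        w, V = eigh(A, B)                     # generalized eigendecomposition
        return V[:, ::-1][:, :n_components]   # top n_components projections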

    Optimized parameter search for large datasets of the regularization parameter and feature selection for ridge regression

    In this paper we propose mathematical optimizations to select the optimal regularization parameter for ridge regression using cross-validation. The resulting algorithm is suited for large datasets, and its computational cost does not depend on the size of the training set. We extend this algorithm to forward or backward feature selection, in which the optimal regularization parameter is selected for each candidate feature set. These feature selection algorithms yield solutions with a sparse weight matrix using a quadratic cost on the norm of the weights. The computational complexity of a naive approach to optimizing the ridge regression parameter grows with the number of applied regularization parameters, the number of folds in the validation set, the number of input features, and the number of data samples in the training set. The computational cost of our implementation is smaller than that of regression without regularization for large datasets, and is independent of the number of applied regularization parameters and of the size of the training set. Combined with forward or backward feature selection, the cost additionally depends on the number of selected or removed features, respectively, but remains an order of magnitude faster than the naive implementation for large datasets. To show the performance and the reduction in computational cost, we apply this technique to train recurrent neural networks using the reservoir computing approach, windowed ridge regression, least-squares support vector machines (LS-SVMs) in primal space using the fixed-size LS-SVM approximation, and extreme learning machines.
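
    The flavor of such an optimization can be sketched as follows: compute the feature covariance and a single eigendecomposition once, after which the ridge weights (and hence any validation error) can be re-evaluated for any number of regularization values at a cost independent of the training-set size. This is a generic illustration of the idea, not the paper's exact algorithm:

    import numpy as np

    def ridge_path(X, Y, lambdas):
        """Ridge weights for many regularization values from one eigendecomposition."""
        Y = Y[:, None] if Y.ndim == 1 else Y  # ensure (n, m) target matrix
        C = X.T @ X                           # feature covariance, computed once
        b = X.T @ Y                           # feature-target correlations, computed once
        w, V = np.linalg.eigh(C)              # single eigendecomposition of C
        Vb = V.T @ b
        # each extra lambda now only rescales in the eigenbasis:
        # W(lambda) = V diag(1 / (w + lambda)) V' X'Y
        return [V @ (Vb / (w + lam)[:, None]) for lam in lambdas]

    # usage sketch: W_list = ridge_path(X_train, Y_train, np.logspace(-6, 2, 50))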

    SlimPLS: A Method for Feature Selection in Gene Expression-Based Disease Classification

    A major challenge in biomedical studies in recent years has been the classification of gene expression profiles into categories, such as cases and controls. This is done by first training a classifier on a training set containing labeled samples from the two populations, and then using that classifier to predict the labels of new samples. Such predictions have recently been shown to improve the diagnosis and treatment selection practices for several diseases. This procedure is complicated, however, by the high dimensionality of the data. While microarrays can measure the levels of thousands of genes per sample, case-control microarray studies usually involve no more than several dozen samples. Standard classifiers do not work well in these situations, where the number of features (gene expression levels measured in the microarrays) far exceeds the number of samples. Selecting only the features that are most relevant for discriminating between the two categories can help construct better classifiers, in terms of both accuracy and efficiency. In this work we developed a novel method for multivariate feature selection based on the Partial Least Squares algorithm. We compared the method's variants with common feature selection techniques across a large number of real case-control datasets, using several classifiers, and demonstrate the advantages of the method and the preferable combinations of classifier and feature selection technique.
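
    The general flavor of PLS-driven feature selection (not SlimPLS's exact criteria) can be sketched with scikit-learn: fit a PLS model and rank genes by the magnitude of their weights across the first components. The helper name pls_feature_ranking and the equal weighting of components are our assumptions:

    import numpy as np
    from sklearn.cross_decomposition import PLSRegression

    def pls_feature_ranking(X, y, n_components=3, top_k=50):
        """Rank features by their aggregate absolute PLS weight (sketch)."""
        pls = PLSRegression(n_components=n_components)
        pls.fit(X, y)                               # y: case/control labels as 0/1
        # x_weights_ has shape (n_features, n_components); SlimPLS applies more
        # refined, component-aware selection criteria than this plain sum
        scores = np.abs(pls.x_weights_).sum(axis=1)
        return np.argsort(scores)[::-1][:top_k]     # indices of top-ranked genes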

    Sparse and kernel OPLS feature extraction based on eigenvalue problem solving

    Orthonormalized partial least squares (OPLS) is a popular multivariate analysis method for supervised feature extraction. In machine learning papers, OPLS projections are usually obtained by solving a generalized eigenvalue problem, whereas statistical papers typically formulate the method as a reduced-rank regression problem, leading to a formulation based on a standard eigenvalue decomposition. A first contribution of this paper is to derive explicit expressions for matching the OPLS solutions obtained under both approaches, and to argue that the standard eigenvalue formulation is normally more convenient for feature extraction in machine learning. More importantly, since optimization with respect to the projection vectors is carried out without constraints via a minimization problem, the inclusion of penalty terms that favor sparsity is straightforward. In the paper, we exploit this fact to propose modified versions of OPLS. In particular, relying on the ℓ1 norm, we propose a sparse version of linear OPLS, as well as a non-linear kernel OPLS with pattern selection. We also incorporate a group-lasso penalty to derive an OPLS method with true feature selection. The discriminative power of the proposed methods is analyzed on a benchmark of classification problems. Furthermore, we study the degree of sparsity achieved by our methods and compare them with other state-of-the-art methods for sparse feature extraction. This work was partly supported by MINECO projects TEC2011-22480 and PRIPIBIN-2011-1266.
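
    The key step, recasting OPLS as an unconstrained reduced-rank regression so that sparsity penalties apply directly, can be sketched as an alternating scheme in which the projection update is an ℓ1-penalized least-squares fit. This is a schematic illustration under our own simplifications (scikit-learn's Lasso as the penalized solver, no orthonormalization of the loadings), not the authors' exact eigenvalue-based algorithm:

    import numpy as np
    from sklearn.linear_model import Lasso

    def sparse_opls(X, Y, n_components=2, alpha=0.01, n_iter=20):
        """Sparse OPLS sketch via alternating reduced-rank regression."""
        rng = np.random.default_rng(0)
        Q = rng.standard_normal((Y.shape[1], n_components))  # initial loadings
        for _ in range(n_iter):
            # W-step: min ||Y Q - X W||^2 + alpha ||W||_1  -> sparse projections
            lasso = Lasso(alpha=alpha, fit_intercept=False)
            lasso.fit(X, Y @ Q)
            W = lasso.coef_.T                        # (n_features, n_components)
            # Q-step: least-squares loadings given projected data T = X W
            sol, *_ = np.linalg.lstsq(X @ W, Y, rcond=None)
            Q = sol.T                                # (n_targets, n_components)
        return W, Q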

    Modelos de Regresión PLS Aplicados a Variables Educativas

    We study the behavior of educational variables associated with high-quality indicators in the undergraduate programs of the Universidad Nacional de Colombia - Sede Manizales, with the aim of building multivariate regression models that project future behavior and establish relationships between the variables associated with the factors of students, teaching, academic processes and research, and those related to quality guarantees, recognition and assurance. The models are built and structured using the PLS and Kernel PLS techniques, with the pertinent adjustments and validations. The resulting models jointly improve both predictive and explanatory power, yielding useful information for decision-making in academic and administrative settings.

    Análisis multivariante: soluciones eficientes e interpretables

    Currently, there is a growing tendency to store huge amounts of data in order to analyze them and extract useful information from them. However, their treatment is not trivial, and the application of data analysis methods can suffer from a multitude of problems, such as overfitting or multicollinearity caused by the existence of highly correlated variables. A preliminary feature extraction stage that reduces the dimensionality of the data and eliminates these harmful multicollinearities between variables is therefore crucial for applying such data analysis techniques appropriately and efficiently. In particular, multivariate analysis (MVA) methods, which extract a new set of features representative of the problem, enjoy wide popularity and have been successfully applied in a large number of real-world applications. However, when the aim is to obtain knowledge from the captured data, not just good performance of the designed system, the ability to produce interpretable solutions that allow a better understanding of the problem is also required. It is therefore desirable to modify these MVA methods, specializing them to the needs of the problem, in order to obtain such interpretability. In this doctoral thesis, we study MVA methods in detail and present a general framework that encompasses them, in particular those that obtain mutually orthogonal features. This in-depth study enables an extension of the general framework that facilitates the inclusion of additional constraints, providing additional abilities such as the desired interpretability. To demonstrate the versatility of this framework, we propose specialized MVA solutions for four particular cases that require completely different interpretations of the problem: MVA solutions that are sparse in the extracted features; MVA solutions that are sparse in features extracted from non-linear relationships among variables; MVA solutions that allow the selection of the relevant variables; and non-negative MVA solutions for the supervised design of filter banks. Although some specialized solutions can be found in the literature, we show both theoretically and experimentally that they suffer from serious initialization problems, and from conceptual problems in terms of being considered authentic MVA methods. The validity of the proposals presented in this thesis is demonstrated through a series of experiments using real-world data.
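
    Of the four specialized solutions, the variable-selection variant is the easiest to sketch: a row-wise group penalty on the projection matrix zeroes out entire input variables at once. The following is a minimal proximal-gradient illustration under our own assumptions (the projected targets T come from a reduced-rank formulation, and the function name is ours):

    import numpy as np

    def group_sparse_projections(X, T, alpha=0.1, n_iter=200):
        """Proximal-gradient sketch of min 0.5||T - X W||^2 + alpha * sum_j ||W[j,:]||_2.
        The row-wise group penalty discards entire input variables (zero rows of W)."""
        d, k = X.shape[1], T.shape[1]
        W = np.zeros((d, k))
        lr = 1.0 / np.linalg.norm(X, 2) ** 2         # step size from Lipschitz bound
        for _ in range(n_iter):
            W = W - lr * (X.T @ (X @ W - T))         # gradient step on the LS term
            norms = np.linalg.norm(W, axis=1, keepdims=True)
            shrink = np.maximum(0.0, 1.0 - lr * alpha / np.maximum(norms, 1e-12))
            W = W * shrink                           # row-wise group soft-threshold
        return W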