
    Dimension reduction via principal variables

    For many large-scale datasets it is necessary to reduce dimensionality to the point where further exploration and analysis can take place. Principal variables are a subset of the original variables and preserve, to some extent, the structure and information carried by the original variables. Dimension reduction using principal variables is considered and a novel algorithm for determining such principal variables is proposed. This method is tested and compared with 11 other variable selection methods from the literature in a simulation study and is shown to be highly effective. Extensions to this procedure are also developed, including a method to determine longitudinal principal variables for repeated measures data, and a technique for incorporating utilities in order to modify the selection process. The method is further illustrated with real datasets, including a larger UK dataset relating to patient outcomes after total knee replacement.
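
    A minimal sketch of the general idea, not the paper's algorithm: one simple greedy scheme picks, at each step, the variable that explains the most variance in the remaining variables, then partials it out. The function name and scoring rule below are illustrative assumptions.

    ```python
    import numpy as np

    def principal_variables(X, k):
        """Greedy forward selection of k principal variables (illustrative).

        At each step, choose the variable that explains the most variance
        in the other columns, then residualise every column on it.
        """
        Xr = X - X.mean(axis=0)              # centre the data
        selected = []
        for _ in range(k):
            cov = Xr.T @ Xr
            var = np.diag(cov)
            # variance of all columns explained by each candidate variable
            score = (cov ** 2).sum(axis=1) / np.where(var > 0, var, np.inf)
            score[selected] = -np.inf        # never pick a variable twice
            j = int(np.argmax(score))
            selected.append(j)
            xj = Xr[:, [j]]                  # partial the winner out
            Xr = Xr - xj @ (xj.T @ Xr) / (xj.T @ xj)
        return selected
    ```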

    Neural Networks Principal Component Analysis for Estimating the Generative Multifactor Model of Returns under a Statistical Approach to the Arbitrage Pricing Theory: Evidence from the Mexican Stock Exchange

    A nonlinear principal component analysis (NLPCA) is an extension of standard principal component analysis (PCA) that overcomes PCA's assumption of model linearity. NLPCA belongs to the family of nonlinear dimension reduction and latent-feature extraction techniques, including nonlinear factor analysis and nonlinear independent component analysis, in which the principal components are generalized from straight lines to curves. NLPCA can be achieved via an artificial neural network specification in which the classic PCA model is generalized to a nonlinear model, namely Neural Networks Principal Component Analysis (NNPCA). In order to extract a set of nonlinear underlying systematic risk factors, we estimate the generative multifactor model of returns in a statistical version of the Arbitrage Pricing Theory (APT), in the context of the Mexican Stock Exchange. We used an auto-associative multilayer perceptron neural network, or autoencoder, where the 'bottleneck' layer represents the nonlinear principal components, or in our context, the scores of the underlying systematic risk factors. This network performs a nonlinear transformation of the observed variables into the nonlinear principal components and executes a nonlinear mapping that reproduces the original variables. We propose a network architecture that generates a loading matrix, enabling a first step toward interpreting the extracted latent risk factors. In addition, we used a two-stage methodology for the econometric testing of the APT: first, a simultaneous estimation of the system of equations via Seemingly Unrelated Regression (SUR); and second, a cross-sectional estimation via Ordinary Least Squares corrected for heteroskedasticity and autocorrelation by means of Newey-West heteroskedasticity and autocorrelation consistent (HAC) covariance estimates. The evidence shows that reconstructions of the observed returns using the components estimated via NNPCA are adequate in almost all cases; nevertheless, the econometric test results lead us to a partial acceptance of the APT in the samples and periods studied.
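
    As a rough illustration of the auto-associative construction described above, the sketch below trains an autoencoder whose bottleneck layer plays the role of the nonlinear principal components, i.e. the scores of the latent risk factors. The layer sizes, activation and training settings are placeholder assumptions, not the paper's architecture.

    ```python
    import torch
    import torch.nn as nn

    class Autoencoder(nn.Module):
        """Auto-associative network; the bottleneck holds the nonlinear PCs."""
        def __init__(self, n_assets, n_factors):
            super().__init__()
            self.encoder = nn.Sequential(
                nn.Linear(n_assets, 16), nn.Tanh(),
                nn.Linear(16, n_factors),        # bottleneck: factor scores
            )
            self.decoder = nn.Sequential(
                nn.Linear(n_factors, 16), nn.Tanh(),
                nn.Linear(16, n_assets),         # reconstruct observed returns
            )

        def forward(self, x):
            z = self.encoder(x)                  # nonlinear principal components
            return self.decoder(z), z

    def fit_nnpca(returns, n_factors=3, epochs=2000, lr=1e-3):
        """returns: (n_periods, n_assets) tensor of standardised returns."""
        model = Autoencoder(returns.shape[1], n_factors)
        opt = torch.optim.Adam(model.parameters(), lr=lr)
        for _ in range(epochs):
            opt.zero_grad()
            recon, _ = model(returns)
            nn.functional.mse_loss(recon, returns).backward()  # reconstruction error
            opt.step()
        with torch.no_grad():
            _, scores = model(returns)           # latent risk-factor scores
        return model, scores
    ```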

    A statistical downscaling framework for environmental mapping

    In recent years, knowledge extraction from data has become increasingly popular, with numerical forecasting models falling mainly into two major categories: chemical transport models (CTMs) and conventional statistical methods. However, due to data and model variability, data-driven knowledge extraction from high-dimensional, multifaceted data in such applications requires generalisation from global to regional or local conditions. Typically, generalisation is achieved by mapping global conditions to local ecosystems and human habitats, which amounts to tracking and monitoring environmental dynamics in various geographical areas and their regional and global implications for human livelihood. Statistical downscaling techniques have been widely used to extract high-resolution information from regional-scale variables produced by CTMs in climate models. Conventional applications of these methods are predominantly dimension-reduction exercises, designed to reduce the spatial dimension of gridded model outputs without loss of essential spatial information. Their downside is twofold: complete dependence on an unlabelled design matrix and reliance on underlying distributional assumptions. We propose a novel statistical downscaling framework for dealing with data and model variability. Its power derives from training and testing multiple models on multiple samples, narrowing global environmental phenomena down to regional discordance through dimension reduction and visualisation. Hourly ground-level ozone observations were obtained from environmental stations maintained by the US Environmental Protection Agency, covering the summer period (June–August 2005). Regional patterns of ozone are related to local observations via repeated runs and performance assessment of multiple versions of empirical orthogonal functions, or principal components, and principal fitted components, via an algorithm with fully adaptable parameters. We demonstrate how the algorithm can be extended to weather-dependent and other applications with inherent data randomness and model variability via its built-in interdisciplinary computational power that connects data sources with end-users.
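
    A stripped-down sketch of the downscaling core, under assumed names and shapes: compress the gridded CTM field with PCA (empirical orthogonal functions) and regress a station's observations on the leading scores. The framework's repeated multi-model, multi-sample runs are not reproduced here.

    ```python
    from sklearn.decomposition import PCA
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import train_test_split

    def downscale(ctm_grid, station_obs, n_components=10):
        """EOF/PC downscaling sketch.

        ctm_grid:    (n_hours, n_grid_cells) regional-scale ozone field
        station_obs: (n_hours,) local ground-level ozone observations
        """
        Xtr, Xte, ytr, yte = train_test_split(ctm_grid, station_obs,
                                              test_size=0.25)
        pca = PCA(n_components=n_components).fit(Xtr)   # EOFs of the gridded field
        reg = LinearRegression().fit(pca.transform(Xtr), ytr)
        r2 = reg.score(pca.transform(Xte), yte)         # out-of-sample fit
        return pca, reg, r2
    ```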

    Learning Methods in Reproducing Kernel Hilbert Space Based on High-dimensional Features

    The first topic focuses on dimension reduction via regularization. We propose selecting principal components via the LASSO. This method assumes that some unknown latent variables are related to the response under a highly correlated covariate structure. L1 regularization plays a key role in adaptively finding a few linear combinations, in contrast to the conventional approach of employing a few leading principal components. The consistency of the regression coefficients and of the selected model is proved asymptotically, and numerical results are shown to support our proposal. The proposed method is applied to microarray data and cancer data. The second and third topics focus on independent screening and dimension reduction with machine learning approaches using positive definite kernels. A key ingredient of these approaches is reproducing kernel Hilbert space (RKHS) theory. Specifically, we propose the Multiple Projection Model (MPM) and the Single Index Latent Factor Model (SILFM) to build accurate prediction models for clinical outcomes based on a massive number of features. MPM and SILFM can be summarized as three-stage estimation: screening, dimension reduction, and nonlinear fitting. The screening and dimension reduction steps are the distinctive contributions of the two methods. The convergence property of the proposed screening method and the risk bound for SILFM are systematically investigated, and results from several simulation scenarios support them. The proposed methods are applied to brain imaging data and the associated clinical behavioral responses.
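
    The first topic's idea can be sketched as follows: let an L1 penalty choose among component scores rather than keeping the leading components by variance. The function and the use of LassoCV below are hypothetical stand-ins for the thesis's estimator, shown only to make the selection mechanism concrete.

    ```python
    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.linear_model import LassoCV

    def select_components(X, y, max_pcs=20):
        """Let the lasso pick predictive principal components (sketch)."""
        pca = PCA(n_components=min(max_pcs, X.shape[1])).fit(X)
        Z = pca.transform(X)                  # principal component scores
        lasso = LassoCV(cv=5).fit(Z, y)       # L1 path with cross-validation
        kept = np.flatnonzero(lasso.coef_)    # components the penalty retains
        return pca, lasso, kept
    ```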

    Methods for Estimation of Intrinsic Dimensionality

    Dimension reduction is an important tool used to describe the structure of complex data (explicitly or implicitly) through a small but sufficient number of variables, thereby making data analysis more efficient; it is also useful for visualization purposes. Dimension reduction helps statisticians to overcome the 'curse of dimensionality'. However, most dimension reduction techniques require the intrinsic dimension of the low-dimensional subspace to be fixed in advance, so the availability of reliable intrinsic dimension (ID) estimation techniques is of major importance. The main goal of this thesis is to develop algorithms for determining the intrinsic dimension of recorded data sets in a nonlinear context. Whilst this is a well-researched topic for linear subspaces, based mainly on principal components analysis, relatively little attention has been paid to ways of estimating this number for non-linear variable interrelationships. The algorithms proposed here are based on existing concepts that can be categorized into local methods, relying on randomly selected subsets of a recorded variable set, and global methods, utilizing the entire data set. This thesis provides an overview of ID estimation techniques, with special consideration given to recent developments in non-linear techniques, such as manifold charting and fractal-based methods. Although these techniques are well established in principle, their practical implementation is far from straightforward. The intrinsic dimension is estimated via Brand's algorithm by examining the growth point process, which counts the number of points in hyper-spheres; the estimation requires a starting point for each hyper-sphere, and we provide settings for selecting starting points that work well for most data sets. Additionally, we propose approaches for estimating dimensionality via Brand's algorithm, the Dip method and the Regression method. Other approaches estimate the intrinsic dimension with fractal dimension estimation methods, which exploit the intrinsic geometry of a data set. The most popular concept from this family of methods is the correlation dimension, which requires estimating the correlation integral for a ball of radius tending to 0. In this thesis we propose new approaches to approximate the correlation integral in this limit: the Intercept method, the Slope method and the Polynomial method. In addition we propose a localized global method, which can be seen as a local version of global ID methods; its objective is to improve on algorithms based on local ID methods by significantly reducing their negative bias. Experimental results on real-world and simulated data are used to demonstrate the algorithms and compare them with other methodology, and a simulation study verifies the effectiveness of the proposed methods. Finally, the algorithms are contrasted using a recorded data set from an industrial melter process.
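
    For the correlation-dimension family mentioned above, a minimal Grassberger-Procaccia style estimate fits the slope of log C(r) against log r over a scaling region. The thesis's Intercept, Slope and Polynomial refinements of the small-radius limit are not reproduced here, and the choice of radii is left to the caller as an assumption.

    ```python
    import numpy as np
    from scipy.spatial.distance import pdist

    def correlation_dimension(X, radii):
        """Grassberger-Procaccia sketch: C(r) ~ r**d for small r.

        radii: 1-D array of radii inside the scaling region (an assumption).
        """
        dists = pdist(X)                                    # all pairwise distances
        C = np.array([(dists < r).mean() for r in radii])   # correlation integral
        ok = C > 0                                          # avoid log(0)
        slope, _ = np.polyfit(np.log(radii[ok]), np.log(C[ok]), 1)
        return slope                                        # estimated intrinsic dimension
    ```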

    Manifold Elastic Net: A Unified Framework for Sparse Dimension Reduction

    It is difficult to find the optimal sparse solution of a manifold learning based dimensionality reduction algorithm. The lasso or elastic net penalized manifold learning based dimensionality reduction is not directly a lasso penalized least squares problem, and thus least angle regression (LARS) (Efron et al., 2004), one of the most popular algorithms in sparse learning, cannot be applied. Therefore, most current approaches take indirect routes or impose strict settings, which can be inconvenient for applications. In this paper, we propose the manifold elastic net, or MEN for short. MEN incorporates the merits of both manifold learning based and sparse learning based dimensionality reduction. By using a series of equivalent transformations, we show that MEN is equivalent to a lasso penalized least squares problem, and thus LARS can be adopted to obtain its optimal sparse solution. In particular, MEN has the following advantages for subsequent classification: 1) the local geometry of samples is well preserved in the low-dimensional data representation; 2) both margin maximization and classification error minimization are considered in computing the sparse projection; 3) the projection matrix of MEN improves parsimony in computation; 4) the elastic net penalty reduces over-fitting; and 5) the projection matrix of MEN can be interpreted psychologically and physiologically. Experimental evidence on face recognition over various popular datasets suggests that MEN is superior to leading dimensionality reduction algorithms.
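
    MEN's own chain of transformations is more involved than can be shown here; the sketch below illustrates only the classical elastic-net-to-lasso augmentation of Zou and Hastie (2005), which shows how an elastic net problem becomes an ordinary lasso that LARS can solve. The penalty scaling is a convention-dependent assumption.

    ```python
    import numpy as np
    from sklearn.linear_model import LassoLars

    def elastic_net_via_lars(X, y, lam1, lam2):
        """Zou-Hastie augmentation (sketch): solve an elastic net as a
        lasso on augmented data, so LARS applies."""
        p = X.shape[1]
        Xa = np.vstack([X, np.sqrt(lam2) * np.eye(p)]) / np.sqrt(1 + lam2)
        ya = np.concatenate([y, np.zeros(p)])
        # the alpha scaling below depends on the solver's convention
        fit = LassoLars(alpha=lam1).fit(Xa, ya)    # lasso solved by LARS
        return fit.coef_ / np.sqrt(1 + lam2)       # naive elastic-net coefficients
    ```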

    Prediction with Dimension Reduction of Multiple Molecular Data Sources for Patient Survival

    Predictive modeling from high-dimensional genomic data is often preceded by a dimension reduction step, such as principal components analysis (PCA). However, the application of PCA is not straightforward for multi-source data, wherein multiple sources of 'omics data measure different but related biological components. In this article we utilize recent advances in the dimension reduction of multi-source data for predictive modeling. In particular, we apply exploratory results from Joint and Individual Variation Explained (JIVE), an extension of PCA for multi-source data, to the prediction of differing response types. We conduct simulations to illustrate the practical advantages and interpretability of our approach. As an application we consider predicting survival for Glioblastoma Multiforme (GBM) patients from three data sources measuring mRNA expression, miRNA expression, and DNA methylation. We also introduce a method to estimate JIVE scores for new samples that were not used in the initial dimension reduction, and study its theoretical properties; this method is implemented in the R package r.jive on CRAN, in the function 'jive.predict'.
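
    The flavour of the new-sample score estimation can be sketched as a least-squares projection onto loadings held fixed from training. This conveys only the intuition, in Python and with assumed names and shapes; the actual method is the 'jive.predict' function in the R package r.jive.

    ```python
    import numpy as np

    def jive_scores_for_new_samples(U, new_sources):
        """Project new samples onto fixed training loadings (sketch).

        U:           (total feature dim, r) stacked joint loading matrix
                     from the training-stage JIVE fit (assumed given)
        new_sources: list of (n_new, p_k) blocks, same sources and order
                     as in training
        """
        X_new = np.hstack(new_sources)                    # concatenate sources
        S, *_ = np.linalg.lstsq(U, X_new.T, rcond=None)   # U @ S ≈ X_new.T
        return S.T                                        # (n_new, r) joint scores
    ```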