
    Post Selection Shrinkage Estimation for High Dimensional Data Analysis

    In high-dimensional data settings where $p \gg n$, many penalized regularization approaches have been studied for simultaneous variable selection and estimation. However, when some covariates have weak effects, many existing variable selection methods, including Lasso and its generations, cannot distinguish covariates with weak contributions from those with none. Thus, prediction based only on a subset model of selected covariates can be inefficient. In this paper, we propose a post selection shrinkage estimation strategy to improve the prediction performance of a selected subset model. Such a post selection shrinkage estimator (PSE) is data-adaptive and constructed by shrinking a post selection weighted ridge estimator in the direction of a selected candidate subset. Under an asymptotic distributional quadratic risk criterion, its prediction performance is explored analytically. We show that the proposed post selection PSE performs better than the post selection weighted ridge estimator. More importantly, it significantly improves the prediction performance of any candidate subset model selected by most existing Lasso-type variable selection methods. The relative performance of the post selection PSE is demonstrated by both simulation studies and real data analysis. Comment: 40 pages, 2 figures, discussion paper

    LONGITUDINAL HIGH-DIMENSIONAL DATA ANALYSIS

    We develop a flexible framework for modeling high-dimensional functional and imaging data observed longitudinally. The approach decomposes the observed variability of high-dimensional observations measured at multiple visits into three additive components: a subject-specific functional random intercept that quantifies the cross-sectional variability, a subject-specific functional slope that quantifies the dynamic irreversible deformation over multiple visits, and a subject-visit specific functional deviation that quantifies exchangeable or reversible visit-to-visit changes. The proposed method is very fast, scalable to studies including ultra-high dimensional data, and can easily be adapted to and executed on modest computing infrastructures. The method is applied to the longitudinal analysis of diffusion tensor imaging (DTI) data of the corpus callosum of multiple sclerosis (MS) subjects. The study includes 176 subjects observed at 466 visits. For each subject and visit, the study contains a registered DTI scan of the corpus callosum at roughly 30,000 voxels.
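
    The three-component decomposition described in this abstract can be written compactly as below. This is only a sketch: the notation is assumed here for illustration (subject $i$, visit $j$, voxel $v$, visit time $T_{ij}$, and a fixed mean term $\eta(v)$) and is not taken from the abstract itself.

```latex
\[
Y_{ij}(v) \;=\; \eta(v) \;+\; \underbrace{X_{i,0}(v)}_{\text{random intercept}}
\;+\; \underbrace{X_{i,1}(v)\,T_{ij}}_{\text{random slope}}
\;+\; \underbrace{W_{ij}(v)}_{\text{visit-specific deviation}}
\]
```

    Under this reading, $X_{i,0}$ captures cross-sectional variability, $X_{i,1}$ the irreversible change over time, and $W_{ij}$ the exchangeable visit-to-visit variation.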

    Viewpoints: A high-performance high-dimensional exploratory data analysis tool

    Scientific data sets continue to increase in both size and complexity. In the past, dedicated graphics systems at supercomputing centers were required to visualize large data sets, but as the price of commodity graphics hardware has dropped and its capability has increased, it is now possible, in principle, to view large complex data sets on a single workstation. To do this in practice, an investigator will need software that is written to take advantage of the relevant graphics hardware. The Viewpoints visualization package described herein is an example of such software. Viewpoints is an interactive tool for exploratory visual analysis of large, high-dimensional (multivariate) data. It leverages the capabilities of modern graphics boards (GPUs) to run on a single workstation or laptop. Viewpoints is minimalist: it attempts to do a small set of useful things very well (or at least very quickly) in comparison with similar packages today. Its basic feature set includes linked scatter plots with brushing, dynamic histograms, normalization and outlier detection/removal. Viewpoints was originally designed for astrophysicists, but it has since been used in a variety of fields that range from astronomy, quantum chemistry, fluid dynamics, machine learning, bioinformatics, and finance to information technology server log mining. In this article, we describe the Viewpoints package and show examples of its usage. Comment: 18 pages, 3 figures, PASP in press, this version corresponds more closely to that to be published

    Distribution-free factor analysis - Estimation theory and applicability to high-dimensional data

    We here provide a distribution-free approach to the random factor analysis model. We show that it leads to the same estimating equations as for the classical ML estimates under normality, but more easily derived, and valid also in the case of more variables than observations ($p > n$). For this case we also advocate a simple iteration method. In an illustration with $p = 2000$ and $n = 22$ it was seen to lead to convergence after just a few iterations. We show that there is no reason to expect Heywood cases to appear, and that the factor scores will typically be precisely estimated/predicted as soon as $p$ is large. We state as a general conjecture that the nice behaviour is not despite $p > n$, but because $p > n$. Comment: 12 pages, 2 figures

    Statistical Methods in Topological Data Analysis for Complex, High-Dimensional Data

    The utilization of statistical methods and their applications within the new field of study known as Topological Data Analysis has tremendous potential for broadening our exploration and understanding of complex, high-dimensional data spaces. This paper provides an introductory overview of the mathematical underpinnings of Topological Data Analysis, the workflow to convert samples of data to topological summary statistics, and some of the statistical methods developed for performing inference on these topological summary statistics. The intention of this non-technical overview is to motivate statisticians who are interested in learning more about the subject. Comment: 15 pages, 7 figures, 27th Annual Conference on Applied Statistics in Agriculture

    Projection Pursuit for Exploratory Supervised Classification

    In high-dimensional data, one often seeks a few interesting low-dimensional projections that reveal important features of the data. Projection pursuit is a procedure for searching high-dimensional data for interesting low-dimensional projections via the optimization of a criterion function called the projection pursuit index. Very few projection pursuit indices incorporate class or group information in the calculation. Hence, they cannot be adequately applied in supervised classification problems to provide low-dimensional projections revealing class differences in the data. We introduce new indices derived from linear discriminant analysis that can be used for exploratory supervised classification. Keywords: Data mining, Exploratory multivariate data analysis, Gene expression data, Discriminant analysis
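
    To make the idea of a class-aware projection pursuit index concrete, here is a minimal sketch. It assumes one common LDA-style form, 1 - |A'WA| / |A'(W+B)A|, where W and B are the within-group and between-group sums of squares; the abstract does not state the exact formula, so the function name `lda_index` and this definition are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np


def lda_index(X, y, A):
    """Assumed LDA-style projection index: 1 - |A'WA| / |A'(W+B)A|.

    X : (n, p) data matrix
    y : (n,) class labels
    A : (p,) or (p, k) projection matrix
    Values near 1 mean the projection separates the classes well.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    A = np.asarray(A, dtype=float).reshape(X.shape[1], -1)

    p = X.shape[1]
    grand_mean = X.mean(axis=0)
    W = np.zeros((p, p))  # within-group sums of squares
    B = np.zeros((p, p))  # between-group sums of squares
    for g in np.unique(y):
        Xg = X[y == g]
        mean_g = Xg.mean(axis=0)
        centered = Xg - mean_g
        W += centered.T @ centered
        d = (mean_g - grand_mean)[:, None]
        B += len(Xg) * (d @ d.T)

    num = np.linalg.det(A.T @ W @ A)
    den = np.linalg.det(A.T @ (W + B) @ A)
    return 1.0 - num / den


# Toy data: two classes that differ only along the first coordinate.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = np.repeat([0, 1], 50)
X_demo[y_demo == 1, 0] += 5.0

idx_separating = lda_index(X_demo, y_demo, np.array([1.0, 0.0, 0.0]))
idx_uninformative = lda_index(X_demo, y_demo, np.array([0.0, 1.0, 0.0]))
```

    On this toy data the index for the first axis should be much larger than for the second; a projection pursuit search would then optimize such an index over A.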