32 research outputs found

    A Selective Review of Group Selection in High-Dimensional Models

    Grouping structures arise naturally in many statistical modeling problems. Several methods have been proposed for variable selection that respect grouping structure in variables. Examples include the group LASSO and several concave group selection methods. In this article, we give a selective review of group selection concerning methodological developments, theoretical properties and computational algorithms. We pay particular attention to group selection methods involving concave penalties. We address both group selection and bi-level selection methods. We describe several applications of these methods in nonparametric additive models, semiparametric regression, seemingly unrelated regressions, genomic data analysis and genome-wide association studies. We also highlight some issues that require further study. Comment: Published at http://dx.doi.org/10.1214/12-STS392 in Statistical Science (http://www.imstat.org/sts/) by the Institute of Mathematical Statistics (http://www.imstat.org

    Model-Free Variable Screening, Sparse Regression Analysis and Other Applications with Optimal Transformations

    Variable screening and variable selection methods play important roles in modeling high dimensional data. Variable screening is the process of filtering out irrelevant variables, with the aim of reducing the dimensionality from ultrahigh to high while retaining all important variables. Variable selection is the process of selecting a subset of relevant variables for use in model construction. The main theme of this thesis is to develop variable screening and variable selection methods for high dimensional data analysis. In particular, we will present two relevant methods for variable screening and selection under a unified framework based on optimal transformations. In the first part of the thesis, we develop a maximum correlation-based sure independence screening (MC-SIS) procedure to screen features in an ultrahigh-dimensional setting. We show that MC-SIS possesses the sure screening property without imposing model or distributional assumptions on the response and predictor variables. MC-SIS is a model-free method, in contrast with some other existing model-based sure independence screening methods in the literature. In the second part of the thesis, we develop a novel method called SParse Optimal Transformations (SPOT) to simultaneously select important variables and explore relationships between the response and predictor variables in high dimensional nonparametric regression analysis. Not only are the optimal transformations identified by SPOT interpretable, they can also be used for response prediction. We further show that SPOT achieves consistency in both variable selection and parameter estimation. Besides variable screening and selection, we also consider other applications with optimal transformations. In the third part of the thesis, we propose several dependence measures, for both univariate and multivariate random variables, based on maximum correlation and B-spline approximation.
B-spline based Maximum Correlation (BMC) and Trace BMC (T-BMC) are introduced to measure dependence between two univariate random variables. As extensions to BMC and T-BMC, Multivariate BMC (MBMC) and Trace Multivariate BMC (T-MBMC) are proposed to measure dependence between multivariate random variables. We give convergence rates for both BMC and T-BMC. Numerical simulations and real data applications are used to demonstrate the performance of the proposed methods. The results show that the proposed methods outperform other existing ones and can serve as effective tools in practice.

    Recent advances on the reduction and analysis of big and high-dimensional data

    In an era with remarkable advancements in computer engineering, computational algorithms, and mathematical modeling, data scientists are inevitably faced with the challenge of working with big and high-dimensional data. For many problems, data reduction is a necessary first step; such reduction allows for storage and portability of big data, and enables the computation of expensive downstream quantities. The next step then involves the analysis of big data -- the use of such data for modeling, inference, and prediction. This thesis presents new methods for big data reduction and analysis, with a focus on solving real-world problems in statistics, machine learning and engineering. Ph.D.

    Compact Formulations for Sparse Reconstruction in Fully and Partly Calibrated Sensor Arrays

    Sensor array processing is a classical field of signal processing which offers various applications in practice, such as direction of arrival estimation or signal reconstruction, as well as a rich theory, including numerous estimation methods and statistical bounds on the achievable estimation performance. A comparatively new field in signal processing is sparse signal reconstruction (SSR), which has attracted remarkable interest in the research community in recent years and similarly offers many fields of application. This thesis considers the application of SSR in fully calibrated sensor arrays as well as in partly calibrated sensor arrays. The main contributions are a novel SSR method for application in partly calibrated arrays as well as compact formulations for the SSR problem, where special emphasis is placed on exploiting specific structure in the signals as well as in the array topologies.

    Sparse Model Building From Genome-Wide Variation With Graphical Models

    High throughput sequencing and expression characterization have led to an explosion of phenotypic and genotypic molecular data underlying both experimental studies and outbred populations. We develop a novel class of algorithms to reconstruct sparse models among these molecular phenotypes (e.g. expression products) and genotypes (e.g. single nucleotide polymorphisms), via both a Bayesian hierarchical model, when the sample size is much smaller than the model dimension (i.e. p ≫ n), and the well characterized adaptive lasso algorithm. Specifically, we propose novel approaches to the problems of increasing power to detect additional loci in genome-wide association studies using our variational algorithm, efficiently learning directed cyclic graphs from expression and genotype data using the adaptive lasso, and constructing genome-wide undirected graphs among genotype, expression and downstream phenotype data using an extension of the variational feature selection algorithm. The Bayesian hierarchical model is derived for a parametric multiple regression model with a mixture prior of a point mass and normal distribution for each regression coefficient, and appropriate priors for the set of hyperparameters. When combined with a probabilistic consistency bound on the model dimension, this approach leads to very sparse solutions without the need for cross validation. We use a variational Bayes approximate inference approach in our algorithm, where we impose a complete factorization across all parameters for the approximate posterior distribution, and then minimize the Kullback-Leibler divergence between the approximate and true posterior distributions. Since the prior distribution is non-convex, we restart the algorithm many times to find multiple posterior modes, and combine information across all discovered modes in an approximate Bayesian model averaging framework, to reduce the variance of the posterior probability estimates.
We perform analysis of three major publicly available datasets: the HapMap 2 genotype and expression data collected on immortalized lymphoblastoid cell lines, the genome-wide gene expression and genetic marker data collected for a yeast intercross, and genome-wide gene expression, genetic marker, and downstream phenotypes related to weight in a mouse F2 intercross. Based on both simulations and data analysis, we show that our algorithms can outperform other state-of-the-art model selection procedures when including thousands to hundreds of thousands of genotypes and expression traits, in terms of aggressively controlling false discovery rate, and generating rich simultaneous statistical models.

    Statistical methods for the testing and estimation of linear dependence structures on paired high-dimensional data: application to genomic data

    This thesis provides novel methodology for statistical analysis of paired high-dimensional genomic data, with the aim to identify gene interactions specific to each group of samples as well as the gene connections that change between the two classes of observations. An example of such groups can be patients under two medical conditions, in which the estimation of gene interaction networks is relevant to biologists as part of discerning gene regulatory mechanisms that control a disease process like, for instance, cancer. We construct these interaction networks from data by considering the non-zero structure of correlation matrices, which measure linear dependence between random variables, and their inverse matrices, which are commonly known as precision matrices and determine linear conditional dependence instead. In this regard, we study three statistical problems related to the testing, single estimation and joint estimation of (conditional) dependence structures. Firstly, we develop hypothesis testing methods to assess the equality of two correlation matrices, and also two correlation sub-matrices, corresponding to two classes of samples, and hence the equality of the underlying gene interaction networks. We consider statistics based on the average of squares, maximum and sum of exceedances of sample correlations, which are suitable for both independent and paired observations. We derive the limiting distributions for the test statistics where possible and, for practical needs, we present a permuted samples based approach to find their corresponding non-parametric distributions. Cases where such hypothesis testing presents enough evidence against the null hypothesis of equality of two correlation matrices give rise to the problem of estimating two correlation (or precision) matrices.
However, before that we address the statistical problem of estimating conditional dependence between random variables in a single class of samples when data are high-dimensional, which is the second topic of the thesis. We study the graphical lasso method which employs an L1 penalized likelihood expression to estimate the precision matrix and its underlying non-zero graph structure. The lasso penalization term is given by the L1 norm of the precision matrix elements scaled by a regularization parameter, which determines the trade-off between sparsity of the graph and fit to the data, and its selection is our main focus of investigation. We propose several procedures to select the regularization parameter in the graphical lasso optimization problem that rely on network characteristics such as clustering or connectivity of the graph. Thirdly, we address the more general problem of estimating two precision matrices that are expected to be similar, when datasets are dependent, focusing on the particular case of paired observations. We propose a new method to estimate these precision matrices simultaneously, a weighted fused graphical lasso estimator. The analogous joint estimation method concerning two regression coefficient matrices, which we call weighted fused regression lasso, is also developed in this thesis under the same paired and high-dimensional setting. The two joint estimators maximize penalized marginal log likelihood functions, which encourage both sparsity and similarity in the estimated matrices, and which are solved using an alternating direction method of multipliers (ADMM) algorithm. Sparsity and similarity of the matrices are determined by two tuning parameters and we propose to choose them by controlling the corresponding average error rates related to the expected number of false positive edges in the estimated conditional dependence networks.
These testing and estimation methods are implemented within the R package ldstatsHD, and are applied to a comprehensive range of simulated data sets as well as to high-dimensional real case studies of genomic data. We employ testing approaches with the purpose of discovering pathway lists of genes that present significantly different correlation matrices on healthy and unhealthy (e.g., tumor) samples. In addition, we use hypothesis testing problems on correlation sub-matrices to reduce the number of genes for estimation. The proposed joint estimation methods are then considered to find gene interactions that are common between medical conditions as well as interactions that vary in the presence of unhealthy tissues.

    Selected topics in robotics for space exploration

    Papers and abstracts included represent both formal presentations and experimental demonstrations at the Workshop on Selected Topics in Robotics for Space Exploration which took place at NASA Langley Research Center, 17-18 March 1993. The workshop was cosponsored by the Guidance, Navigation, and Control Technical Committee of the NASA Langley Research Center and the Center for Intelligent Robotic Systems for Space Exploration (CIRSSE) at RPI, Troy, NY. Participation was from industry, government, and other universities with close ties to either Langley Research Center or to CIRSSE. The presentations were very broad in scope with attention given to space assembly, space exploration, flexible structure control, and telerobotics