32 research outputs found
A Selective Review of Group Selection in High-Dimensional Models
Grouping structures arise naturally in many statistical modeling problems.
Several methods have been proposed for variable selection that respect grouping
structure in variables. Examples include the group LASSO and several concave
group selection methods. In this article, we give a selective review of group
selection concerning methodological developments, theoretical properties and
computational algorithms. We pay particular attention to group selection
methods involving concave penalties. We address both group selection and
bi-level selection methods. We describe several applications of these methods
in nonparametric additive models, semiparametric regression, seemingly
unrelated regressions, genomic data analysis and genome wide association
studies. We also highlight some issues that require further study.
Comment: Published at http://dx.doi.org/10.1214/12-STS392 in Statistical
Science (http://www.imstat.org/sts/) by the Institute of Mathematical
Statistics (http://www.imstat.org)
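As a rough illustration of how a group LASSO penalty selects whole groups of variables, the sketch below applies group soft-thresholding, the proximal step commonly associated with that penalty. It is not taken from the reviewed article: the function name, the list-of-index-lists group layout, and the square-root-of-group-size weighting are assumptions made for this example.

```python
import math


def group_soft_threshold(beta, groups, lam):
    """Shrink each group of coefficients toward zero as a block.

    beta   : list of floats, the current coefficients
    groups : list of index lists, one per group (hypothetical layout)
    lam    : non-negative threshold level
    """
    out = list(beta)
    for idx in groups:
        norm = math.sqrt(sum(beta[j] ** 2 for j in idx))
        # Weighting by sqrt(group size) is a common convention, assumed here.
        thresh = lam * math.sqrt(len(idx))
        # A group whose norm falls below the threshold is zeroed entirely.
        scale = 0.0 if norm <= thresh else 1.0 - thresh / norm
        for j in idx:
            out[j] = scale * beta[j]
    return out


# The first group is strong and survives; the second is weak and is removed.
shrunk = group_soft_threshold([3.0, 4.0, 0.1, -0.1], [[0, 1], [2, 3]], lam=1.0)
```

Coefficients in a group are either all kept (and shrunk by a common factor) or all set to zero, which is what makes the selection respect the grouping structure.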
Model-Free Variable Screening, Sparse Regression Analysis and Other Applications with Optimal Transformations
Variable screening and variable selection methods play important roles in modeling high dimensional data. Variable screening is the process of filtering out irrelevant variables, with the aim to reduce the dimensionality from ultrahigh to high while retaining all important variables. Variable selection is the process of selecting a subset of relevant variables for use in model construction. The main theme of this thesis is to develop variable screening and variable selection methods for high dimensional data analysis. In particular, we will present two relevant methods for variable screening and selection under a unified framework based on optimal transformations.
In the first part of the thesis, we develop a maximum correlation-based sure independence screening (MC-SIS) procedure to screen features in an ultrahigh-dimensional setting. We show that MC-SIS possesses the sure screening property without imposing model or distributional assumptions on the response and predictor variables. MC-SIS is a model-free method in contrast with some other existing model-based sure independence screening methods in the literature. In the second part of the thesis, we develop a novel method called SParse Optimal Transformations (SPOT) to simultaneously select important variables and explore relationships between the response and predictor variables in high dimensional nonparametric regression analysis. Not only are the optimal transformations identified by SPOT interpretable, they can also be used for response prediction. We further show that SPOT achieves consistency in both variable selection and parameter estimation.
Besides variable screening and selection, we also consider other applications with optimal transformations. In the third part of the thesis, we propose several dependence measures, for both univariate and multivariate random variables, based on maximum correlation and B-spline approximation. B-spline based Maximum Correlation (BMC) and Trace BMC (T-BMC) are introduced to measure dependence between two univariate random variables. As extensions to BMC and T-BMC, Multivariate BMC (MBMC) and Trace Multivariate BMC (T-MBMC) are proposed to measure dependence between multivariate random variables. We give convergence rates for both BMC and T-BMC.
Numerical simulations and real data applications are used to demonstrate the performance of the proposed methods. The results show that the proposed methods outperform other existing ones and can serve as effective tools in practice.
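The general idea of independence screening can be sketched as ranking predictors by a marginal dependence score and keeping the top few. The example below uses the absolute Pearson correlation as a simplified stand-in; it is not the MC-SIS procedure, which relies on maximum correlation under optimal transformations. The function names and the toy data are assumptions for illustration only.

```python
import math


def pearson(x, y):
    """Sample Pearson correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant column carries no marginal signal
    return sxy / math.sqrt(sxx * syy)


def screen(columns, y, keep):
    """Return indices of the `keep` predictors with the largest |correlation|."""
    scores = [abs(pearson(col, y)) for col in columns]
    ranked = sorted(range(len(columns)), key=lambda j: scores[j], reverse=True)
    return ranked[:keep]


# Toy data: three candidate predictors (stored column-wise) and one response.
y = [1, 2, 3, 4, 5]
columns = [[1, 2, 3, 4, 5], [2, 1, 2, 1, 2], [5, 3, 4, 1, 2]]
selected = screen(columns, y, keep=2)
```

A linear score such as Pearson correlation can miss nonlinear relationships, which is one motivation for model-free scores like the maximum correlation used in the thesis.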
Recent advances on the reduction and analysis of big and high-dimensional data
In an era with remarkable advancements in computer engineering, computational algorithms, and mathematical modeling, data scientists are inevitably faced with the challenge of working with big and high-dimensional data. For many problems, data reduction is a necessary first step; such reduction allows for storage and portability of big data, and enables the computation of expensive downstream quantities. The next step then involves the analysis of big data -- the use of such data for modeling, inference, and prediction. This thesis presents new methods for big data reduction and analysis, with a focus on solving real-world problems in statistics, machine learning and engineering. Ph.D.
Compact Formulations for Sparse Reconstruction in Fully and Partly Calibrated Sensor Arrays
Sensor array processing is a classical field of signal processing which offers various applications in practice, such as direction of arrival estimation or signal reconstruction, as well as a rich theory, including numerous estimation methods and statistical bounds on the achievable estimation performance. A comparably new field in signal processing is sparse signal reconstruction (SSR), which has attracted remarkable interest in the research community in recent years and similarly offers plentiful fields of application. This thesis considers the application of SSR in fully calibrated sensor arrays as well as in partly calibrated sensor arrays. The main contributions are a novel SSR method for application in partly calibrated arrays as well as compact formulations for the SSR problem, where special emphasis is placed on exploiting specific structure in the signals as well as in the array topologies.
Sparse Model Building From Genome-Wide Variation With Graphical Models
High throughput sequencing and expression characterization have led to an explosion of phenotypic and genotypic molecular data underlying both experimental studies and outbred populations. We develop a novel class of algorithms to reconstruct sparse models among these molecular phenotypes (e.g. expression products) and genotypes (e.g. single nucleotide polymorphisms), via both a Bayesian hierarchical model, when the sample size is much smaller than the model dimension (i.e. p ≫ n), and the well characterized adaptive lasso algorithm. Specifically, we propose novel approaches to the problems of increasing power to detect additional loci in genome-wide association studies using our variational algorithm, efficiently learning directed cyclic graphs from expression and genotype data using the adaptive lasso, and constructing genome-wide undirected graphs among genotype, expression and downstream phenotype data using an extension of the variational feature selection algorithm. The Bayesian hierarchical model is derived for a parametric multiple regression model with a mixture prior of a point mass and normal distribution for each regression coefficient, and appropriate priors for the set of hyperparameters. When combined with a probabilistic consistency bound on the model dimension, this approach leads to very sparse solutions without the need for cross validation. We use a variational Bayes approximate inference approach in our algorithm, where we impose a complete factorization across all parameters for the approximate posterior distribution, and then minimize the Kullback-Leibler divergence between the approximate and true posterior distributions. Since the prior distribution is non-convex, we restart the algorithm many times to find multiple posterior modes, and combine information across all discovered modes in an approximate Bayesian model averaging framework, to reduce the variance of the posterior probability estimates.
We perform analysis of three major publicly available datasets: the HapMap 2 genotype and expression data collected on immortalized lymphoblastoid cell lines, the genome-wide gene expression and genetic marker data collected for a yeast intercross, and genome-wide gene expression, genetic marker, and downstream phenotypes related to weight in a mouse F2 intercross. Based on both simulations and data analysis we show that our algorithms can outperform other state-of-the-art model selection procedures when including thousands to hundreds of thousands of genotypes and expression traits, in terms of aggressively controlling false discovery rate, and generating rich simultaneous statistical models.
Statistical methods for the testing and estimation of linear dependence structures on paired high-dimensional data: application to genomic data
This thesis provides novel methodology for statistical analysis of paired high-dimensional genomic
data, with the aim to identify gene interactions specific to each group of samples as well as the gene
connections that change between the two classes of observations. An example of such groups can
be patients under two medical conditions, in which the estimation of gene interaction networks is
relevant to biologists as part of discerning gene regulatory mechanisms that control a disease process
like, for instance, cancer. We construct these interaction networks from data by considering the non-zero
structure of correlation matrices, which measure linear dependence between random variables,
and their inverse matrices, which are commonly known as precision matrices and determine linear
conditional dependence instead. In this regard, we study three statistical problems related to the
testing, single estimation and joint estimation of (conditional) dependence structures.
Firstly, we develop hypothesis testing methods to assess the equality of two correlation matrices,
and also two correlation sub-matrices, corresponding to two classes of samples, and hence the equality
of the underlying gene interaction networks. We consider statistics based on the average of squares,
maximum and sum of exceedances of sample correlations, which are suitable for both independent
and paired observations. We derive the limiting distributions for the test statistics where possible
and, for practical needs, we present a permuted samples based approach to find their corresponding
non-parametric distributions.
Cases where such hypothesis testing presents enough evidence against the null hypothesis of
equality of two correlation matrices give rise to the problem of estimating two correlation (or precision)
matrices. However, before that we address the statistical problem of estimating conditional
dependence between random variables in a single class of samples when data are high-dimensional,
which is the second topic of the thesis. We study the graphical lasso method which employs an L1
penalized likelihood expression to estimate the precision matrix and its underlying non-zero graph
structure. The lasso penalization termis given by the L1 normof the precisionmatrix elements scaled
by a regularization parameter, which determines the trade-off between sparsity of the graph and fit
to the data, and its selection is our main focus of investigation. We propose several procedures to
select the regularization parameter in the graphical lasso optimization problem that rely on network
characteristics such as clustering or connectivity of the graph.
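For reference, the graphical lasso objective described above is usually written in the following standard form. The symbols are assumptions introduced here rather than the thesis's own notation: S denotes the sample covariance (or correlation) matrix, Θ the precision matrix, and λ the regularization parameter.

```latex
\hat{\Theta}_{\lambda}
  = \arg\max_{\Theta \succ 0}
    \Bigl\{ \log\det\Theta - \operatorname{tr}(S\Theta) - \lambda \,\lVert \Theta \rVert_{1} \Bigr\},
\qquad
\lVert \Theta \rVert_{1} = \sum_{i,j} \lvert \theta_{ij} \rvert .
```

Larger values of λ force more entries of the estimate to zero and so yield a sparser graph, while smaller values favour fit to the data. Some formulations penalize only the off-diagonal entries; which variant the thesis uses is not stated in the abstract.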
Thirdly, we address the more general problem of estimating two precision matrices that are
expected to be similar, when datasets are dependent, focusing on the particular case of paired
observations. We propose a new method to estimate these precision matrices simultaneously, a
weighted fused graphical lasso estimator. The analogous joint estimation method concerning two
regression coefficient matrices, which we call weighted fused regression lasso, is also developed in
this thesis under the same paired and high-dimensional setting. The two joint estimators maximize
penalized marginal log likelihood functions, which encourage both sparsity and similarity in the
estimated matrices and are solved using an alternating direction method of multipliers (ADMM)
algorithm. Sparsity and similarity of the matrices are determined by two tuning parameters and we
propose to choose them by controlling the corresponding average error rates related to the expected
number of false positive edges in the estimated conditional dependence networks.
These testing and estimation methods are implemented within the R package ldstatsHD, and are
applied to a comprehensive range of simulated data sets as well as to high-dimensional real case
studies of genomic data. We employ testing approaches with the purpose of discovering pathway
lists of genes that present significantly different correlation matrices on healthy and unhealthy (e.g.,
tumor) samples. Besides, we use hypothesis testing problems on correlation sub-matrices to reduce
the number of genes for estimation. The proposed joint estimation methods are then considered to
find gene interactions that are common between medical conditions as well as interactions that vary
in the presence of unhealthy tissues.
Selected topics in robotics for space exploration
Papers and abstracts included represent both formal presentations and experimental demonstrations at the Workshop on Selected Topics in Robotics for Space Exploration which took place at NASA Langley Research Center, 17-18 March 1993. The workshop was cosponsored by the Guidance, Navigation, and Control Technical Committee of the NASA Langley Research Center and the Center for Intelligent Robotic Systems for Space Exploration (CIRSSE) at RPI, Troy, NY. Participation was from industry, government, and other universities with close ties to either Langley Research Center or to CIRSSE. The presentations were very broad in scope with attention given to space assembly, space exploration, flexible structure control, and telerobotics