
    Pathwise coordinate optimization

    We consider "one-at-a-time" coordinate-wise descent algorithms for a class of convex optimization problems. An algorithm of this kind has been proposed for the $L_1$-penalized regression (lasso) in the literature, but it seems to have been largely ignored. Indeed, it seems that coordinate-wise algorithms are not often used in convex optimization. We show that this algorithm is very competitive with the well-known LARS (or homotopy) procedure in large lasso problems, and that it can be applied to related methods such as the garotte and elastic net. It turns out that coordinate-wise descent does not work in the "fused lasso," however, so we derive a generalized algorithm that yields the solution in much less time than a standard convex optimizer. Finally, we generalize the procedure to the two-dimensional fused lasso, and demonstrate its performance on some image smoothing problems.
    Comment: Published at http://dx.doi.org/10.1214/07-AOAS131 in the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org)
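    For the lasso, the coordinate-wise update reduces to a univariate soft-thresholding step applied cyclically to each coefficient. A minimal sketch in Python, assuming standardized predictors and a fixed penalty level; the function names and normalization are illustrative, not the paper's code:

        import numpy as np

        def soft_threshold(z, gamma):
            # S(z, gamma) = sign(z) * max(|z| - gamma, 0)
            return np.sign(z) * np.maximum(np.abs(z) - gamma, 0.0)

        def lasso_cd(X, y, lam, n_iter=100):
            # Cyclic coordinate descent for (1/2n)||y - X b||^2 + lam * ||b||_1.
            n, p = X.shape
            beta = np.zeros(p)
            resid = y.astype(float).copy()      # residual for beta = 0
            col_sq = (X ** 2).sum(axis=0) / n   # per-coordinate curvature
            for _ in range(n_iter):
                for j in range(p):
                    # Inner product with the partial residual that excludes x_j.
                    rho = X[:, j] @ resid / n + col_sq[j] * beta[j]
                    new_bj = soft_threshold(rho, lam) / col_sq[j]
                    resid += X[:, j] * (beta[j] - new_bj)  # keep residual in sync
                    beta[j] = new_bj
            return beta

    A pathwise strategy runs this solver over a decreasing grid of lam values, warm-starting each fit from the previous solution.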

    A robust penalized method for the analysis of noisy DNA copy number data

    Background: Deletions and amplifications of the human genomic DNA copy number are the causes of numerous diseases, such as various forms of cancer. Therefore, the detection of DNA copy number variations (CNV) is important in understanding the genetic basis of many diseases. Various techniques and platforms have been developed for genome-wide analysis of DNA copy number, such as array-based comparative genomic hybridization (aCGH) and high-resolution mapping with high-density tiling oligonucleotide arrays. Since complicated biological and experimental processes are often associated with these platforms, the data can be contaminated by outliers.
    Results: We propose a penalized LAD regression model with the adaptive fused lasso penalty for detecting CNV. This method has robustness properties and incorporates both the spatial dependence and sparsity of CNV into the analysis. Our simulation studies and real data analysis indicate that the proposed method can correctly detect the numbers and locations of the true breakpoints while appropriately controlling the false positives.
    Conclusions: The proposed method has three advantages for detecting CNV change points: it is robust; it incorporates both spatial dependence and sparsity; and it estimates the true values at each marker accurately.
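    A plausible form of the criterion this abstract describes, in generic notation: here $y_i$ is the observed intensity at marker $i$, $\beta_i$ the underlying copy number level, and the adaptive weights $w_i$, $v_i$ and tuning parameters $\lambda_1$, $\lambda_2$ are assumed notation, not necessarily the paper's:

        \min_{\beta \in \mathbb{R}^n}
          \sum_{i=1}^{n} \lvert y_i - \beta_i \rvert
          + \lambda_1 \sum_{i=1}^{n} w_i \lvert \beta_i \rvert
          + \lambda_2 \sum_{i=2}^{n} v_i \lvert \beta_i - \beta_{i-1} \rvert

    The LAD loss gives robustness to outlying intensities, the weighted $L_1$ term encodes sparsity of CNV, and the weighted total-variation term encodes spatial dependence along the chromosome.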

    Adaptive estimation with partially overlapping models

    In many problems, one has several models of interest that capture key parameters describing the distribution of the data. Partially overlapping models are taken as models in which at least one covariate effect is common across models. A priori knowledge of such structure enables efficient estimation of all model parameters. However, in practice, this structure may be unknown. We propose adaptive composite M-estimation (ACME) for partially overlapping models using a composite loss function, which is a linear combination of the loss functions defining the individual models. Penalization is applied to pairwise differences of parameters across models, resulting in data-driven identification of the overlap structure. Further penalization is imposed on the individual parameters, enabling sparse estimation in the regression setting. The recovery of the overlap structure enables more efficient parameter estimation. An oracle result is established. Simulation studies illustrate the advantages of ACME over existing methods that fit individual models separately or make strong a priori assumptions about the overlap structure.
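    A schematic of the doubly penalized composite criterion this abstract describes; the mixing weights $\alpha_k$, adaptive weights $w$, $u$, and tuning parameters $\lambda_1$, $\lambda_2$ are generic notation, not taken from the paper:

        \min_{\beta^{(1)},\dots,\beta^{(K)}}
          \sum_{k=1}^{K} \alpha_k \, L_k\bigl(\beta^{(k)}\bigr)
          + \lambda_1 \sum_{k<k'} \sum_{j} w_j^{(k,k')} \bigl\lvert \beta_j^{(k)} - \beta_j^{(k')} \bigr\rvert
          + \lambda_2 \sum_{k=1}^{K} \sum_{j} u_j^{(k)} \bigl\lvert \beta_j^{(k)} \bigr\rvert

    The first penalty fuses coefficients that agree across models (recovering the overlap), while the second shrinks individual coefficients toward zero (sparsity).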

    Pivotal estimation via square-root Lasso in nonparametric regression

    We propose a self-tuning $\sqrt{\mathrm{Lasso}}$ method that simultaneously resolves three important practical problems in high-dimensional regression analysis, namely it handles the unknown scale, heteroscedasticity and (drastic) non-Gaussianity of the noise. In addition, our analysis allows for badly behaved designs, for example, perfectly collinear regressors, and generates sharp bounds even in extreme cases, such as the infinite variance case and the noiseless case, in contrast to Lasso. We establish various nonasymptotic bounds for $\sqrt{\mathrm{Lasso}}$ including prediction norm rate and sparsity. Our analysis is based on new impact factors that are tailored for bounding prediction norm. In order to cover heteroscedastic non-Gaussian noise, we rely on moderate deviation theory for self-normalized sums to achieve Gaussian-like results under weak conditions. Moreover, we derive bounds on the performance of ordinary least squares (OLS) applied to the model selected by $\sqrt{\mathrm{Lasso}}$ accounting for possible misspecification of the selected model. Under mild conditions, the rate of convergence of OLS post $\sqrt{\mathrm{Lasso}}$ is as good as $\sqrt{\mathrm{Lasso}}$'s rate. As an application, we consider the use of $\sqrt{\mathrm{Lasso}}$ and OLS post $\sqrt{\mathrm{Lasso}}$ as estimators of nuisance parameters in a generic semiparametric problem (nonlinear moment condition or $Z$-problem), resulting in a construction of $\sqrt{n}$-consistent and asymptotically normal estimators of the main parameters.
    Comment: Published at http://dx.doi.org/10.1214/14-AOS1204 in the Annals of Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical Statistics (http://www.imstat.org)
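    The pivotal objective replaces the lasso's squared-error loss with its square root, so the optimal penalty level no longer depends on the unknown noise scale. A minimal sketch as a generic convex program, assuming the cvxpy package is available; this illustrates the estimator, not the authors' implementation:

        import numpy as np
        import cvxpy as cp

        def sqrt_lasso(X, y, lam):
            # min_b ||y - X b||_2 / sqrt(n) + lam * ||b||_1
            n, p = X.shape
            b = cp.Variable(p)
            obj = cp.norm(y - X @ b, 2) / np.sqrt(n) + lam * cp.norm1(b)
            cp.Problem(cp.Minimize(obj)).solve()
            return b.value

    An unpenalized OLS refit on the selected support then gives the "OLS post sqrt-Lasso" estimator discussed in the abstract.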

    Contributions to Penalized Estimation

    Penalized estimation is a useful statistical technique to prevent overfitting problems. In penalized methods, the common objective function is in the form of a loss function for goodness of fit plus a penalty function for complexity control. In this dissertation, we develop several new penalization approaches for various statistical models. These methods aim for effective model selection and accurate parameter estimation.

    The first part introduces the notion of partially overlapping models across multiple regression models on the same dataset. Such underlying models have at least one overlapping structure sharing the same parameter value. To recover the sparse and overlapping structure, we develop adaptive composite M-estimation (ACME) by doubly penalizing a composite loss function, a weighted linear combination of the individual loss functions. ACME automatically circumvents the model misspecification issues inherent in other composite-loss-based estimators.

    The second part proposes a new refit method and its applications in the regression setting through model combination: ensemble variable selection (EVS) and ensemble variable selection and estimation (EVE). The refit method estimates the regression parameters restricted to the covariates selected by a penalization method. EVS combines model selection decisions from multiple penalization methods and selects the optimal model via the refit and a model selection criterion. EVE considers a factorizable likelihood-based model whose full likelihood is the product of likelihood factors. EVE is shown to have asymptotic efficiency and computational efficiency.

    The third part studies a sparse undirected Gaussian graphical model (GGM) to explain conditional dependence patterns among variables. The edge set consists of conditionally dependent variable pairs and corresponds to the nonzero elements of the inverse covariance matrix under the Gaussian assumption. We propose a consistent validation method for edge selection (CoVES) in the penalization framework. CoVES selects candidate edge sets along the solution path and finds the optimal set via repeated subsampling. CoVES requires simple computation and delivers excellent performance in our numerical studies.

    Doctor of Philosophy
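    The refit idea described in the second part can be sketched in a few lines: select a support with a penalized fit, then re-estimate by unpenalized least squares on that support. A minimal illustration using scikit-learn, with the penalty level treated as given (tuning omitted); this is a generic sketch, not the dissertation's EVS/EVE procedure:

        import numpy as np
        from sklearn.linear_model import Lasso, LinearRegression

        def refit_lasso(X, y, alpha=0.1):
            # Step 1: variable selection by a penalized method (lasso here).
            support = np.flatnonzero(Lasso(alpha=alpha).fit(X, y).coef_)
            if support.size == 0:
                return support, np.array([])    # nothing selected
            # Step 2: unpenalized refit restricted to the selected covariates.
            ols = LinearRegression().fit(X[:, support], y)
            return support, ols.coef_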

    Statistical Methods for Gene-Environment Interactions

    Although significant main effects of genetic and environmental risk factors have been found, the interactions between them can play critical roles and have important implications in medical genetics and epidemiology. While many important gene-environment (G-E) interactions have been identified, the existing findings are still insufficient, and there is a strong need to develop statistical methods for analyzing G-E interactions. In this dissertation, we propose four statistical methodologies and computational algorithms for detecting G-E interactions and one application to imaging data. Extensive simulation studies are conducted in comparison with multiple advanced alternatives. In the analyses of The Cancer Genome Atlas datasets on multiple cancers, biologically meaningful findings are obtained.

    First, we develop two robust interaction analysis methods for prognostic outcomes. Compared to continuous and categorical outcomes, prognosis has been less investigated, with additional challenges brought by the unique characteristics of survival times. Most of the existing G-E interaction approaches for prognosis data share the limitation that they cannot accommodate long-tailed or contaminated outcomes. In the first method, we adopt censored quantile regression and partial correlation for survival outcomes. Under a marginal modeling framework, this proposed approach is robust to long-tailed prognosis and is computationally straightforward to apply. Furthermore, outliers and contaminations among predictors are observed in real data. In the second method, we propose a joint model using penalized trimmed regression that is robust to leverage points and vertical outliers. The proposed method respects the hierarchical structure of main effects and interactions and has an effective computational algorithm based on coordinate descent optimization and stability selection.

    Second, we propose a penalized approach to incorporate additional information for identifying important hierarchical interactions. Due to the high dimensionality and low signal levels, it is challenging to analyze interactions, so incorporating additional information is desirable. We adopt the minimax concave penalty for regularized estimation and the Laplacian quadratic penalty for additional information. Under a unified formulation, multiple types of additional information and genetic measurements can be effectively utilized, and improved identification accuracy can be achieved.

    Third, we develop a three-step procedure using multidimensional molecular data to identify G-E interactions. Recent studies have shown that collectively analyzing multiple types of molecular changes is not only biologically sensible but also leads to improved estimation and prediction. In this proposed method, we first estimate the relationship between gene expressions and their regulators by a multivariate penalized regression, and then identify regulatory modules via sparse biclustering. Next, we establish integrative covariates by principal components extracted from the identified regulatory modules. Finally, we construct a joint model for disease outcomes and employ Lasso-based penalization to select important main effects and hierarchical interactions. The proposed method expands the scope of interaction analysis to multidimensional molecular data.

    Last, we present an application using both marginal and joint models to analyze histopathological imaging-environment interactions. In cancer diagnosis, histopathological imaging has been routinely conducted and can be processed to generate high-dimensional features. To explore potential interactions, we conduct marginal and joint analyses, which have been extensively examined in the context of G-E interactions. This application extends the practical applicability of interaction analysis to imaging data and provides an alternative venue that combines histopathological imaging and environmental data in cancer modeling.

    Motivated by the important implications of G-E interactions and to overcome the limitations of existing methods, the goal of this dissertation is to advance methodological development for G-E interaction analysis and to provide practically useful tools for identifying important interactions. The proposed methods emerge from practical issues observed in real data and have solid statistical properties. With a balance between theory, computation, and data analysis, this dissertation provides four novel approaches for analyzing interactions to achieve more robust and accurate identification of biologically meaningful interactions.
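    The marginal framework mentioned above fits one small model per genetic feature and screens its interaction with the exposure. A hedged sketch of the generic (non-robust) version using statsmodels; the robust censored-quantile and trimmed-regression variants proposed in the dissertation would replace the OLS fit below:

        import numpy as np
        import statsmodels.api as sm

        def marginal_ge_scan(G, E, y):
            # G: (n, p) genetic features; E: (n,) exposure; y: (n,) outcome.
            pvals = []
            for j in range(G.shape[1]):
                design = sm.add_constant(
                    np.column_stack([G[:, j], E, G[:, j] * E]))
                fit = sm.OLS(y, design).fit()
                pvals.append(fit.pvalues[-1])  # p-value of the G x E product term
            return np.array(pvals)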

    Variable selection and predictive models in Big Data environments

    International Mention in the doctoral degree.
    In recent years, advances in data collection technologies have presented a difficult challenge by producing increasingly complex and larger datasets. Traditionally, statistical methodologies dealt with datasets where the number of variables did not exceed the number of observations; however, problems where the number of variables is larger than the number of observations have become more and more common, and can be seen in areas like economics, genetics, climate data, computer vision, etc. This problem has required the development of new methodologies suitable for a high-dimensional framework. Most statistical methodologies are limited to the study of averages. Least squares regression, principal component analysis, partial least squares... all these techniques provide mean-based estimations and are built around the key idea that the data are normally distributed. But this is an assumption that usually goes unverified in real datasets, where skewness and outliers can easily be found. The estimation of other metrics like the quantiles can help provide a more complete image of the data distribution. This thesis is built around these two core ideas: the development of more robust, quantile-based methodologies suitable for high-dimensional problems.

    The thesis is structured as a compendium of articles, divided into four chapters, where each chapter has independent content and structure but is nevertheless encompassed within the main objective of the thesis. First, Chapter 1 introduces basic concepts and results, assumed to be known or referenced in the rest of the thesis.

    A possible solution when dealing with high-dimensional problems in the field of regression is the usage of variable selection techniques. In this regard, the sparse group lasso (SGL) has proven to be a very effective alternative. However, the mathematical formulation of this estimator introduces some bias in the model, which means that the variables selected by the model may not be the truly significant ones. Chapter 2 studies the formulation of an adaptive sparse group lasso for quantile regression, a more flexible formulation that makes use of the adaptive idea, that is, the usage of adaptive weights in the penalization to help correct the bias, improving variable selection and prediction accuracy.

    An alternative solution to the high-dimensional problem is the usage of a dimension reduction technique like partial least squares. Partial least squares (PLS) is a methodology initially proposed in the field of chemometrics as an alternative to traditional least squares regression when the data is high dimensional or exhibits collinearity. It works by projecting the independent data matrix into a subspace of uncorrelated variables that maximize the covariance with the response matrix. However, being an iterative process based on least squares makes this methodology extremely sensitive to the presence of outliers or heteroscedasticity. Chapter 3 defines the fast partial quantile regression, a technique that performs a projection into a subspace where a quantile covariance metric is maximized, effectively extending partial least squares to the quantile regression framework.

    Another field where it is common to find high-dimensional data is functional data analysis, where the observations are functions measured along time instead of scalars. A key technique in this field is functional principal component analysis (FPCA), a methodology that provides an orthogonal set of basis functions that best explains the variability in the data. However, FPCA fails to capture shifts in the scale of the data affecting the quantiles. Chapter 4 introduces the functional quantile factor model, a methodology that extends the concept of FPCA to quantile regression, obtaining a model that can explain the quantiles of the data conditional on a set of common functions.

    In Chapter 5, asgl, a Python package that solves penalized least squares and quantile regression models in low- and high-dimensional frameworks, is introduced, filling a gap in the currently available implementations of these models. Finally, Chapter 6 presents the final conclusions of this thesis, including possible lines of research and future work.

    I want to acknowledge the financial support received from the research grants and projects PIPF UC3M, ECO2015-66593-P (Ministerio de Economía y Competitividad, Spain) and PID2020-113961GB-I00 (Agencia Estatal de Investigación, Spain).
    Doctoral Program in Mathematical Engineering at Universidad Carlos III de Madrid.
    Committee: President: María Luz Durban Reguera; Secretary: María Ángeles Gil Álvarez; Member: Ying Wei.
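    The core estimator in Chapter 2 combines the quantile (pinball) loss with sparsity-inducing penalties. A minimal sketch of the lasso-penalized special case as a convex program, assuming the cvxpy package is available; adaptive and group weights are omitted, and this is not the asgl package's API:

        import numpy as np
        import cvxpy as cp

        def quantile_lasso(X, y, tau=0.5, lam=0.1):
            # min_b (1/n) * sum_i rho_tau(y_i - x_i'b) + lam * ||b||_1,
            # where rho_tau(r) = max(tau * r, (tau - 1) * r) is the pinball loss.
            n, p = X.shape
            b = cp.Variable(p)
            r = y - X @ b
            pinball = cp.sum(cp.maximum(tau * r, (tau - 1) * r)) / n
            cp.Problem(cp.Minimize(pinball + lam * cp.norm1(b))).solve()
            return b.value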