13 research outputs found

    Some new approaches in high-dimensional variable selection and regression

    No full text
    Variable selection and estimation for high-dimensional data have become a topic of foremost importance in modern statistics. This is largely driven by the need to analyze massive data sets arising from recent technological advancements (Fan and Li, 2006; Yu, 2007). For instance, bioengineering innovations have presented new statistical challenges by introducing functional MRI and gene microarray data. In many of these applications, we wish to achieve better prediction accuracy and easier interpretability by reducing the number of variables to obtain a parsimonious or sparse model. In this thesis, we study and propose several new methodologies for regression and variable selection under high dimensionality. In the first part of the thesis, we propose the weighted fusion, a new penalized regression and variable selection method for data with correlated variables. The weighted fusion can incorporate information redundancy among correlated variables for estimation and variable selection. It is also useful when the number of predictors p is larger than the number of observations n, where it allows the selection of more than n variables in a motivated way. Real data and simulation examples show that weighted fusion can improve variable selection and prediction accuracy. In the second part of the thesis, we propose the covariance-thresholded lasso, which combines covariance regularization with penalized regression to allow better variable selection and prediction accuracy for high-dimensional data. The covariance-thresholded lasso mitigates the excessive variability and rank deficiency of the sample covariance matrix used in the lasso by exploiting covariance sparsity. In high dimensions, where many predictors are independent or weakly correlated, covariance sparsity is a natural assumption. Real data and simulation examples indicate that the method can be very useful in improving performance. In the third part of the thesis, we propose the ridge-lasso hybrid estimator (ridle), a new penalized regression method that simultaneously estimates coefficients of mandatory predictors while allowing selection for others. The ridle is useful when some predictors are known to be significant from prior knowledge or must be kept for additional analysis. Further, we propose the adaptive ridle, for use when good initial estimates are available. Through theoretical studies, we show that the ridle and adaptive ridle can improve variable selection for regression with mandatory variables.
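The covariance-thresholding idea in the second part can be illustrated with a small sketch. It assumes hard thresholding of the off-diagonal entries of the sample covariance matrix, which is only one possible thresholding rule; the abstract does not say which rule the thesis uses, and the function names and the threshold `nu` are hypothetical. The sketch shows the regularization step only, not the lasso fit that would follow it.

```python
def sample_covariance(rows):
    """Sample covariance matrix of the columns of `rows` (a list of equal-length lists)."""
    n = len(rows)
    p = len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    return [
        [
            sum((r[j] - means[j]) * (r[k] - means[k]) for r in rows) / (n - 1)
            for k in range(p)
        ]
        for j in range(p)
    ]


def hard_threshold(cov, nu):
    """Set off-diagonal entries with absolute value below `nu` to zero.

    Diagonal entries (the variances) are always kept.
    Assumption: hard thresholding; other rules (e.g. soft thresholding) are possible.
    """
    p = len(cov)
    return [
        [cov[j][k] if j == k or abs(cov[j][k]) >= nu else 0.0 for k in range(p)]
        for j in range(p)
    ]


# Toy data: columns 0 and 1 are strongly correlated, column 2 only weakly.
rows = [[1, 2, 1], [2, 4, 0], [3, 6, 1], [4, 8, 0]]
sparse_cov = hard_threshold(sample_covariance(rows), nu=0.5)
```

With this toy data the weak covariance between columns 0 and 2 is zeroed, while the strong covariance between columns 0 and 1 survives.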

    Rare Variants Association Analysis in Large-Scale Sequencing Studies at the Single Locus Level

    No full text
    Genetic association analyses of rare variants in next-generation sequencing (NGS) studies are fundamentally challenging due to the presence of a very large number of candidate variants at extremely low minor allele frequencies. Recent developments often focus on pooling multiple variants to provide association analysis at the gene level instead of the locus level. Nonetheless, pinpointing individual variants is a critical goal for genomic research, as such information can facilitate the precise delineation of molecular mechanisms and functions of genetic factors in diseases. Due to the extreme rarity of mutations and high dimensionality, the significance of causal variants does not easily stand out from that of noncausal ones. Consequently, standard false-positive control procedures, such as Bonferroni and false discovery rate (FDR) control, are often impractical to apply, as a majority of the causal variants can only be identified along with a small but unknown number of noncausal variants. To provide informative analysis of individual variants in large-scale sequencing studies, we propose the Adaptive False-Negative Control (AFNC) procedure, which can include a large proportion of causal variants with high confidence by introducing a novel statistical inquiry to determine those variants that can be confidently dispatched as noncausal. The AFNC provides a general framework that can accommodate a variety of models and significance tests. The procedure is computationally efficient and can adapt to the underlying proportion of causal variants and the quality of significance rankings. Extensive simulation studies across a wide range of scenarios demonstrate that the AFNC is advantageous for identifying individual rare variants, whereas Bonferroni and FDR are exceedingly over-conservative for rare-variant association studies. In the analyses of the CoLaus dataset, AFNC has identified the individual variants most responsible for gene-level significance. Moreover, single-variant results using the AFNC have been successfully applied to infer related genes with annotation information.
    National Institutes of Health [P01 CA142538]. Open Access Journal. This item from the UA Faculty Publications collection is made available by the University of Arizona with support from the University of Arizona Libraries.
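For orientation, the two false-positive control procedures that the abstract compares against can be sketched as follows. This is a minimal illustration, not the AFNC procedure itself. It assumes the Benjamini-Hochberg step-up rule for FDR control, since the abstract does not name a specific FDR procedure. The function names and the example p-values are hypothetical.

```python
def bonferroni(pvalues, alpha=0.05):
    """Indices of tests rejected under Bonferroni control at level alpha."""
    m = len(pvalues)
    return [i for i, p in enumerate(pvalues) if p <= alpha / m]


def benjamini_hochberg(pvalues, alpha=0.05):
    """Indices of tests rejected by the Benjamini-Hochberg step-up rule.

    Finds the largest rank k with p_(k) <= k * alpha / m and rejects
    the k smallest p-values.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= rank * alpha / m:
            k = rank
    return sorted(order[:k])


# Made-up p-values for ten variants, for illustration only.
pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
bonf_selected = bonferroni(pvals)
bh_selected = benjamini_hochberg(pvals)
```

On these made-up values Bonferroni selects only the first variant and Benjamini-Hochberg selects the first two, which shows how both rules can leave most moderately significant variants unselected.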

    Illustrations of regions of statistical inference for GWA and NGS studies.

    No full text
    The Signals (“S”), Indistinguishable (“I”), and Noise (“N”) regions are shown. False-positive control allows the selection of variants in the Signals region, whereas false-negative control selects from both the Signals and Indistinguishable regions. In NGS studies with rare variants, the Signals region often degenerates due to extremely low MAF and high dimensionality.

    Annotation of AFNC-selected non-synonymous and splice-site variants in the analysis of CoLaus data.

    No full text

    Annotation of AFNC-selected variants of candidate genes in the analysis of CoLaus data.

    No full text

    Comparisons across varying effect sizes and numbers of variants at s = 50.

    No full text
    Performance of AFNC, FDR, and Bonferroni is evaluated in terms of sensitivity, specificity, and g-measure. Results are shown for s = 50 causal variants, with C ≠ 0 and n = 2000 samples.
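The three evaluation measures named in this caption can be computed as in the sketch below. It assumes that the g-measure is the geometric mean of sensitivity and specificity, which the caption itself does not define. The function name and the example sets are hypothetical.

```python
import math


def selection_metrics(selected, causal, total):
    """Return (sensitivity, specificity, g_measure) for a set of selected variants.

    selected: indices of variants chosen by a procedure
    causal:   indices of truly causal variants
    total:    total number of variants tested
    Assumption: g-measure = sqrt(sensitivity * specificity).
    """
    selected, causal = set(selected), set(causal)
    tp = len(selected & causal)   # causal variants that were selected
    fp = len(selected - causal)   # noncausal variants that were selected
    fn = len(causal - selected)   # causal variants that were missed
    tn = total - tp - fp - fn     # noncausal variants correctly left out
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return sensitivity, specificity, math.sqrt(sensitivity * specificity)


# Toy example: 20 variants, 4 causal, 4 selected (3 of them causal).
sens, spec, g = selection_metrics({0, 1, 2, 10}, {0, 1, 2, 3}, total=20)
```

Under this definition, a procedure that selects almost nothing scores high specificity but low sensitivity, so its g-measure stays low.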

    Number of variants selected in the analysis of CoLaus data at different control levels.

    No full text

    Comparisons across varying sample sizes and numbers of causal variants at C = 0.5.

    No full text
    Performance of AFNC, FDR, and Bonferroni is evaluated in terms of sensitivity, specificity, and g-measure. Results are shown for the effect-size multiplier C = 0.5 and d = 100,000 variants.

    Classifications of variants under multiple testing control.

    No full text