123 research outputs found
Smoothing ADMM for Sparse-Penalized Quantile Regression with Non-Convex Penalties
This paper investigates quantile regression in the presence of non-convex and
non-smooth sparse penalties, such as the minimax concave penalty (MCP) and
smoothly clipped absolute deviation (SCAD). The non-smooth and non-convex
nature of these problems often leads to convergence difficulties for many
algorithms. While iterative techniques like coordinate descent and local linear
approximation can facilitate convergence, the process is often slow. This
sluggish pace is primarily due to the need to run these approximation
techniques until full convergence at each step, a requirement we term a
"secondary convergence iteration". To accelerate convergence, we
employ the alternating direction method of multipliers (ADMM) and introduce a
novel single-loop smoothing ADMM algorithm with an increasing penalty
parameter, named SIAD, specifically tailored for sparse-penalized quantile
regression. We first delve into the convergence properties of the proposed SIAD
algorithm and establish the necessary conditions for convergence.
Theoretically, we establish a convergence rate
for the sub-gradient bound of the augmented Lagrangian. Subsequently, we provide
numerical results to showcase the effectiveness of the SIAD algorithm. Our
findings highlight that the SIAD method outperforms existing approaches,
providing a faster and more stable solution for sparse-penalized quantile
regression.
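As a rough illustration of the objective this abstract refers to (not the SIAD algorithm itself), the sketch below evaluates the quantile check loss together with the MCP and SCAD penalties in their standard textbook forms. The function names, the averaging of the loss over observations, and the default `a=3.7` for SCAD are assumptions made for illustration and are not taken from the paper.

```python
def check_loss(u, tau):
    """Quantile (pinball) loss: rho_tau(u) = u * (tau - 1[u < 0])."""
    return u * (tau - (1.0 if u < 0 else 0.0))


def mcp(beta, lam, gamma):
    """Minimax concave penalty, standard form (requires gamma > 1)."""
    b = abs(beta)
    if b <= gamma * lam:
        return lam * b - b * b / (2.0 * gamma)
    return 0.5 * gamma * lam * lam  # flat beyond gamma * lam


def scad(beta, lam, a=3.7):
    """Smoothly clipped absolute deviation penalty, standard form (a > 2)."""
    b = abs(beta)
    if b <= lam:
        return lam * b
    if b <= a * lam:
        return (2.0 * a * lam * b - b * b - lam * lam) / (2.0 * (a - 1.0))
    return lam * lam * (a + 1.0) / 2.0  # flat beyond a * lam


def penalized_objective(X, y, beta, tau, lam, gamma):
    """Average check loss of the residuals plus an MCP penalty on each coefficient.

    Hypothetical helper: the 1/n scaling of the loss is an assumption.
    """
    residuals = [yi - sum(xij * bj for xij, bj in zip(xi, beta))
                 for xi, yi in zip(X, y)]
    fit = sum(check_loss(r, tau) for r in residuals) / len(y)
    return fit + sum(mcp(bj, lam, gamma) for bj in beta)
```

Both penalties grow like `lam * |beta|` near zero and become constant for large coefficients, which is the source of the non-convexity the abstract mentions; the kink of the check loss at zero is the non-smooth part.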
Bayesian Model Selection in Complex Linear Systems, as Illustrated in Genetic Association Studies
Motivated by examples from genetic association studies, this paper considers
the model selection problem in a general complex linear model system and in a
Bayesian framework. We discuss formulating model selection problems and
incorporating context-dependent a priori information through different
levels of prior specifications. We also derive analytic Bayes factors and their
approximations to facilitate model selection and discuss their theoretical and
computational properties. We demonstrate our Bayesian approach based on an
implemented Markov Chain Monte Carlo (MCMC) algorithm in simulations and a real
data application of mapping tissue-specific eQTLs. Our novel results on Bayes
factors provide a general framework to perform efficient model comparisons in
complex linear model systems.
Efficient inference for genetic association studies with multiple outcomes
Combined inference for heterogeneous high-dimensional data is critical in
modern biology, where clinical and various kinds of molecular data may be
available from a single study. Classical genetic association studies regress a
single clinical outcome on many genetic variants one by one, but there is an
increasing demand for joint analysis of many molecular outcomes and genetic
variants in order to unravel functional interactions. Unfortunately, most
existing approaches to joint modelling are either too simplistic to be powerful
or are impracticable for computational reasons. Inspired by Richardson et al.
(2010, Bayesian Statistics 9), we consider a sparse multivariate regression
model that allows simultaneous selection of predictors and associated
responses. As Markov chain Monte Carlo (MCMC) inference on such models can be
prohibitively slow when the number of genetic variants exceeds a few thousand,
we propose a variational inference approach which produces posterior
information very close to that of MCMC inference, at a much reduced
computational cost. Extensive numerical experiments show that our approach
outperforms popular variable selection methods and tailored Bayesian
procedures, dealing within hours with problems involving hundreds of thousands
of genetic variants and tens to hundreds of clinical or molecular outcomes.
Scalable Feature Selection Applications for Genome-Wide Association Studies of Complex Diseases
Personalized medicine will revolutionize our capabilities to combat disease. Working toward this goal, a fundamental task is the deciphering of genetic variants that are predictive of complex diseases. Modern studies, in the form of genome-wide association studies (GWAS), have afforded researchers the opportunity to reveal new genotype-phenotype relationships through the extensive scanning of genetic variants. These studies typically contain over half a million genetic features for thousands of individuals. Examining such data with methods other than univariate statistics is a challenging task requiring advanced algorithms that are scalable to the genome-wide level. In the future, next-generation sequencing (NGS) studies will contain an even larger number of common and rare variants.
Machine learning-based feature selection algorithms have been shown to effectively create predictive models for various genotype-phenotype relationships. This work explores the problem of selecting genetic variant subsets that are the most predictive of complex disease phenotypes through various feature selection methodologies, including filter, wrapper and embedded algorithms. The examined machine learning algorithms were demonstrated not only to be effective at predicting the disease phenotypes, but also to do so efficiently through the use of computational shortcuts. While much of the work could be run on high-end desktops, some of it was further extended to run on parallel computers, helping to ensure that the methods will also scale to NGS data sets.
Further, these studies analyzed the relationships between various feature selection methods and demonstrated the need for careful testing when selecting an algorithm. It was shown that there is no universally optimal algorithm for variant selection in GWAS; rather, methodologies need to be selected based on the desired outcome, such as the number of features to be included in the prediction model. It was also demonstrated that without proper model validation, for example nested cross-validation, the models can yield overly optimistic prediction accuracies and decreased generalization ability. It is through the implementation and application of machine learning methods that one can extract predictive genotype–phenotype relationships and biological insights from genetic data sets.
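To make the nested cross-validation point concrete, here is a minimal sketch in plain Python. It assumes a toy setup that is not from the thesis: a filter that ranks features by the difference in class means, a nearest-centroid classifier, and synthetic data with one informative feature. All function names are hypothetical. The key property is that the number of selected features `m` is tuned only on inner folds, so each outer test fold is never seen during selection or tuning.

```python
import random
from statistics import mean


def k_fold_indices(n, k, rng):
    """Shuffle 0..n-1 and deal the indices into k disjoint folds."""
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]


def top_features(X, y, rows, m):
    """Filter method: rank features by |mean(class 1) - mean(class 0)| on `rows`."""
    def score(j):
        pos = [X[i][j] for i in rows if y[i] == 1]
        neg = [X[i][j] for i in rows if y[i] == 0]
        return abs(mean(pos) - mean(neg)) if pos and neg else 0.0
    return sorted(range(len(X[0])), key=score, reverse=True)[:m]


def accuracy(X, y, train, test, m):
    """Select m features and fit class centroids on `train` only; score on `test`."""
    feats = top_features(X, y, train, m)
    centroids = {}
    for c in (0, 1):
        members = [i for i in train if y[i] == c]
        if members:
            centroids[c] = [mean(X[i][j] for i in members) for j in feats]

    def predict(i):
        # Assign the class whose centroid is closest in squared distance.
        return min(centroids, key=lambda c: sum(
            (X[i][j] - mu) ** 2 for j, mu in zip(feats, centroids[c])))

    return sum(predict(i) == y[i] for i in test) / len(test)


def nested_cv(X, y, candidate_m, outer_k=3, inner_k=3, seed=0):
    """Outer folds estimate accuracy; inner folds choose m on training data only."""
    rng = random.Random(seed)
    n = len(y)
    outer_scores = []
    for test in k_fold_indices(n, outer_k, rng):
        held_out = set(test)
        train = [i for i in range(n) if i not in held_out]

        def inner_score(m):
            scores = []
            for fold in k_fold_indices(len(train), inner_k, rng):
                val = [train[p] for p in fold]
                val_set = set(val)
                inner_train = [i for i in train if i not in val_set]
                scores.append(accuracy(X, y, inner_train, val, m))
            return mean(scores)

        best_m = max(candidate_m, key=inner_score)
        outer_scores.append(accuracy(X, y, train, test, best_m))
    return mean(outer_scores)


# Synthetic data (an assumption): feature 0 separates the classes, the rest is noise.
data_rng = random.Random(1)
y = [i % 2 for i in range(40)]
X = [[10 * y[i] + data_rng.uniform(-1, 1)]
     + [data_rng.uniform(-1, 1) for _ in range(9)] for i in range(40)]
score = nested_cv(X, y, candidate_m=[1, 5])
```

Running feature selection once on the full data set and then cross-validating only the classifier would leak information from the test folds into the selection step, which is the kind of optimistic bias the abstract warns about.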
Advanced Methods for Discovering Genetic Markers Associated with High Dimensional Imaging Data
Imaging genetic studies have been widely applied to discover genetic factors of inherited neuropsychiatric diseases. Despite the notable contribution of genome-wide association studies (GWAS) in neuroimaging research, it has always been difficult to efficiently perform association analysis on imaging phenotypes. There are several challenges arising from this topic, such as the large dimensionality of imaging data and genetic data, the potential spatial dependency of imaging phenotypes and the computational burden of the GWAS problem. All the aforementioned issues motivate us to investigate new statistical methods in neuroimaging genetic analysis. In the first project, we develop a hierarchical functional principal regression model (HFPRM) to simultaneously study diffusion tensor bundle statistics on multiple fiber tracts. Theoretically, the asymptotic distribution of the global test statistic on the common factors has been studied. Simulations are conducted to evaluate the finite sample performance of HFPRM. Finally, we apply our method to a GWAS of a neonate population to explore important genetic architecture in early human brain development. In the second project, we consider an association test between functional data acquired on a single curve and scalar variables in a varying coefficient model. We propose a functional projection regression model and an associated global test statistic to aggregate weak signals across the domain of functional data. Theoretically, we examine the asymptotic distribution of the global test statistic and provide a strategy to adaptively select the tuning parameter. Simulation experiments show that the proposed test outperforms existing state-of-the-art methods in functional statistical inference. We also apply the proposed method to a GWAS in the UK Biobank dataset. 
In the third project, we introduce an adaptive projection regression model (APRM) to perform statistical inference on high dimensional imaging responses in the presence of high correlations. Dimension reduction of the phenotypes is achieved through a linear projection regression model. We also implement an adaptive inference procedure to detect signals at multiple levels. Numerical simulations demonstrate that APRM outperforms many state-of-the-art methods in high dimensional inference. Finally, we apply APRM to a GWAS of volumetric data on 93 regions of interest in the Alzheimer's Disease Neuroimaging Initiative (ADNI) dataset.
- …