Tree-guided group lasso for multi-response regression with structured sparsity, with an application to eQTL mapping
We consider the problem of estimating a sparse multi-response regression
function, with an application to expression quantitative trait locus (eQTL)
mapping, where the goal is to discover genetic variations that influence
gene-expression levels. In particular, we investigate a shrinkage technique
capable of capturing a given hierarchical structure over the responses, such as
a hierarchical clustering tree with leaf nodes for responses and internal nodes
for clusters of related responses at multiple granularities, and we seek to
leverage this structure to recover covariates relevant to each
hierarchically-defined cluster of responses. We propose a tree-guided group
lasso, or tree lasso, for estimating such structured sparsity under
multi-response regression by employing a novel penalty function constructed
from the tree. We describe a systematic weighting scheme for the overlapping
groups in the tree-penalty such that each regression coefficient is penalized
in a balanced manner despite the inhomogeneous multiplicity of group
memberships of the regression coefficients due to overlaps among groups. For
efficient optimization, we employ a smoothing proximal gradient method that was
originally developed for a general class of structured-sparsity-inducing
penalties. Using simulated and yeast data sets, we demonstrate that our method
shows superior performance in terms of both prediction error and recovery of
true sparsity patterns, compared to other methods for learning a
multivariate-response regression.
Comment: Published in the Annals of Applied Statistics (http://www.imstat.org/aoas/) at http://dx.doi.org/10.1214/12-AOAS549, by the Institute of Mathematical Statistics (http://www.imstat.org).
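In outline, the penalty attaches an overlapping group to every node of the response tree (the responses falling under that node) and sums node-weighted L2 norms of each covariate's coefficients over those groups. The NumPy sketch below evaluates such a penalty; the group list, node weights, and function name are illustrative placeholders, not the paper's weighting scheme or code.

```python
import numpy as np

def tree_lasso_penalty(B, groups, weights):
    """Tree-guided group-lasso penalty (illustrative sketch).

    B       : (p, K) coefficient matrix, p covariates x K responses.
    groups  : list of index arrays, one per tree node; each array holds
              the response indices (leaves) under that node.
    weights : non-negative node weights (the paper's balanced weighting
              scheme would supply these; here they are plain inputs).
    """
    penalty = 0.0
    for g, w in zip(groups, weights):
        # L2 norm of each covariate's coefficients within the group,
        # summed over covariates: an L1/L2 mixed norm per tree node.
        penalty += w * np.sum(np.linalg.norm(B[:, g], axis=1))
    return penalty

# Toy example: 3 responses clustered as ((0, 1), 2); leaf nodes plus
# internal nodes give the overlapping groups.
B = np.random.randn(5, 3)
groups = [np.array([0]), np.array([1]), np.array([2]),
          np.array([0, 1]), np.array([0, 1, 2])]
weights = [1.0, 1.0, 1.0, 0.5, 0.5]   # hypothetical weights
print(tree_lasso_penalty(B, groups, weights))
```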
Ultra-high Dimensional Multiple Output Learning With Simultaneous Orthogonal Matching Pursuit: A Sure Screening Approach
We propose a novel application of the Simultaneous Orthogonal Matching
Pursuit (S-OMP) procedure for sparsistent variable selection in ultra-high
dimensional multi-task regression problems. Screening of variables, as
introduced in \cite{fan08sis}, is an efficient and highly scalable way to
remove many irrelevant variables from the set of all variables, while retaining
all the relevant variables. S-OMP can be applied to problems with hundreds of
thousands of variables and once the number of variables is reduced to a
manageable size, a more computationally demanding procedure can be used to
identify the relevant variables for each of the regression outputs. To our
knowledge, this is the first attempt to utilize relatedness of multiple outputs
to perform fast screening of relevant variables. As our main theoretical
contribution, we prove that, asymptotically, S-OMP is guaranteed to reduce an
ultra-high number of variables to below the sample size without losing true
relevant variables. We also provide formal evidence that a modified Bayesian
information criterion (BIC) can be used to efficiently determine the number of
iterations in S-OMP. We further provide empirical evidence on the benefit of
variable selection using multiple regression outputs jointly, as opposed to
performing variable selection for each output separately. The finite sample
performance of S-OMP is demonstrated in extensive simulation studies and on a
genetic association mapping problem.
Keywords: Adaptive lasso; Greedy forward regression; Orthogonal matching pursuit; Multi-output regression; Multi-task learning; Simultaneous orthogonal matching pursuit; Sure screening; Variable selection
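In outline, each greedy step scores every candidate variable by pooling its correlations with the current residuals of all outputs, adds the top-scoring variable to the active set, and refits all outputs jointly by least squares. The NumPy sketch below illustrates that loop under simplifying assumptions (a fixed number of iterations instead of the modified-BIC stopping rule, and L2 pooling of the correlations); it is not the authors' reference implementation.

```python
import numpy as np

def s_omp(X, Y, n_iter):
    """Simultaneous Orthogonal Matching Pursuit (illustrative sketch).

    X : (n, p) design matrix shared by all outputs.
    Y : (n, K) matrix of K regression outputs.
    n_iter : number of greedy steps (chosen here by hand; the paper
             suggests a modified BIC for this choice).
    Returns the selected variable indices and refit coefficients.
    """
    n, p = X.shape
    active = []
    residual = Y.copy()
    for _ in range(n_iter):
        # Score each variable by its aggregate correlation with the
        # residuals of all outputs, so related outputs vote jointly.
        scores = np.linalg.norm(X.T @ residual, axis=1)
        scores[active] = -np.inf          # do not reselect
        active.append(int(np.argmax(scores)))
        # Refit all outputs by least squares on the active set.
        B_active, *_ = np.linalg.lstsq(X[:, active], Y, rcond=None)
        residual = Y - X[:, active] @ B_active
    B = np.zeros((p, Y.shape[1]))
    B[active, :] = B_active
    return active, B
```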
Integrating Document Clustering and Topic Modeling
Document clustering and topic modeling are two closely related tasks which
can mutually benefit each other. Topic modeling can project documents into a
topic space which facilitates effective document clustering. Cluster labels
discovered by document clustering can be incorporated into topic models to
extract local topics specific to each cluster and global topics shared by all
clusters. In this paper, we propose a multi-grain clustering topic model
(MGCTM) which integrates document clustering and topic modeling into a unified
framework and jointly performs the two tasks to achieve the overall best
performance. Our model tightly couples two components: a mixture component used
for discovering latent groups in document collection and a topic model
component used for mining multi-grain topics including local topics specific to
each cluster and global topics shared across clusters. We employ variational
inference to approximate the posterior of hidden variables and learn model
parameters. Experiments on two datasets demonstrate the effectiveness of our
model.
Comment: Appears in Proceedings of the Twenty-Ninth Conference on Uncertainty in Artificial Intelligence (UAI 2013).
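To make the coupling concrete, the sketch below samples documents from an MGCTM-style generative story: a latent cluster is drawn for each document, and every word comes either from one of that cluster's local topics or from a globally shared topic. The distributions, dimensions, and variable names are assumptions for illustration only, not the exact MGCTM specification or its variational updates.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions for illustration.
J, K_local, K_global, V, doc_len = 4, 5, 3, 1000, 50

# Topic-word distributions: J clusters x K_local local topics, plus
# K_global topics shared by all clusters.
local_topics = rng.dirichlet(np.ones(V), size=(J, K_local))
global_topics = rng.dirichlet(np.ones(V), size=K_global)
cluster_prior = rng.dirichlet(np.ones(J))

def generate_document():
    """Sample one document from an MGCTM-style generative story (sketch)."""
    c = rng.choice(J, p=cluster_prior)              # latent cluster (mixture component)
    theta_local = rng.dirichlet(np.ones(K_local))   # local topic proportions
    theta_global = rng.dirichlet(np.ones(K_global)) # global topic proportions
    pi = rng.beta(1.0, 1.0)                         # prob. of using a local topic
    words = []
    for _ in range(doc_len):
        if rng.random() < pi:
            z = rng.choice(K_local, p=theta_local)
            words.append(rng.choice(V, p=local_topics[c, z]))
        else:
            z = rng.choice(K_global, p=theta_global)
            words.append(rng.choice(V, p=global_topics[z]))
    return c, words
```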