Sparsity Oriented Importance Learning for High-dimensional Linear Regression
Model selection uncertainty is now well recognized to be non-negligible, so data
analysts should no longer be satisfied with the output of a single final model
from a model selection process, regardless of its sophistication. To improve
reliability and reproducibility in model choice, one constructive approach is
to make good use of a sound variable importance measure. Although interesting
importance measures are available and increasingly used in data analysis, they
have received little theoretical justification. In this paper, we propose a new
variable importance measure, sparsity oriented importance learning (SOIL), for
high-dimensional regression from a sparse linear modeling perspective by taking
into account the variable selection uncertainty via the use of a sensible model
weighting. The SOIL method is theoretically shown to have the
inclusion/exclusion property: when the model weights are properly concentrated
around the true model, the SOIL importance cleanly separates the variables in
the true model from the rest. In particular, even when the signal is weak, SOIL
rarely assigns variables outside the true model importance values significantly
higher than those of variables in the true model. Extensive simulations in
several illustrative settings, together with real-data examples and guided
simulations, show the desirable properties of the SOIL importance in contrast
to other importance measures.
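The SOIL measure itself is defined in the paper; the sketch below only illustrates the general idea of importance via model weighting, assuming AIC-type exponential weights over a user-supplied list of candidate sparse models. The function name and the specific weighting are illustrative choices, not the authors' exact construction:

```python
import numpy as np

def soil_style_importance(X, y, candidate_models):
    """Weighted inclusion importance over a list of candidate support sets.

    candidate_models: list of tuples of column indices, each a candidate
    sparse model. Returns one importance score per column of X.
    """
    n, p = X.shape
    ics = []
    for support in candidate_models:
        Xs = X[:, list(support)]
        beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)  # LS fit on the support
        rss = np.sum((y - Xs @ beta) ** 2)
        # AIC-type criterion: n * log(RSS / n) + 2 * |support|
        ics.append(n * np.log(rss / n) + 2 * len(support))
    ics = np.asarray(ics)
    # exponential model weights, normalized to sum to one
    w = np.exp(-(ics - ics.min()) / 2)
    w /= w.sum()
    # importance of variable j = total weight of candidate models containing j
    imp = np.zeros(p)
    for weight, support in zip(w, candidate_models):
        imp[list(support)] += weight
    return imp
```

Each importance score lies in [0, 1]: a variable included in every well-supported candidate model gets a score near one, while a variable appearing only in poorly fitting models gets a score near zero.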
Meta Clustering for Collaborative Learning
A growing number of learning scenarios involve a set of learners/analysts,
each equipped with a unique dataset and algorithm, who may collaborate with
each other to enhance their learning performance. From the perspective of a
particular learner, careless collaboration with task-irrelevant learners is
likely to incur modeling error. A crucial problem is to search for
the most appropriate collaborators so that their data and modeling resources
can be effectively leveraged. Motivated by this, we propose to study the
problem of `meta clustering', where the goal is to identify subsets of relevant
learners whose collaboration will improve the performance of each individual
learner. In particular, we study the scenario where each learner is performing
a supervised regression, and the meta clustering aims to categorize the
underlying supervised relations (between responses and predictors) instead of
the raw data. We propose a general method, Select-Exchange-Cluster (SEC), for
performing such clustering. Our method is computationally efficient
as it does not require each learner to exchange their raw data. We prove that
the SEC method can accurately cluster the learners into appropriate
collaboration sets according to their underlying regression functions.
Synthetic and real data examples show the desired performance and wide
applicability of SEC to a variety of learning tasks.
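SEC's exact select and exchange steps are specified in the paper; the sketch below only illustrates the core idea that makes such methods efficient and privacy-friendly: learners exchange fitted models rather than raw data, score every received model on their own data, and the resulting cross-loss matrix is used to group learners. The thresholding rule and helper names here are assumptions for illustration, not the SEC algorithm itself:

```python
import numpy as np

def cross_loss_matrix(datasets, models):
    """datasets: list of (X, y) pairs; models: list of fitted predictors f(X) -> yhat.
    Entry L[i, j] is learner i's local mean squared error under learner j's model."""
    m = len(datasets)
    L = np.zeros((m, m))
    for i, (X, y) in enumerate(datasets):
        for j, f in enumerate(models):
            L[i, j] = np.mean((y - f(X)) ** 2)  # only the fitted model crosses learners
    return L

def cluster_learners(L, tau):
    """Group learners i and j whenever both mutual cross losses fall below tau
    (transitive closure via union-find); returns one cluster label per learner."""
    m = L.shape[0]
    parent = list(range(m))
    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # path halving
            a = parent[a]
        return a
    for i in range(m):
        for j in range(i + 1, m):
            if L[i, j] < tau and L[j, i] < tau:
                parent[find(i)] = find(j)
    return [find(i) for i in range(m)]
```

Learners whose underlying regression functions agree predict well on each other's data, so their mutual cross losses stay near the noise level, while task-irrelevant pairs incur large losses and end up in different clusters.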
High-dimensional Variable Screening via Conditional Martingale Difference Divergence
Variable screening is a useful tool for dealing with ultrahigh-dimensional
data. When the response depends on some predictors marginally and on others
only jointly, existing methods such as conditional screening and iterative
screening suffer, respectively, from instability in the choice of the
conditioning set or from a heavy computational burden. In
this article, we propose a new independence measure, named conditional
martingale difference divergence (CMDH), that can be treated as either a
conditional or a marginal independence measure. Under regularity conditions, we
show that the sure screening property of CMDH holds for both marginally and
jointly active variables. Based on this measure, we propose a kernel-based,
model-free variable screening method that is efficient, flexible, and stable
in the presence of high correlation among predictors and heterogeneity of the
response. In addition, we provide a data-driven method for selecting the
conditioning set. Simulations and real-data applications demonstrate the
superior performance of the proposed method.
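CMDH itself is a kernel-based conditional measure defined in the article. As a simplified stand-in, the sketch below performs marginal screening with the unconditional sample martingale difference divergence, which quantifies conditional-mean dependence of the response on each predictor (so it also picks up nonlinear mean effects); CMDH adds conditioning on a selected set. The function names are illustrative:

```python
import numpy as np

def mdd_sq(x, y):
    """Sample martingale difference divergence MDD_n^2(y | x): zero when
    E[y | x] does not depend on x, positive under conditional-mean dependence."""
    n = len(x)
    d = np.abs(x[:, None] - x[None, :])   # pairwise distances among x values
    yc = y - y.mean()                     # centered response
    return -(d * np.outer(yc, yc)).sum() / n**2

def screen(X, y, d):
    """Rank predictors by the dependence measure and keep the top d indices."""
    scores = np.array([mdd_sq(X[:, j], y) for j in range(X.shape[1])])
    return np.argsort(scores)[::-1][:d]
```

Because the screener ranks a marginal dependence measure, its cost is linear in the number of predictors, which is what makes this family of methods feasible in ultrahigh dimensions; the conditional version replaces the marginal measure with one computed given the conditioning set.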