Optimality of Graphlet Screening in High Dimensional Variable Selection
Consider a linear regression model where the design matrix X has n rows and p
columns. We assume (a) p is much larger than n, (b) the coefficient vector beta
is sparse in the sense that only a small fraction of its coordinates are
nonzero, and (c) the Gram matrix G = X'X is sparse in the sense that each row
has relatively few large coordinates (the diagonal entries of G are normalized to 1).
The sparsity of G naturally induces the sparsity of the so-called graph of
strong dependence (GOSD). We find an interesting interplay between the signal
sparsity and the graph sparsity, which ensures that in a broad context the set
of true signals decomposes into many small-size components of the GOSD, where
different components are disconnected from one another.
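The graph construction described above can be sketched directly: threshold the off-diagonal entries of the normalized Gram matrix to obtain edges, then read off the connected components. This is a minimal illustration of the GOSD idea; the threshold value and function names are our own choices, not prescriptions from the paper.

```python
import numpy as np

def build_gosd(X, threshold=0.5):
    """Build the graph of strong dependence (GOSD) from a design matrix X.

    Nodes are the p variables; an edge joins variables i and j when the
    off-diagonal entry |G_ij| of the normalized Gram matrix exceeds the
    threshold (an illustrative cutoff, not a value from the paper).
    """
    # Normalize columns so that the diagonal of G = X'X equals 1.
    Xn = X / np.linalg.norm(X, axis=0)
    G = Xn.T @ Xn
    p = G.shape[0]
    # Keep strong off-diagonal dependencies only.
    return (np.abs(G) > threshold) & ~np.eye(p, dtype=bool)

def connected_components(adjacency):
    """Return the connected components of the GOSD via depth-first search."""
    p = adjacency.shape[0]
    seen = np.zeros(p, dtype=bool)
    components = []
    for start in range(p):
        if seen[start]:
            continue
        stack, comp = [start], []
        seen[start] = True
        while stack:
            v = stack.pop()
            comp.append(v)
            for u in np.flatnonzero(adjacency[v]):
                if not seen[u]:
                    seen[u] = True
                    stack.append(u)
        components.append(sorted(comp))
    return components
```

Under the abstract's assumptions, the true signals fall into small, mutually disconnected components of this graph, which is what lets screening proceed component by component.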
We propose Graphlet Screening (GS) as a new approach to variable selection;
it is a two-stage Screen and Clean method. The key methodological innovation
of GS is to use the GOSD to guide both the screening and the cleaning. Compared
to m-variate brute-force screening, which has a computational cost of p^m, GS
has a screening cost of only p (up to multi-log(p) factors).
We measure the performance of any variable selection procedure by the minimax
Hamming distance. We show that in a very broad class of situations, GS achieves
the optimal rate of convergence in terms of the Hamming distance. Somewhat
surprisingly, the well-known procedures of subset selection and the lasso are
rate non-optimal, even in very simple settings and even when their tuning
parameters are ideally set.
Easing Embedding Learning by Comprehensive Transcription of Heterogeneous Information Networks
Heterogeneous information networks (HINs) are ubiquitous in real-world
applications. Meanwhile, network embedding has emerged as a convenient
tool to mine and learn from networked data. As a result, it is of interest to
develop HIN embedding methods. However, the heterogeneity in HINs introduces
not only rich information but also potentially incompatible semantics, which
poses special challenges to embedding learning in HINs. With the intention to
preserve the rich yet potentially incompatible information in HIN embedding, we
propose to study the problem of comprehensive transcription of heterogeneous
information networks. The comprehensive transcription of HINs also provides an
easy-to-use approach to unleash the power of HINs, since it requires no
additional supervision, expertise, or feature engineering. To cope with the
challenges in the comprehensive transcription of HINs, we propose the HEER
algorithm, which embeds HINs via edge representations that are further coupled
with properly-learned heterogeneous metrics. To corroborate the efficacy of
HEER, we conducted experiments on two large-scale real-world datasets with an
edge reconstruction task and multiple case studies. Experiment results
demonstrate the effectiveness of the proposed HEER model and the utility of
edge representations and heterogeneous metrics. The code and data are available
at https://github.com/GentleZhu/HEER.
Comment: 10 pages. In Proceedings of the 24th ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining, London, United Kingdom,
ACM, 201
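The abstract's core idea, embedding a HIN through edge representations coupled with per-edge-type metrics, can be sketched as follows. Here an edge representation is formed from the coordinate-wise product of two node embeddings and scored by a type-specific metric vector; all names and the exact scoring form are our illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def edge_score(x_u, x_v, mu_r):
    """Score a typed edge (u, v) of type r.

    The edge representation is the coordinate-wise product of the node
    embeddings, weighted by a learned metric vector mu_r for edge type r.
    This sketch illustrates the general shape of such a model; it is not
    the HEER authors' code.
    """
    edge_rep = x_u * x_v           # edge representation from node embeddings
    return float(mu_r @ edge_rep)  # heterogeneous metric for edge type r

def edge_probability(x_u, x_v, mu_r):
    """Squash the score into (0, 1) with a logistic sigmoid."""
    return 1.0 / (1.0 + np.exp(-edge_score(x_u, x_v, mu_r)))
```

Because each edge type carries its own metric vector, incompatible semantics (say, "authored" versus "cited") can weight the same embedding dimensions differently, which is one way to preserve heterogeneous information without extra supervision or feature engineering.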