Theoretical analysis for convex and non-convex clustering algorithms
Clustering is one of the most important unsupervised learning problems in the machine learning and statistics communities. Given a set of observations, the goal is to find the latent cluster assignment of the data points. The observations can be either covariates associated with each data point or relational networks representing the affinity between pairs of nodes. We study the problem of community detection in stochastic block models and the problem of clustering mixture models. The two problems bear a strong resemblance, and similar techniques can be applied to solve both.
It is common practice to assume an underlying model for the data-generating process in order to analyze it properly. Given a pre-defined partition of the data points, generative models can be defined for both types of observations. For the covariates, the mixture model is one of the most flexible and widely used models: each cluster $i$ generates points from some distribution $D_i$, and the overall distribution is a convex combination $\sum_i \pi_i D_i$ with mixing weights $\pi_i \ge 0$, $\sum_i \pi_i = 1$. We assume the data are Gaussian or sub-Gaussian and analyze two algorithms: 1) the Expectation-Maximization (EM) algorithm, which is notoriously non-convex and sensitive to local optima, and 2) a convex relaxation of the k-means algorithm. We show that both methods are consistent under certain conditions when the signal-to-noise ratio is relatively high, and we obtain upper bounds on the error rate when the signal-to-noise ratio is low. When there are outliers in the data set, we show that the semi-definite relaxation is more robust than spectral methods.
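As a concrete reference point, the following is a minimal sketch of one standard semi-definite relaxation of k-means in the style of Peng and Wei; the function name, the affinity matrix $A = XX^\top$, and the use of cvxpy are illustrative assumptions, not necessarily the thesis's exact formulation.

```python
import numpy as np
import cvxpy as cp

def kmeans_sdp(X, k):
    """Sketch of the Peng-Wei SDP relaxation of k-means.

    The normalized cluster-membership matrix Z is relaxed to the convex
    set below; we then maximize its alignment with the affinity A = X X^T.
    """
    A = X @ X.T
    n = A.shape[0]
    Z = cp.Variable((n, n), PSD=True)
    constraints = [
        Z >= 0,                  # entrywise non-negative
        cp.sum(Z, axis=1) == 1,  # each row sums to one
        cp.trace(Z) == k,        # exactly k clusters
    ]
    prob = cp.Problem(cp.Maximize(cp.trace(A @ Z)), constraints)
    prob.solve()
    return Z.value
```

In practice the relaxed solution is rounded back to hard cluster assignments, for example by running k-means on the top-$k$ eigenvectors of the recovered $Z$.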
For the networks, we consider the Stochastic Block Model (SBM), in which the probability that an edge is present is fully determined by the cluster assignments of its two endpoints. We use a semi-definite programming (SDP) relaxation to learn the clustering matrix and discuss the role of the model parameters. Most SDP relaxations of the SBM require the number of communities as input, which is a strong requirement for many real-world applications. In this thesis, we propose a nuclear-norm regularization that is shown to exactly recover both the number of communities and the cluster memberships even when the number of communities is unknown.
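To make the idea concrete, here is a minimal sketch of a trace-penalized SDP of the kind described (for a PSD matrix the nuclear norm equals the trace). The constraint set and the recovery of the number of communities from trace(Z) are plausible choices consistent with the abstract, not necessarily the thesis's exact program.

```python
import numpy as np
import cvxpy as cp

def sbm_sdp_unknown_k(A, lam):
    """Trace-penalized SDP for SBM clustering with unknown k (sketch).

    For the normalized clustering matrix Z = sum_k (1/n_k) 1_{C_k} 1_{C_k}^T
    one has Z >= 0, Z 1 = 1, and trace(Z) = number of communities, so the
    penalty lam * trace(Z) (the nuclear norm, since Z is PSD) selects k
    automatically instead of requiring it as input.
    """
    n = A.shape[0]
    Z = cp.Variable((n, n), PSD=True)
    constraints = [Z >= 0, cp.sum(Z, axis=1) == 1]
    prob = cp.Problem(
        cp.Maximize(cp.trace(A @ Z) - lam * cp.trace(Z)), constraints
    )
    prob.solve()
    k_hat = int(round(np.trace(Z.value)))  # estimated number of communities
    return Z.value, k_hat
```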
In many real-world networks, both network structure and node covariates are observed simultaneously. For this case, we present a regularization-based method that effectively combines the two sources of information. The proposed method works especially well when the covariates and the network contain complementary information.
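One plausible instantiation of such a combination, reusing sbm_sdp_unknown_k from the sketch above: build a similarity kernel from the covariates and add it to the adjacency matrix with a tuning weight alpha before solving the same SDP. The Gaussian kernel and the additive combination are illustrative assumptions.

```python
def combined_sdp(A, X, alpha, lam, bandwidth=1.0):
    """Combine adjacency A with a covariate similarity kernel K (sketch)."""
    # pairwise squared distances between covariate rows
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = np.exp(-sq / (2.0 * bandwidth ** 2))  # Gaussian kernel (assumed choice)
    # alpha trades off network structure against covariate similarity
    return sbm_sdp_unknown_k(A + alpha * K, lam)
```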
Probabilistic Best Subset Selection via Gradient-Based Optimization
In high-dimensional statistics, variable selection is an optimization problem that aims to recover the latent sparse pattern from all possible covariate combinations. In this paper, we propose a novel optimization method to solve the exact $\ell_0$-regularized regression problem (a.k.a. best subset selection). We reformulate the optimization problem from a discrete space to a continuous one via probabilistic reparameterization. Within the framework of stochastic gradient descent, we propose a family of unbiased gradient estimators to optimize the $\ell_0$-regularized objective and a variational lower bound. Within this family, we identify the estimator with a non-vanishing signal-to-noise ratio and uniformly minimum variance. Theoretically, we study the general conditions under which the method is guaranteed to converge to the ground truth in expectation. In a wide variety of synthetic and semi-synthetic data sets, the proposed method outperforms existing variable selection methods based on penalized regression and mixed-integer optimization, in both sparse-pattern recovery and out-of-sample prediction. Our method can find the true regression model from thousands of covariates in a couple of seconds.
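A minimal sketch of the reparameterization idea follows: place a Bernoulli gate on each covariate, parameterize the inclusion probabilities by logits, and run stochastic gradient descent on the expected $\ell_0$-penalized loss. The plain score-function (REINFORCE) gradient below is a simpler stand-in for the paper's family of unbiased, variance-reduced estimators, and the fixed least-squares coefficients are a simplifying assumption (the actual method optimizes them jointly).

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def prob_subset_select(X, y, lam=1.0, lr=0.1, n_iters=2000, mc=4, seed=0):
    """Sketch: optimize E_{s ~ Bern(sigmoid(phi))}[loss(s)] over logits phi."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    phi = np.zeros(p)  # logits of the inclusion probabilities
    beta = np.linalg.lstsq(X, y, rcond=None)[0]  # fixed coefficients (simplification)
    for _ in range(n_iters):
        probs = sigmoid(phi)
        grad = np.zeros(p)
        for _ in range(mc):
            s = (rng.random(p) < probs).astype(float)   # sample a subset
            resid = y - X @ (s * beta)
            loss = resid @ resid / n + lam * s.sum()    # squared error + L0 penalty
            # score-function gradient: loss * d log P(s) / d phi = loss * (s - probs)
            grad += loss * (s - probs)
        phi -= lr * grad / mc
    return sigmoid(phi)  # estimated inclusion probabilities per covariate
```

Thresholding the returned probabilities (e.g., at 0.5) yields a hard sparse pattern; in the paper's setting the gates concentrate on the true support as the optimization converges.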