Theoretical analysis for convex and non-convex clustering algorithms
Clustering is one of the most important unsupervised learning problems in the machine learning and statistics communities. Given a set of observations, the goal is to find the latent cluster assignment of the data points. The observations can be either covariates associated with each data point or relational networks representing the affinity between pairs of nodes. We study the problem of community detection in stochastic block models and the problem of clustering mixture models. The two problems bear a strong resemblance, and similar techniques can be applied to solve both.
It is common practice to assume an underlying model for the data-generating process in order to analyze it properly. Given a pre-defined partition of the data points, generative models can be defined for both types of observations. For the covariates, the mixture model is one of the most flexible and widely used models: each cluster $i$ generates points from some distribution $D_i$, and the overall distribution is a convex combination $\sum_i \pi_i D_i$ with mixing weights $\pi_i \ge 0$, $\sum_i \pi_i = 1$. We assume the data are Gaussian or sub-Gaussian and analyze two algorithms: 1) the Expectation-Maximization (EM) algorithm, which is notoriously non-convex and sensitive to local optima, and 2) a convex relaxation of the k-means algorithm. We show that both methods are consistent under certain conditions when the signal-to-noise ratio is relatively high, and we obtain upper bounds on the error rate when the signal-to-noise ratio is low. When there are outliers in the data set, we show that the semi-definite relaxation is more robust than spectral methods.
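As a concrete reference point, the following is a minimal sketch of one standard semi-definite relaxation of k-means in the style of Peng and Wei; the function name, the affinity matrix $A = XX^\top$, and the use of cvxpy are illustrative assumptions, not necessarily the thesis's exact formulation.

```python
import numpy as np
import cvxpy as cp

def kmeans_sdp(X, k):
    """Sketch of the Peng-Wei SDP relaxation of k-means.

    The normalized cluster-membership matrix Z is relaxed to the convex
    set below; we then maximize its alignment with the affinity A = X X^T.
    """
    A = X @ X.T
    n = A.shape[0]
    Z = cp.Variable((n, n), PSD=True)
    constraints = [
        Z >= 0,                  # entrywise non-negative
        cp.sum(Z, axis=1) == 1,  # each row sums to one
        cp.trace(Z) == k,        # exactly k clusters
    ]
    prob = cp.Problem(cp.Maximize(cp.trace(A @ Z)), constraints)
    prob.solve()
    return Z.value
```

In practice the relaxed solution is rounded back to hard cluster assignments, for example by running k-means on the top-$k$ eigenvectors of the recovered $Z$.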
For the networks, we consider the Stochastic Block Model (SBM), in which the probability that an edge is present is fully determined by the cluster assignments of its two endpoints. We use a semi-definite programming (SDP) relaxation to learn the clustering matrix and discuss the role of the model parameters. Most SDP relaxations of the SBM require the number of communities as input, which is a strong requirement for many real-world applications. In this thesis, we propose a nuclear-norm regularization that is shown to exactly recover both the number of communities and the cluster memberships even when the number of communities is unknown.
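To make the idea concrete, here is a minimal sketch of a trace-penalized SDP of the kind described (for a PSD matrix the nuclear norm equals the trace). The constraint set and the recovery of the number of communities from trace(Z) are plausible choices consistent with the abstract, not necessarily the thesis's exact program.

```python
import numpy as np
import cvxpy as cp

def sbm_sdp_unknown_k(A, lam):
    """Trace-penalized SDP for SBM clustering with unknown k (sketch).

    For the normalized clustering matrix Z = sum_k (1/n_k) 1_{C_k} 1_{C_k}^T
    one has Z >= 0, Z 1 = 1, and trace(Z) = number of communities, so the
    penalty lam * trace(Z) (the nuclear norm, since Z is PSD) selects k
    automatically instead of requiring it as input.
    """
    n = A.shape[0]
    Z = cp.Variable((n, n), PSD=True)
    constraints = [Z >= 0, cp.sum(Z, axis=1) == 1]
    prob = cp.Problem(
        cp.Maximize(cp.trace(A @ Z) - lam * cp.trace(Z)), constraints
    )
    prob.solve()
    k_hat = int(round(np.trace(Z.value)))  # estimated number of communities
    return Z.value, k_hat
```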
In many real-world networks, both network structure and node covariates are observed simultaneously. For this case, we present a regularization-based method that effectively combines the two sources of information. The proposed method works especially well when the covariates and the network contain complementary information.
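One plausible instantiation of such a combination, reusing sbm_sdp_unknown_k from the sketch above: build a similarity kernel from the covariates and add it to the adjacency matrix with a tuning weight alpha before solving the same SDP. The Gaussian kernel and the additive combination are illustrative assumptions.

```python
def combined_sdp(A, X, alpha, lam, bandwidth=1.0):
    """Combine adjacency A with a covariate similarity kernel K (sketch)."""
    # pairwise squared distances between covariate rows
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = np.exp(-sq / (2.0 * bandwidth ** 2))  # Gaussian kernel (assumed choice)
    # alpha trades off network structure against covariate similarity
    return sbm_sdp_unknown_k(A + alpha * K, lam)
```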
Probabilistic Best Subset Selection via Gradient-Based Optimization
In high-dimensional statistics, variable selection is an optimization problem that aims to recover the latent sparse pattern from all possible covariate combinations. In this paper, we propose a novel optimization method to solve the exact $\ell_0$-regularized regression problem (a.k.a. best subset selection). We reformulate the optimization problem from a discrete space to a continuous one via probabilistic reparameterization. Within the framework of stochastic gradient descent, we propose a family of unbiased gradient estimators to optimize the $\ell_0$-regularized objective and a variational lower bound. Within this family, we identify the estimator with a non-vanishing signal-to-noise ratio and uniformly minimum variance. Theoretically, we study the general conditions under which the method is guaranteed to converge to the ground truth in expectation. In a wide variety of synthetic and semi-synthetic data sets, the proposed method outperforms existing variable selection methods based on penalized regression and mixed-integer optimization, in both sparse-pattern recovery and out-of-sample prediction. Our method can find the true regression model from thousands of covariates in a couple of seconds.
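A minimal sketch of the reparameterization idea follows: place a Bernoulli gate on each covariate, parameterize the inclusion probabilities by logits, and run stochastic gradient descent on the expected $\ell_0$-penalized loss. The plain score-function (REINFORCE) gradient below is a simpler stand-in for the paper's family of unbiased, variance-reduced estimators, and the fixed least-squares coefficients are a simplifying assumption (the actual method optimizes them jointly).

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def prob_subset_select(X, y, lam=1.0, lr=0.1, n_iters=2000, mc=4, seed=0):
    """Sketch: optimize E_{s ~ Bern(sigmoid(phi))}[loss(s)] over logits phi."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    phi = np.zeros(p)  # logits of the inclusion probabilities
    beta = np.linalg.lstsq(X, y, rcond=None)[0]  # fixed coefficients (simplification)
    for _ in range(n_iters):
        probs = sigmoid(phi)
        grad = np.zeros(p)
        for _ in range(mc):
            s = (rng.random(p) < probs).astype(float)   # sample a subset
            resid = y - X @ (s * beta)
            loss = resid @ resid / n + lam * s.sum()    # squared error + L0 penalty
            # score-function gradient: loss * d log P(s) / d phi = loss * (s - probs)
            grad += loss * (s - probs)
        phi -= lr * grad / mc
    return sigmoid(phi)  # estimated inclusion probabilities per covariate
```

Thresholding the returned probabilities (e.g., at 0.5) yields a hard sparse pattern; in the paper's setting the gates concentrate on the true support as the optimization converges.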