108 research outputs found
Efficient modularity density heuristics in graph clustering and their applications
Modularity Density Maximization is a graph clustering problem that avoids the resolution limit degeneracy of the Modularity Maximization problem. This thesis aims to solve larger instances than current Modularity Density heuristics do, and to show how close the obtained solutions are to the expected clustering. Three main contributions arise from this objective. The first is a set of theoretical results on the properties of Modularity Density based prioritizers. The second is the development of eight Modularity Density Maximization heuristics. Our heuristics are compared with optimal results from the literature, and with the GAOD, iMeme-Net, HAIN, and BMD- heuristics. Our results are also compared with CNM and Louvain, which are heuristics for Modularity Maximization that solve instances with thousands of nodes. The tests were carried out using graphs from the “Stanford Large Network Dataset Collection”. The experiments showed that our eight heuristics found solutions for graphs with hundreds of thousands of nodes, and that five of them surpassed the current state-of-the-art Modularity Density Maximization heuristic solvers for large graphs. The third contribution is the proposal of six column generation methods. These methods use exact and heuristic auxiliary solvers and an initial variable generator. Our proposed column generation methods were also compared with state-of-the-art algorithms. The results showed that: (i) two of our methods surpassed the state-of-the-art algorithms in terms of time, and (ii) our methods proved the optimal value for larger instances than current approaches can tackle. Our results suggest clear improvements to the state-of-the-art results for the Modularity Density Maximization problem.
Interior point methods and simulated annealing for nonsymmetric conic optimization
This thesis explores four methods for convex optimization. The first two are an interior point method and a simulated annealing algorithm that share a theoretical foundation. This connection is due to the interior point method’s use of the so-called entropic barrier, whose derivatives can be approximated through sampling. Here, the sampling is carried out with a technique known as hit-and-run. By carefully analyzing the properties of hit-and-run sampling, it is shown that both the interior point method and the simulated annealing algorithm can solve a convex optimization problem in the membership oracle setting. The number of oracle calls made by these methods is bounded by a polynomial in the input size. The third method is an analytic center cutting plane method that shows promising performance for copositive optimization. It outperforms the first two methods by a significant margin on the problem of separating a matrix from the completely positive cone. The final method is based on Mosek’s algorithm for nonsymmetric conic optimization. With their scaling matrix, search direction, and neighborhood, we define a method that converges to a near-optimal solution in polynomial time.
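As an illustration of the hit-and-run technique in the membership oracle setting mentioned above, here is a minimal sketch. It targets the uniform distribution on a convex body only, not the distributions the thesis actually samples from. The names `hit_and_run`, `inside`, and `radius` are hypothetical. The sketch assumes the body is contained in a ball of the given radius around the origin, so every chord has length at most twice that radius.

```python
import math
import random

def hit_and_run(inside, x0, steps, radius, seed=0, tol=1e-9):
    """Hit-and-run walk for (approximately) uniform sampling from a convex body.

    inside: membership oracle, returns True if a point lies in the body.
    x0:     starting point, assumed to lie in the body.
    radius: assumed bound; the body fits in a ball of this radius at the origin.
    """
    rng = random.Random(seed)
    x = list(x0)
    samples = []

    def reach(d, sign):
        # Bisection on the oracle: how far can we move from x along sign*d
        # and stay inside? `lo` is always a distance known to be inside.
        lo, hi = 0.0, 2.0 * radius
        while hi - lo > tol:
            mid = 0.5 * (lo + hi)
            if inside([xi + sign * mid * di for xi, di in zip(x, d)]):
                lo = mid
            else:
                hi = mid
        return lo

    for _ in range(steps):
        # Uniformly random direction: normalized Gaussian vector.
        g = [rng.gauss(0.0, 1.0) for _ in x]
        norm = math.sqrt(sum(v * v for v in g))
        d = [v / norm for v in g]
        # Uniform point on the chord through x in direction d.
        t = rng.uniform(-reach(d, -1.0), reach(d, 1.0))
        x = [xi + t * di for xi, di in zip(x, d)]
        samples.append(x)
    return samples
```

For example, `hit_and_run(lambda p: sum(v * v for v in p) <= 1.0, [0.0, 0.0], 100, 1.0)` walks inside the unit disk using only membership queries.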
Theoretical analysis for convex and non-convex clustering algorithms
Clustering is one of the most important unsupervised learning problems in the machine learning and statistics community. Given a set of observations, the goal is to find the latent cluster assignment of the data points. The observations can be either covariates corresponding to each data point, or relational networks representing the affinity between pairs of nodes. We study the problem of community detection in stochastic block models and clustering in mixture models. The two kinds of problems bear a lot of resemblance, and similar techniques can be applied to solve them.
It is common practice to assume some underlying model for the data generating process in order to analyze it properly. With some pre-defined partition of all data points, generative models can be defined to represent those two types of data observations. For the covariates, the mixture model is one of the most flexible and widely used models, where each cluster i comes from some distribution D_i, and the entire distribution is a convex combination of the distributions D_i. We assume that the data is Gaussian or sub-Gaussian, and analyze two algorithms: 1) the Expectation-Maximization algorithm, which is notoriously non-convex and sensitive to local optima, and 2) a convex relaxation of the k-means algorithm. We show that both methods are consistent under certain conditions when the signal-to-noise ratio is relatively high, and we obtain upper bounds on the error rate when the signal-to-noise ratio is low. When there are outliers in the data set, we show that the semi-definite relaxation gives more robust results than spectral methods.
For the networks, we consider the Stochastic Block Model (SBM), in which the probability of edge presence is fully determined by the cluster assignments of the pair of nodes. We use a semi-definite programming (SDP) relaxation to learn the clustering matrix, and discuss the role of model parameters. In most SDP relaxations of SBM, the number of communities is required for the algorithm, which is a strong requirement for many real-world applications. In this thesis, we propose to introduce a regularization to the nuclear norm, which is shown to be able to exactly recover both the number of communities and cluster memberships even when the number of communities is unknown.
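To make the Stochastic Block Model definition above concrete, here is a minimal sketch of sampling a network in which each edge's probability depends only on the cluster assignments of its two endpoints. The function name `sample_sbm` and its parameters are hypothetical, and the sketch covers only the generative model, not the SDP relaxation or the regularized method proposed in the thesis.

```python
import random

def sample_sbm(labels, probs, seed=0):
    """Sample a symmetric 0/1 adjacency matrix from a stochastic block model.

    labels[u]   : cluster index of node u.
    probs[a][b] : probability of an edge between a node in cluster a
                  and a node in cluster b (assumed symmetric).
    """
    rng = random.Random(seed)
    n = len(labels)
    adj = [[0] * n for _ in range(n)]
    for u in range(n):
        for v in range(u + 1, n):
            # Edge presence depends only on the endpoints' clusters.
            p = probs[labels[u]][labels[v]]
            if rng.random() < p:
                adj[u][v] = adj[v][u] = 1
    return adj
```

With within-cluster probability 1 and between-cluster probability 0, the sampled graph is exactly a disjoint union of cliques, one per cluster, which is the noiseless case that recovery methods are meant to approach.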
In many real-world networks, it is common to observe both network structure and node covariates simultaneously. In this case, we present a regularization-based method to effectively combine the two sources of information. The proposed method works especially well when the covariates and network contain complementary information.