Beyond the arithmetic mean: extensions of spectral clustering and semi-supervised learning for signed and multilayer graphs via matrix power means
In this thesis we present extensions of spectral clustering and semi-supervised learning to signed and multilayer graphs. These extensions are based on a one-parameter family of matrix functions called Matrix Power Means. In the scalar case, this family has the arithmetic, geometric, and harmonic means as particular cases. We study the effectiveness of this family of matrix functions through versions of the stochastic block model suited to signed and multilayer graphs. We provide provable properties in expectation and further identify regimes where the state of the art fails whereas our approach provably performs well. Among the settings we analyze are: first, the case where each layer presents a reliable approximation to the overall clustering; second, the case where a single layer carries information about the clusters while the remaining layers are potentially just noise; third, the case where each layer has only partial information but all layers together reveal global information about the underlying clustering structure. We present extensive numerical verification of all our results and provide matrix-free numerical schemes, with which we show that our proposed approach based on matrix power means scales to large sparse signed and multilayer graphs. Finally, we evaluate our methods on real-world datasets. For instance, we show that our approach consistently identifies clustering structure in a real signed network where previous approaches failed. This further verifies that our methods are competitive with the state of the art.
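As a concrete illustration of the scalar-to-matrix analogy, here is a minimal numpy/scipy sketch of the power mean of symmetric positive definite matrices. The function name is illustrative, not the thesis's actual code, and note that graph Laplacians are only positive semidefinite, so in practice a small diagonal shift is needed before taking negative powers:

```python
import numpy as np
from scipy.linalg import fractional_matrix_power

def matrix_power_mean(mats, p):
    """Power mean M_p = ((1/T) * sum_i A_i^p)^(1/p) of SPD matrices A_1..A_T.

    p = 1 recovers the arithmetic mean, p = -1 the harmonic mean,
    and p -> 0 approaches the geometric mean (p = 0 itself is excluded here).
    """
    avg = sum(fractional_matrix_power(A, p) for A in mats) / len(mats)
    return fractional_matrix_power(avg, 1.0 / p)

# scalar sanity check via diagonal matrices: power means of 2 and 8
arith = matrix_power_mean([2 * np.eye(3), 8 * np.eye(3)], 1.0)   # arithmetic mean: 5*I
harm = matrix_power_mean([2 * np.eye(3), 8 * np.eye(3)], -1.0)   # harmonic mean: 3.2*I
```

Negative values of p emphasize agreement across the input matrices, which is the regime the thesis exploits for layers that individually carry only partial or noisy information.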
Hypothesis Testing For Network Data in Functional Neuroimaging
In recent years, it has become common practice in neuroscience to use
networks to summarize relational information in a set of measurements,
typically assumed to be reflective of either functional or structural
relationships between regions of interest in the brain. One of the most basic
tasks of interest in the analysis of such data is the testing of hypotheses, in
answer to questions such as "Is there a difference between the networks of
these two groups of subjects?" In the classical setting, where the unit of
interest is a scalar or a vector, such questions are answered through the use
of familiar two-sample testing strategies. Networks, however, are not Euclidean
objects, and hence classical methods do not directly apply. We address this
challenge by drawing on concepts and techniques from geometry, and
high-dimensional statistical inference. Our work is based on a precise
geometric characterization of the space of graph Laplacian matrices and a
nonparametric notion of averaging due to Fr\'echet. We motivate and illustrate
our resulting methodologies for testing in the context of networks derived from
functional neuroimaging data on human subjects from the 1000 Functional
Connectomes Project. In particular, we show that this global test is more
statistically powerful than a mass-univariate approach. In addition, we
provide a method for visualizing the individual contribution of each edge
to the overall test statistic. Comment: 34 pages, 5 figures
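Under the plain Frobenius metric the Fréchet mean of a sample of Laplacians reduces to the entrywise average, which makes a simple permutation version of such a two-sample test easy to sketch. The paper's actual test rests on a finer geometric characterization of the space of graph Laplacians and asymptotic theory; this numpy sketch, with hypothetical helper names, only conveys the overall shape of a Fréchet-mean-based test:

```python
import numpy as np

def laplacian(A):
    """Combinatorial graph Laplacian L = D - A."""
    return np.diag(A.sum(axis=1)) - A

def frechet_mean(Ls):
    # Under the Frobenius metric the Fréchet mean is the entrywise average.
    return np.mean(Ls, axis=0)

def permutation_test(Ls1, Ls2, n_perm=1000, seed=0):
    """Permutation p-value for the distance between group Fréchet means."""
    rng = np.random.default_rng(seed)
    pooled = np.concatenate([Ls1, Ls2])
    n1 = len(Ls1)
    obs = np.linalg.norm(frechet_mean(Ls1) - frechet_mean(Ls2))
    count = 0
    for _ in range(n_perm):
        idx = rng.permutation(len(pooled))
        stat = np.linalg.norm(frechet_mean(pooled[idx[:n1]])
                              - frechet_mean(pooled[idx[n1:]]))
        count += stat >= obs
    return (count + 1) / (n_perm + 1)

# demo: two groups of random graphs with clearly different edge densities
rng = np.random.default_rng(42)
def random_graph(n, p):
    A = (rng.random((n, n)) < p).astype(float)
    A = np.triu(A, 1)
    return A + A.T

Ls1 = np.array([laplacian(random_graph(12, 0.2)) for _ in range(15)])
Ls2 = np.array([laplacian(random_graph(12, 0.6)) for _ in range(15)])
pval = permutation_test(Ls1, Ls2)
```

With such a strong density difference the permutation p-value is essentially at its floor of 1/(n_perm + 1).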
A nonparametric two-sample hypothesis testing problem for random dot product graphs
We consider the problem of testing whether two finite-dimensional random dot
product graphs have generating latent positions that are independently drawn
from the same distribution, or distributions that are related via scaling or
projection. We propose a test statistic that is a kernel-based function of the
adjacency spectral embedding for each graph. We obtain a limiting distribution
for our test statistic under the null and we show that our test procedure is
consistent across a broad range of alternatives. Comment: 24 pages, 1 figure
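The two ingredients of such a test, the adjacency spectral embedding of each graph and a kernel-based two-sample statistic on the embedded points, can be sketched in a few lines of numpy. This uses a generic squared maximum mean discrepancy with a Gaussian kernel rather than the paper's exact construction; the bandwidth sigma and the eigenvalue-scaling convention are illustrative choices:

```python
import numpy as np

def ase(A, d):
    """Adjacency spectral embedding: scaled top-d eigenvectors (by |eigenvalue|)."""
    vals, vecs = np.linalg.eigh(A)
    idx = np.argsort(np.abs(vals))[::-1][:d]
    return vecs[:, idx] * np.sqrt(np.abs(vals[idx]))

def mmd2(X, Y, sigma=1.0):
    """Biased squared maximum mean discrepancy with a Gaussian kernel."""
    def gram(a, b):
        d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2.0 * sigma ** 2))
    return gram(X, X).mean() - 2.0 * gram(X, Y).mean() + gram(Y, Y).mean()
```

In use, one would embed both graphs with `ase`, compute `mmd2` on the two point clouds, and calibrate the statistic (e.g. by bootstrapping), as the paper does with its limiting distribution.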
Community Detection and Classification Guarantees Using Embeddings Learned by Node2Vec
Embedding the nodes of a large network into a Euclidean space is a common
objective in modern machine learning, with a variety of tools available. These
embeddings can then be used as features for tasks such as community
detection/node clustering or link prediction, where they achieve
state-of-the-art performance. With the exception of spectral clustering
methods, there is little theoretical understanding of other commonly used
approaches to learning embeddings. In this work we examine the theoretical
properties of the
embeddings learned by node2vec. Our main result shows that the use of k-means
clustering on the embedding vectors produced by node2vec gives weakly
consistent community recovery for the nodes in (degree corrected) stochastic
block models. We also discuss the use of these embeddings for node and link
prediction tasks. We demonstrate this result empirically and examine how it
relates to other embedding tools for network data.
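A minimal end-to-end version of the community-recovery experiment can be sketched with numpy. To keep the sketch dependency-free, the node2vec step is replaced here by a plain spectral embedding (top eigenvectors of the adjacency matrix), so this illustrates the embed-then-k-means pipeline on a two-block stochastic block model rather than the paper's actual node2vec result:

```python
import numpy as np

rng = np.random.default_rng(0)

# two-block SBM: within-block edge prob 0.5, between-block 0.05
n, k = 120, 2
z = np.repeat([0, 1], n // 2)                      # ground-truth communities
P = np.where(z[:, None] == z[None, :], 0.5, 0.05)  # edge probabilities
A = (rng.random((n, n)) < P).astype(float)
A = np.triu(A, 1)
A = A + A.T

# embed nodes: top-k eigenvectors of A (spectral stand-in for node2vec)
vals, vecs = np.linalg.eigh(A)
emb = vecs[:, np.argsort(np.abs(vals))[::-1][:k]]

# k-means (Lloyd's algorithm) with farthest-point initialization
c0 = emb[0]
c1 = emb[np.argmax(((emb - c0) ** 2).sum(-1))]
centers = np.stack([c0, c1])
for _ in range(30):
    labels = np.argmin(((emb[:, None] - centers[None]) ** 2).sum(-1), axis=1)
    centers = np.array([emb[labels == j].mean(0) if (labels == j).any()
                        else centers[j] for j in range(k)])

# agreement with the true communities, up to label swap
acc = max((labels == z).mean(), (labels != z).mean())
```

With this separation between the blocks, the clustering recovers the ground-truth communities almost exactly, mirroring the weak-consistency guarantee the paper proves for node2vec embeddings.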
Foundations of Adjacency Spectral Embedding
The eigendecomposition of an adjacency matrix provides a way to embed a graph as points in finite dimensional Euclidean space.
This embedding allows the full arsenal of statistical and machine learning methodology for multivariate Euclidean data to be deployed for graph inference.
Our work analyzes this embedding, a graph version of principal component analysis, in the context of various random graph models with a focus on the impact for subsequent inference.
For the stochastic blockmodel, with a finite number of blocks of stochastically equivalent vertices, Sussman et al. (2012),
Fishkind et al. (2013) and Lyzinski et al. (2013) show that clustering the embedded points using k-means accurately partitions the vertices into the correct blocks, even when the embedding dimension is misspecified or the number of blocks is unknown.
For the more general random dot product graph model, an example of a latent position model, Sussman et al. (2013) show that the latent positions are consistently estimated by the embedding, which then allows for accurate learning in a supervised vertex classification framework. Tang et al. (2012) strengthen these results to more general latent position models.
Athreya et al. (2013) provide distributional results, akin to a central limit theorem, for the residuals between the estimated and true latent positions, which offers the potential for a deeper understanding of these methods.
In summary, these papers demonstrate that for a broad class of graph models and inference tasks, adjacency spectral embedding allows for accurate graph inference via standard multivariate methodology.
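The core claim, that the scaled top eigenvectors of the adjacency matrix consistently estimate the latent positions of a random dot product graph up to an orthogonal transformation, can be checked numerically in a short numpy sketch. The quarter-circle latent distribution and the sample size are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 400, 2

# latent positions on a quarter circle so that P = X X^T has entries in (0, 1)
t = rng.uniform(0, np.pi / 2, size=n)
X = 0.9 * np.column_stack([np.cos(t), np.sin(t)])
P = X @ X.T

# sample a random dot product graph: A_ij ~ Bernoulli(P_ij), symmetric, hollow
A = (rng.random((n, n)) < P).astype(float)
A = np.triu(A, 1)
A = A + A.T

# adjacency spectral embedding: top-d scaled eigenvectors
vals, vecs = np.linalg.eigh(A)
idx = np.argsort(vals)[::-1][:d]
Xhat = vecs[:, idx] * np.sqrt(vals[idx])

# latent positions are identified only up to an orthogonal transform,
# so align with orthogonal Procrustes before measuring the error
U, _, Vt = np.linalg.svd(Xhat.T @ X)
rel_err = np.linalg.norm(Xhat @ (U @ Vt) - X) / np.linalg.norm(X)
```

The relative error after alignment is small and shrinks as n grows, which is exactly the consistency behavior the surveyed papers establish.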