The impossibility of low rank representations for triangle-rich complex networks
The study of complex networks is a significant development in modern science,
and has enriched the social sciences, biology, physics, and computer science.
Models and algorithms for such networks are pervasive in our society, and
impact human behavior via social networks, search engines, and recommender
systems to name a few. A widely used algorithmic technique for modeling such
complex networks is to construct a low-dimensional Euclidean embedding of the
vertices of the network, where proximity of vertices is interpreted as the
likelihood of an edge. Contrary to the common view, we argue that such graph
embeddings do not capture salient properties of complex networks. The two
properties we focus on are low degree and large clustering coefficients, which
have been widely established to be empirically true for real-world networks. We
mathematically prove that any embedding (that uses dot products to measure
similarity) that can successfully create these two properties must have rank
nearly linear in the number of vertices. Among other implications, this
establishes that popular embedding techniques such as Singular Value
Decomposition and node2vec fail to capture significant structural aspects of
real-world complex networks. Furthermore, we empirically study a number of
different embedding techniques based on dot product, and show that they all
fail to capture the triangle structure.
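The setup the abstract describes can be sketched in a few lines: embed the vertices with a truncated SVD, score pairs by dot products, and compare the clustering coefficient of the original graph with that of the reconstruction. The toy graph, rank, and threshold below are illustrative choices, not taken from the paper:

```python
import numpy as np

def clustering_coefficient(A):
    """Global clustering coefficient: 3 * triangles / connected triples."""
    A = (A > 0).astype(float)
    triangles = np.trace(A @ A @ A) / 6.0   # trace(A^3) counts each triangle 6 times
    deg = A.sum(axis=1)
    triples = (deg * (deg - 1) / 2).sum()
    return 3.0 * triangles / triples if triples > 0 else 0.0

# Toy graph: two triangles sharing vertex 2 (triangle-rich, low degree).
A = np.zeros((5, 5))
for i, j in [(0, 1), (1, 2), (0, 2), (2, 3), (3, 4), (2, 4)]:
    A[i, j] = A[j, i] = 1

# Rank-2 dot-product embedding from the truncated SVD of A.
U, s, Vt = np.linalg.svd(A)
k = 2
X = U[:, :k] * np.sqrt(s[:k])   # one vector per node
A_hat = X @ X.T                 # dot-product similarity scores

# Predict edges by thresholding the scores and compare triangle structure.
pred = (A_hat > 0.5).astype(float)
np.fill_diagonal(pred, 0)
print("original clustering:", round(clustering_coefficient(A), 3))
print("rank-2 clustering:  ", round(clustering_coefficient(pred), 3))
```

The paper's theorem concerns graphs at scale; at five nodes any rank suffices, so the sketch only illustrates the objects involved (embedding, dot-product scores, clustering coefficient), not the impossibility result itself.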
Foundations of Node Representation Learning
Low-dimensional node representations, also called node embeddings, are a cornerstone in the modeling and analysis of complex networks. In recent years, advances in deep learning have spurred development of novel neural network-inspired methods for learning node representations which have largely surpassed classical 'spectral' embeddings in performance. Yet little work asks the central questions of this thesis: Why do these novel deep methods outperform their classical predecessors, and what are their limitations?
We pursue several paths to answering these questions. To further our understanding of deep embedding methods, we explore their relationship with spectral methods, which are better understood, and show that some popular deep methods are equivalent to spectral methods in a certain natural limit. We also introduce the problem of inverting node embeddings in order to probe what information they contain. Further, we propose a simple, non-deep method for node representation learning, and find it to often be competitive with modern deep graph networks in downstream performance.
To better understand the limitations of node embeddings, we prove some upper and lower bounds on their capabilities. Most notably, we prove that node embeddings are capable of exact low-dimensional representation of networks with bounded max degree or arboricity, and we further show that a simple algorithm can find such exact embeddings for real-world networks. By contrast, we also prove inherent bounds on random graph models, including those derived from node embeddings, to capture key structural properties of networks without simply memorizing a given graph.
Implications of sparsity and high triangle density for graph representation learning
Recent work has shown that sparse graphs containing many triangles cannot be
reproduced using a finite-dimensional representation of the nodes, in which
link probabilities are inner products. Here, we show that such graphs can be
reproduced using an infinite-dimensional inner product model, where the node
representations lie on a low-dimensional manifold. Recovering a global
representation of the manifold is impossible in a sparse regime. However, we
can zoom in on local neighbourhoods, where a lower-dimensional representation
is possible. As our constructions allow the points to be uniformly distributed
on the manifold, we find evidence against the common perception that triangles
imply community structure.
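The geometric intuition can be illustrated with a simple construction (a hard-threshold kernel on a circle; this is a sketch of the idea, not the paper's inner-product model, and all parameters are illustrative): nodes placed uniformly on a one-dimensional manifold, linked when their geodesic distance is small, yield a sparse graph with high triangle density and no community structure.

```python
import numpy as np

rng = np.random.default_rng(1)

# Nodes uniformly distributed on a circle (a 1-D manifold), linked when
# their angular distance falls below a small radius.
n, radius = 300, 0.1
theta = rng.uniform(0, 2 * np.pi, n)
d = np.abs(theta[:, None] - theta[None, :])
d = np.minimum(d, 2 * np.pi - d)      # geodesic distance on the circle
A = (d < radius).astype(float)
np.fill_diagonal(A, 0)

# The resulting graph is sparse yet triangle-rich: any two neighbours of
# a node are likely close to each other as well.
triangles = np.trace(A @ A @ A) / 6
deg = A.sum(axis=1)
triples = (deg * (deg - 1) / 2).sum()
print("mean degree:", round(deg.mean(), 2))
print("clustering: ", round(3 * triangles / triples, 3))
```

Despite the high clustering, the node positions are uniform on the circle, so there are no clusters in the embedding space, matching the abstract's point that triangles need not imply community structure.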
Extending adjacency matrices to 3D with triangles
Social networks are the fabric of society and the subject of frequent visual
analysis. Closed triads represent triangular relationships between three people
in a social network and are significant for understanding inherent
interconnections and influence within the network. The most common methods for
representing social networks (node-link diagrams and adjacency matrices) are
not optimal for understanding triangles. We propose extending the adjacency
matrix form to 3D for better visualization of network triads. We design a 3D
matrix reordering technique and implement an immersive interactive system to
assist in visualizing and analyzing closed triads in social networks. A user
study and usage scenarios demonstrate that our method provides substantial
added value over node-link diagrams in improving the efficiency and accuracy of
manipulating and understanding the social network triads.
Comment: 10 pages, 8 figures and 3 tables
Exact Representation of Sparse Networks with Symmetric Nonnegative Embeddings
Many models for undirected graphs are based on factorizing the graph's
adjacency matrix; these models find a vector representation of each node such
that the predicted probability of a link between two nodes increases with the
similarity (dot product) of their associated vectors. Recent work has shown
that these models are unable to capture key structures in real-world graphs,
particularly heterophilous structures, wherein links occur between dissimilar
nodes. In contrast, a factorization with two vectors per node, based on
logistic principal components analysis (LPCA), has been proven not only to
represent such structures, but also to provide exact low-rank factorization of
any graph with bounded max degree. However, this bound has limited
applicability to real-world networks, which often have power law degree
distributions with high max degree. Further, the LPCA model lacks
interpretability since its asymmetric factorization does not reflect the
undirectedness of the graph. We address these issues in two ways. First, we
prove a new bound for the LPCA model in terms of arboricity rather than max
degree; this greatly increases the bound's applicability to many sparse
real-world networks. Second, we propose an alternative graph model whose
factorization is symmetric and nonnegative, which allows for link predictions
to be interpreted in terms of node clusters. We show that the bounds for exact
representation in the LPCA model extend to our new model. On the empirical
side, our model is optimized effectively on real-world graphs with gradient
descent on a cross-entropy loss. We demonstrate its effectiveness on a variety
of foundational tasks, such as community detection and link prediction.
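A minimal sketch of the kind of model described, fit by projected gradient descent on the cross-entropy loss (the toy graph, rank, bias, and step size are assumptions for illustration, and this is not the paper's algorithm or bound construction):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy undirected graph: two triangles sharing node 2.
n = 5
A = np.zeros((n, n))
for i, j in [(0, 1), (1, 2), (0, 2), (2, 3), (3, 4), (2, 4)]:
    A[i, j] = A[j, i] = 1

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Symmetric nonnegative model: link probability sigmoid(w_i . w_j - b),
# with W >= 0 enforced by projecting after each gradient step.
k, lr, b = 2, 0.5, 2.0
W = rng.random((n, k))
mask = ~np.eye(n, dtype=bool)        # ignore the diagonal (no self-loops)
for _ in range(2000):
    P = sigmoid(W @ W.T - b)
    G = (P - A) * mask               # gradient of cross-entropy w.r.t. logits
    W -= lr * (G + G.T) @ W / n
    W = np.clip(W, 0.0, None)        # project onto the nonnegative orthant

P = sigmoid(W @ W.T - b)
print("mean predicted prob on edges:    ", P[A > 0].mean())
print("mean predicted prob on non-edges:", P[(A == 0) & mask].mean())
```

Because W is nonnegative, each coordinate can be read as a soft cluster membership: a predicted link is strong only when two nodes share mass in some coordinate, which is the interpretability property the abstract highlights.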
BOOL-AN: A method for comparative sequence analysis and phylogenetic reconstruction
A novel discrete mathematical approach is proposed as an additional tool for molecular systematics which does not require prior statistical assumptions concerning the evolutionary process. The method is based on algorithms generating mathematical representations directly from DNA/RNA or protein sequences, followed by the output of numerical (scalar or vector) and visual characteristics (graphs). The binary encoded sequence information is transformed into a compact analytical form, called the Iterative Canonical Form (or ICF) of Boolean functions, which can then be used as a generalized molecular descriptor. The method provides raw vector data for calculating different distance matrices, which in turn can be analyzed by neighbor-joining or UPGMA to derive a phylogenetic tree, or by principal coordinates analysis to get an ordination scattergram. The new method and the associated software for inferring phylogenetic trees are called the Boolean analysis or BOOL-AN.
Uncertainty in Natural Language Generation: From Theory to Applications
Recent advances of powerful Language Models have allowed Natural Language
Generation (NLG) to emerge as an important technology that can not only perform
traditional tasks like summarisation or translation, but also serve as a
natural language interface to a variety of applications. As such, it is crucial
that NLG systems are trustworthy and reliable, for example by indicating when
they are likely to be wrong; and supporting multiple views, backgrounds and
writing styles -- reflecting diverse human sub-populations. In this paper, we
argue that a principled treatment of uncertainty can assist in creating systems
and evaluation protocols better aligned with these goals. We first present the
fundamental theory, frameworks and vocabulary required to represent
uncertainty. We then characterise the main sources of uncertainty in NLG from a
linguistic perspective, and propose a two-dimensional taxonomy that is more
informative and faithful than the popular aleatoric/epistemic dichotomy.
Finally, we move from theory to applications and highlight exciting research
directions that exploit uncertainty to power decoding, controllable generation,
self-assessment, selective answering, active learning and more.