Network Alignment by Discrete Ollivier-Ricci Flow
In this paper, we consider the problem of approximately aligning/matching two
graphs. Given two graphs G1 and G2, the objective is to map nodes of G1 to
nodes of G2 such that whenever two nodes share an edge in G1, their
corresponding nodes in G2 are very likely connected as well. This problem,
which has subgraph isomorphism as a special case, poses extra challenges when
matching complex networks exhibiting the small-world phenomenon. In this work,
we propose to use a "Ricci flow metric" to define the distance between two
nodes in a network. This is then used to define the similarity of a pair of
nodes, one from each network, which is the crucial step of network alignment.
Specifically, the Ricci curvature of an edge
describes intuitively how well the local neighborhood is connected. The graph
Ricci flow uniformizes discrete Ricci curvature and induces a Ricci flow metric
that is insensitive to node/edge insertions and deletions. With the new metric,
we can map a node in G1 to the node in G2 whose vector of distances to a few
preselected landmarks is the most similar. The robustness of the graph
metric makes it outperform other methods when tested on various complex graph
models and real-world network data sets (Emails, Internet, and protein
interaction networks). The source code for computing the Ricci curvature and
the Ricci flow metric is available at
https://github.com/saibalmars/GraphRicciCurvature.
Comment: Appears in the Proceedings of the 26th International Symposium on
Graph Drawing and Network Visualization (GD 2018).
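To make the matching step concrete, here is a minimal Python sketch of the
landmark-based alignment described above. It assumes each graph already stores
the Ricci flow metric as a "weight" edge attribute (e.g., as computed with the
authors' GraphRicciCurvature package), that both graphs are connected, and
that the two landmark lists correspond to each other; the nearest-signature
rule is the abstract's heuristic, not the paper's exact implementation.

    import networkx as nx
    import numpy as np

    def landmark_signature(G, node, landmarks):
        # Distances from `node` to the preselected landmarks, measured
        # under the "weight" edge attribute (the Ricci flow metric).
        dist = nx.single_source_dijkstra_path_length(G, node, weight="weight")
        return np.array([dist[l] for l in landmarks])

    def align(G1, G2, landmarks1, landmarks2):
        # Map each node of G1 to the node of G2 whose landmark distance
        # vector is closest in Euclidean norm.
        sigs2 = {v: landmark_signature(G2, v, landmarks2) for v in G2}
        mapping = {}
        for u in G1:
            s1 = landmark_signature(G1, u, landmarks1)
            mapping[u] = min(sigs2, key=lambda v: np.linalg.norm(s1 - sigs2[v]))
        return mapping

The robustness claim in the abstract is what makes this simple rule viable:
because the Ricci flow metric changes little under node/edge perturbations,
the landmark signatures of corresponding nodes stay close across the two
graphs.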
On Approximation Guarantees for Greedy Low Rank Optimization
We provide new approximation guarantees for greedy low rank matrix estimation
under standard assumptions of restricted strong convexity and smoothness. Our
novel analysis also uncovers previously unknown connections between the low
rank estimation and combinatorial optimization, so much so that our bounds are
reminiscent of corresponding approximation bounds in submodular maximization.
Additionally, we provide statistical recovery guarantees. Finally, we present
an empirical comparison of greedy estimation with established baselines on two
important real-world problems.
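As a concrete (though deliberately simplified) instance of the greedy scheme
analyzed here, the sketch below performs rank-one greedy updates for the
quadratic loss f(X) = ||A - X||_F^2 / 2. The choice of loss and the exact line
search are illustrative assumptions; the paper treats general restricted
strongly convex and smooth objectives.

    import numpy as np

    def greedy_low_rank(A, rank):
        X = np.zeros_like(A, dtype=float)
        for _ in range(rank):
            R = A - X                        # negative gradient of f at X
            U, s, Vt = np.linalg.svd(R, full_matrices=False)
            # Greedy step: add the top singular pair of the gradient,
            # with step size given by exact line search for this loss.
            X += s[0] * np.outer(U[:, 0], Vt[0])
        return X

For this quadratic loss the greedy iterates simply recover the truncated SVD;
the content of the analysis is that comparable approximation guarantees
survive for losses where no such closed form exists.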
Graph2Seq: Scalable Learning Dynamics for Graphs
Neural networks have been shown to be an effective tool for learning
algorithms over graph-structured data. However, graph representation
techniques---that convert graphs to real-valued vectors for use with neural
networks---are still in their infancy. Recent works have proposed several
approaches (e.g., graph convolutional networks), but these methods have
difficulty scaling and generalizing to graphs with different sizes and shapes.
We present Graph2Seq, a new technique that represents vertices of graphs as
infinite time-series. By not limiting the representation to a fixed dimension,
Graph2Seq scales naturally to graphs of arbitrary sizes and shapes. Graph2Seq
is also reversible, allowing full recovery of the graph structure from the
sequences. By analyzing a formal computational model for graph representation,
we show that an unbounded sequence is necessary for scalability. Our
experimental results with Graph2Seq show strong generalization and new
state-of-the-art performance on a variety of graph combinatorial optimization
problems.
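The following toy sketch conveys the core idea of a time-series vertex
representation: each vertex repeatedly aggregates its neighbors' states, and
the trajectory of states, rather than a fixed-size vector, is the
representation. The specific update rule here is an assumption made for
illustration; in the paper the dynamics are learned.

    import numpy as np
    import networkx as nx

    def vertex_sequences(G, steps, seed=0):
        rng = np.random.default_rng(seed)
        x = {v: rng.standard_normal() for v in G}
        seq = {v: [x[v]] for v in G}
        for _ in range(steps):
            # Each vertex combines its own state with its neighbors';
            # the growing trajectory is the vertex's representation.
            x = {v: np.tanh(x[v] + sum(x[u] for u in G.neighbors(v)))
                 for v in G}
            for v in G:
                seq[v].append(x[v])
        return seq

Because the sequence can be extended indefinitely, the representation is not
tied to any fixed dimension, which is what lets it transfer across graphs of
different sizes and shapes.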
Absorbing random-walk centrality: Theory and algorithms
We study a new notion of graph centrality based on absorbing random walks.
Given a graph G and a set of query nodes Q, we aim to
identify the k most central nodes in G with respect to Q. Specifically,
we consider central nodes to be absorbing for random walks that start at the
query nodes Q. The goal is to find the set of k central nodes that
minimizes the expected length of a random walk until absorption. The proposed
measure, which we call absorbing random-walk centrality, favors diverse
sets, as it is beneficial to place the absorbing nodes in different parts
of the graph so as to "intercept" random walks that start from different query
nodes.
Although similar problem definitions have been considered in the literature,
e.g., in information-retrieval settings where the goal is to diversify
web-search results, in this paper we study the problem formally and prove some
of its properties. We show that the problem is NP-hard, while the objective
function is monotone and supermodular, implying that a greedy algorithm
provides solutions with an approximation guarantee. On the other hand, the
greedy algorithm involves expensive matrix operations that make it prohibitive
to employ on large datasets. To confront this challenge, we develop more
efficient algorithms based on spectral clustering and on personalized PageRank.
Comment: 11 pages, 11 figures, short paper to appear at ICDM 2015.
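For concreteness, a naive sketch of both the objective and the greedy
algorithm is given below (Python with numpy/networkx as illustrative
choices). Given a candidate absorbing set C, the expected steps to absorption
from each transient node come from one linear solve against I - P_TT, which is
exactly the expensive matrix operation that the faster spectral and
PageRank-based algorithms avoid.

    import numpy as np
    import networkx as nx

    def expected_absorption_time(G, C, Q):
        # Mean number of random-walk steps, starting from the query
        # nodes Q, before hitting the absorbing set C.
        nodes = list(G)
        P = nx.to_numpy_array(G, nodelist=nodes)
        P = P / P.sum(axis=1, keepdims=True)          # transition matrix
        T = [i for i, v in enumerate(nodes) if v not in C]
        t = np.linalg.solve(np.eye(len(T)) - P[np.ix_(T, T)], np.ones(len(T)))
        steps = {nodes[i]: t[k] for k, i in enumerate(T)}
        return np.mean([steps.get(q, 0.0) for q in Q])  # absorbed queries: 0

    def greedy_centrality(G, Q, k):
        # Supermodularity of the objective is what gives this greedy
        # loop its approximation guarantee.
        C = set()
        for _ in range(k):
            C.add(min(set(G) - C,
                      key=lambda v: expected_absorption_time(G, C | {v}, Q)))
        return C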
Ranking ideas for diversity and quality
When selecting ideas or trying to find inspiration, designers often must sift
through hundreds or thousands of ideas. This paper provides an algorithm to
rank design ideas such that the ranked list simultaneously maximizes the
quality and diversity of recommended designs. To do so, we first define and
compare two diversity measures using Determinantal Point Processes (DPP) and
additive sub-modular functions. We show that DPPs are more suitable for items
expressed as text and that a greedy algorithm diversifies rankings with both
theoretical guarantees and empirical performance on what is otherwise an
NP-Hard problem. To produce such rankings, this paper contributes a novel way
to extend quality and diversity metrics from sets to permutations of ranked
lists.
These rank metrics open up the use of multi-objective optimization to
describe trade-offs between diversity and quality in ranked lists. We use such
trade-off fronts to help designers select rankings using indifference curves.
However, we also show that rankings on the trade-off front share a number of
top-ranked items; this means reviewing items (for a given depth like the top
10) from across the entire diversity-to-quality front incurs only a marginal
increase in the number of designs considered. While the proposed techniques are
general purpose enough to be used across domains, we demonstrate concrete
performance on selecting items in an online design community (OpenIDEO), where
our approach reduces the time required to review diverse, high-quality ideas
from around 25 hours to 90 minutes. This makes evaluation of crowd-generated
ideas tractable for a single designer. Our code is publicly accessible for
further research.
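A minimal sketch of DPP-based greedy ranking is shown below. It assumes a
precomputed positive semidefinite kernel L, for instance the standard
quality-diversity construction L[i, j] = q_i * sim(i, j) * q_j; the greedy
log-determinant rule is the usual DPP MAP heuristic rather than the paper's
full multi-objective machinery.

    import numpy as np

    def greedy_dpp_ranking(L, depth):
        remaining = set(range(L.shape[0]))
        ranked = []
        for _ in range(depth):
            # Add the item with the largest log-det gain, i.e. the one
            # contributing the most quality-weighted diversity.
            def gain(i):
                S = ranked + [i]
                return np.linalg.slogdet(L[np.ix_(S, S)])[1]
            best = max(remaining, key=gain)
            ranked.append(best)
            remaining.remove(best)
        return ranked

Because each prefix of the ranked list is itself the greedy selection for that
depth, a designer who reviews only the top 10 still inherits the diversity
property at that depth, which is what makes the ranked-list extension of the
set metrics useful in practice.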
Global and Local Structure Preserving Sparse Subspace Learning: An Iterative Approach to Unsupervised Feature Selection
As we aim at alleviating the curse of high-dimensionality, subspace learning
is becoming more popular. Existing approaches use information about either
the global or the local structure of the data; few studies focus on both
simultaneously, even though both contain important information.
In this paper, we propose a global and local structure preserving sparse
subspace learning (GLoSS) model for unsupervised feature selection. The model
can simultaneously realize feature selection and subspace learning. In
addition, we develop a greedy algorithm to establish a generic combinatorial
model, and an iterative strategy based on an accelerated block coordinate
descent is used to solve the GLoSS problem. We also provide a convergence
analysis for the whole iterate sequence of the proposed algorithm. Extensive
experiments are conducted on real-world datasets to show the superiority of the
proposed approach over several state-of-the-art unsupervised feature selection
approaches.
Comment: 32 pages, 6 figures, and 60 references.
Greedy Column Subset Selection for Large-scale Data Sets
In today's information systems, the availability of massive amounts of data
necessitates the development of fast and accurate algorithms to summarize these
data and represent them in a succinct format. One crucial problem in big data
analytics is the selection of representative instances from large and
massively-distributed data, which is formally known as the Column Subset
Selection (CSS) problem. The solution to this problem enables data analysts to
gain insight into the data and explore its hidden structure. The
selected instances can also be used for data preprocessing tasks such as
learning a low-dimensional embedding of the data points or computing a low-rank
approximation of the corresponding matrix. This paper presents a fast and
accurate greedy algorithm for large-scale column subset selection. The
algorithm minimizes an objective function which measures the reconstruction
error of the data matrix based on the subset of selected columns. The paper
first presents a centralized greedy algorithm for column subset selection which
depends on a novel recursive formula for calculating the reconstruction error
of the data matrix. The paper then presents a MapReduce algorithm which selects
a few representative columns from a matrix whose columns are massively
distributed across several commodity machines. The algorithm first learns a
concise representation of all columns using random projection, and it then
solves a generalized column subset selection problem at each machine in which a
subset of columns is selected from the sub-matrix on that machine such that
the reconstruction error of the concise representation is minimized. The paper
demonstrates the effectiveness and efficiency of the proposed algorithm through
an empirical evaluation on benchmark data sets.
Comment: Under consideration for publication in Knowledge and Information
Systems.
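A common greedy rule for this reconstruction objective (and, up to
implementation details, the centralized criterion used here) is: at each step
pick the column of the current residual E maximizing ||E^T E_j||^2 /
||E_j||^2, then project the residual away from the chosen column. The sketch
below recomputes everything naively; the paper's recursive error formula is
what removes this per-step cost.

    import numpy as np

    def greedy_css(A, k):
        E = A.astype(float).copy()   # residual after projecting out picks
        selected = []
        for _ in range(k):
            G = E.T @ E
            scores = (G ** 2).sum(axis=0) / np.maximum(np.diag(G), 1e-12)
            j = int(np.argmax(scores))
            selected.append(j)
            # Deflate: remove the chosen column's direction from E.
            v = E[:, j] / np.linalg.norm(E[:, j])
            E -= np.outer(v, v @ E)
        return selected

A low-rank approximation then follows by projecting A onto the span of the
selected columns, e.g. B = A[:, selected] and A_hat = B @ np.linalg.pinv(B) @ A.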
Self-Expressive Decompositions for Matrix Approximation and Clustering
Data-aware methods for dimensionality reduction and matrix decomposition aim
to find low-dimensional structure in a collection of data. Classical approaches
discover such structure by learning a basis that can efficiently express the
collection. Recently, "self expression", the idea of using a small subset of
data vectors to represent the full collection, has been developed as an
alternative to learning. Here, we introduce a scalable method for computing
sparse SElf-Expressive Decompositions (SEED). SEED is a greedy method that
constructs a basis by sequentially selecting incoherent vectors from the
dataset. After forming a basis from a subset of vectors in the dataset, SEED
then computes a sparse representation of the dataset with respect to this
basis. We develop sufficient conditions under which SEED exactly represents low
rank matrices and vectors sampled from a union of independent subspaces. We
show how SEED can be used in applications ranging from matrix approximation and
denoising to clustering, and apply it to numerous real-world datasets. Our
results demonstrate that SEED is an attractive low-complexity alternative to
other sparse matrix factorization approaches such as sparse PCA and
self-expressive methods for clustering.
Comment: 11 pages, 7 figures.
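The two stages can be sketched as follows; the fixed coherence threshold and
the use of plain least squares in place of a dedicated sparse solver are
simplifying assumptions for illustration, not the paper's exact procedure.

    import numpy as np

    def seed_decompose(X, k, max_coherence=0.5):
        # Stage 1: greedily collect columns that stay incoherent
        # with the basis built so far.
        Xn = X / np.maximum(np.linalg.norm(X, axis=0), 1e-12)
        basis = [0]
        for j in range(1, X.shape[1]):
            if len(basis) == k:
                break
            if np.max(np.abs(Xn[:, basis].T @ Xn[:, j])) < max_coherence:
                basis.append(j)
        # Stage 2: represent the full dataset in the selected basis.
        B = X[:, basis]
        C, *_ = np.linalg.lstsq(B, X, rcond=None)
        return basis, C      # X is approximated by B @ C

Swapping the least-squares step for a greedy sparse solver such as OMP would
yield genuinely sparse codes, which is the regime the paper's conditions
address.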
On landmark selection and sampling in high-dimensional data analysis
In recent years, the spectral analysis of appropriately defined kernel
matrices has emerged as a principled way to extract the low-dimensional
structure often prevalent in high-dimensional data. Here we provide an
introduction to spectral methods for linear and nonlinear dimension reduction,
emphasizing ways to overcome the computational limitations currently faced by
practitioners with massive datasets. In particular, a data subsampling or
landmark selection process is often employed to construct a kernel based on
partial information, followed by an approximate spectral analysis termed the
Nystrom extension. We provide a quantitative framework to analyse this
procedure, and use it to demonstrate algorithmic performance bounds on a range
of practical approaches designed to optimize the landmark selection process. We
compare the practical implications of these bounds by way of real-world
examples drawn from the field of computer vision, whereby low-dimensional
manifold structure is shown to emerge from high-dimensional video data streams.
Comment: 18 pages, 6 figures, submitted for publication.
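As a reference point for the procedure analyzed here, a minimal Nystrom
extension looks like this: evaluate the kernel only against m chosen
landmarks, then reconstruct the full n x n kernel at rank at most m. The RBF
kernel is just an illustrative choice.

    import numpy as np

    def rbf_kernel(A, B, gamma=1.0):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * d2)

    def nystrom_approximation(X, landmark_idx, gamma=1.0):
        # n x m block of the kernel: every point against the landmarks.
        C = rbf_kernel(X, X[landmark_idx], gamma)
        W = C[landmark_idx]               # m x m landmark-landmark block
        # Rank-m surrogate for the full n x n kernel matrix.
        return C @ np.linalg.pinv(W) @ C.T

The quality of this surrogate, and hence of any downstream spectral
embedding, is controlled by how the landmarks are selected, which is precisely
the quantity the performance bounds in this work optimize.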
Learning Generative Models of Similarity Matrices
We describe a probabilistic (generative) view of affinity matrices along with
inference algorithms for a subclass of problems associated with data
clustering. This probabilistic view is helpful in understanding different
models and algorithms that are based on affinity functions of the data. In
particular, we show how (greedy) inference for a specific probabilistic model
is equivalent to the spectral clustering algorithm. It also provides a
framework for developing new algorithms and extended models. As one case, we
present new generative data clustering models that allow us to infer the
underlying distance measure suitable for the clustering problem at hand. These
models seem to perform well on a larger class of problems for which other
clustering algorithms (including spectral clustering) usually fail.
Experimental evaluation was performed on a variety of point data sets, showing
excellent performance.
Comment: Appears in Proceedings of the Nineteenth Conference on Uncertainty in
Artificial Intelligence (UAI 2003).
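For readers who want the spectral clustering procedure that the greedy
inference is shown to recover, a standard normalized version is sketched below
(this is the textbook algorithm, not code from the paper).

    import numpy as np
    from sklearn.cluster import KMeans

    def spectral_clustering(W, k):
        # W: symmetric affinity matrix; k: number of clusters.
        d = W.sum(axis=1)
        L = np.eye(len(W)) - W / np.sqrt(np.outer(d, d))  # normalized Laplacian
        _, vecs = np.linalg.eigh(L)                       # ascending eigenvalues
        U = vecs[:, :k]                                   # k smallest eigenvectors
        U /= np.maximum(np.linalg.norm(U, axis=1, keepdims=True), 1e-12)
        return KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(U)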