Streaming Similarity Self-Join
We introduce and study the problem of computing the similarity self-join in a
streaming context (SSSJ), where the input is an unbounded stream of items
arriving continuously. The goal is to find all pairs of items in the stream
whose similarity is greater than a given threshold. The simplest formulation of
the problem requires unbounded memory and is thus intractable. To make the
problem feasible, we introduce the notion of time-dependent similarity: the
similarity of two items decreases with the difference in their arrival time. By
leveraging the properties of this time-dependent similarity function, we design
two algorithmic frameworks to solve the SSSJ problem. The first one, MiniBatch
(MB), uses existing index-based filtering techniques for the static version of
the problem, and combines them in a pipeline. The second framework, Streaming
(STR), adds time filtering to the existing indexes, and integrates new
time-based bounds deep into the workings of the algorithms. We also introduce a
new indexing technique (L2), which is based on an existing state-of-the-art
indexing technique (L2AP), but is optimized for the streaming case. Extensive
experiments show that the STR algorithm, when instantiated with the L2 index,
is the most scalable option across a wide array of datasets and parameters.
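The abstract does not fix a particular decay function; the following is a minimal sketch of the time-filtering idea, assuming an exponential decay applied to cosine similarity and a naive window in place of the MB/STR indexes (all function names and parameters here are illustrative, not from the paper):

```python
import math

def cosine(u, v):
    """Cosine similarity of two sparse vectors given as {dim: weight} dicts."""
    dot = sum(w * v.get(d, 0.0) for d, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def time_dependent_sim(u, v, tu, tv, decay=0.1):
    """Similarity decays with the arrival-time gap; the exponential form is an
    assumption for this sketch -- the paper only requires a decreasing function."""
    return cosine(u, v) * math.exp(-decay * abs(tu - tv))

def stream_self_join(stream, threshold, decay=0.1):
    """Naive windowed baseline: since cosine <= 1, an item older than
    ln(1/threshold)/decay time units can never pass the threshold again
    and is evicted -- the time filtering that bounds memory."""
    horizon = math.log(1.0 / threshold) / decay
    window = []  # (arrival_time, vector) pairs still within the horizon
    for t, item in stream:
        window = [(s, x) for (s, x) in window if t - s <= horizon]
        for s, x in window:
            sim = time_dependent_sim(item, x, t, s, decay)
            if sim >= threshold:
                yield (s, t, sim)
        window.append((t, item))
```

The eviction rule is the key point: the decaying similarity gives every item a finite useful lifetime, which is what makes the otherwise unbounded-memory problem feasible.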
Absorbing random-walk centrality: Theory and algorithms
We study a new notion of graph centrality based on absorbing random walks.
Given a graph $G$ and a set of query nodes $Q$, we aim to identify the $k$
most central nodes in $G$ with respect to $Q$. Specifically, we consider
central nodes to be absorbing for random walks that start at the query
nodes $Q$. The goal is to find the set of $k$ central nodes that
minimizes the expected length of a random walk until absorption. The proposed
measure, which we call absorbing random-walk centrality, favors diverse
sets, as it is beneficial to place the absorbing nodes in different parts
of the graph so as to "intercept" random walks that start from different query
nodes.
Although similar problem definitions have been considered in the literature,
e.g., in information-retrieval settings where the goal is to diversify
web-search results, in this paper we study the problem formally and prove some
of its properties. We show that the problem is NP-hard, while the objective
function is monotone and supermodular, implying that a greedy algorithm
provides solutions with an approximation guarantee. On the other hand, the
greedy algorithm involves expensive matrix operations that make it prohibitive
to employ on large datasets. To confront this challenge, we develop more
efficient algorithms based on spectral clustering and on personalized PageRank.
Comment: 11 pages, 11 figures, short paper to appear at ICDM 2015
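To make the cost of the greedy algorithm concrete, below is a minimal sketch that computes the expected absorption time through the fundamental matrix $N = (I - Q)^{-1}$ of the absorbing chain and runs the greedy loop over it. Every candidate evaluation inverts a matrix over the transient nodes, which is exactly the expense the paper's faster algorithms avoid; all names are illustrative, and this is not the spectral-clustering or PageRank-based method:

```python
import numpy as np

def expected_absorption_time(P, absorbing, query):
    """Expected steps until walks started at the query nodes are absorbed by
    `absorbing`, via the fundamental matrix of the chain. P is row-stochastic;
    query nodes already in the absorbing set contribute zero."""
    n = P.shape[0]
    transient = [v for v in range(n) if v not in absorbing]
    idx = {v: i for i, v in enumerate(transient)}
    Q = P[np.ix_(transient, transient)]           # walk restricted to transient nodes
    N = np.linalg.inv(np.eye(len(transient)) - Q) # fundamental matrix
    steps = N.sum(axis=1)                         # expected steps to absorption
    return sum(steps[idx[q]] for q in query if q in idx)

def greedy_centrality(P, query, k):
    """Greedy selection; the monotonicity and supermodularity shown in the
    paper are what give this scheme its approximation guarantee."""
    chosen = set()
    for _ in range(k):
        best = min((v for v in range(P.shape[0]) if v not in chosen),
                   key=lambda v: expected_absorption_time(P, chosen | {v}, query))
        chosen.add(best)
    return chosen
```

Each greedy step evaluates $O(n)$ candidates, and each evaluation costs a cubic-time inverse, which is why this baseline is prohibitive on large graphs.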
Community-aware network sparsification
Network sparsification aims to reduce the number of edges of a network while maintaining its structural properties; such properties include shortest paths, cuts, spectral measures, or network modularity. Sparsification has multiple applications, such as speeding up graph-mining algorithms, graph visualization, and identifying the important network edges.
In this paper we consider a novel formulation of the network-sparsification problem. In addition to the network, we also consider as input a set of communities. The goal is to sparsify the network so as to preserve the network structure with respect to the given communities. We introduce two variants of the community-aware sparsification problem, leading to sparsifiers that satisfy different community-connectedness properties. From the technical point of view, we prove hardness results and devise effective approximation algorithms. Our experimental results on a large collection of datasets demonstrate the effectiveness of our algorithms.
https://epubs.siam.org/doi/10.1137/1.9781611974973.48
Accepted manuscript
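As a rough illustration of the connectedness requirement, the sketch below keeps a BFS spanning forest inside each community, so every community that was connected stays connected while all other edges drop out. This is an assumed baseline heuristic for intuition only, not one of the approximation algorithms developed in the paper:

```python
from collections import deque

def community_sparsify(edges, communities):
    """Keep, for each community, the edges of a BFS spanning forest of the
    subgraph it induces. `edges` is a list of pairs; `communities` is a list
    of node lists; node ids are assumed comparable (for edge normalization)."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    kept = set()
    for comm in communities:
        members, seen = set(comm), set()
        for root in comm:
            if root in seen:
                continue
            seen.add(root)
            queue = deque([root])
            while queue:  # BFS restricted to the community's induced subgraph
                u = queue.popleft()
                for w in adj.get(u, ()):
                    if w in members and w not in seen:
                        seen.add(w)
                        kept.add((min(u, w), max(u, w)))
                        queue.append(w)
    return kept
```

A spanning forest uses the fewest edges that can preserve connectivity within each community, which is the flavor of guarantee the connectedness variants ask for; the paper's algorithms additionally handle overlapping communities and optimize the total number of kept edges.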
Diameter Minimization by Shortcutting with Degree Constraints
We consider the problem of adding a fixed number of new edges to an
undirected graph in order to minimize the diameter of the augmented graph, and
under the constraint that the number of edges added for each vertex is bounded
by an integer. The problem is motivated by network-design applications, where
we want to minimize the worst case communication in the network without
excessively increasing the degree of any single vertex, so as to avoid
additional overload. We present three algorithms for this task, each with its
own merits. The special case of a matching augmentation, when every vertex can
be incident to at most one new edge, is of particular interest, for which we
show an inapproximability result, and provide bounds on the smallest achievable
diameter when these edges are added to a path. Finally, we empirically evaluate
and compare our algorithms on several real-life networks of varying types.
Comment: A shorter version of this work has been accepted at the IEEE ICDM
2022 conference
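For intuition about the problem, here is a brute-force greedy sketch: repeatedly add the shortcut that most reduces the diameter while respecting a per-vertex budget on new edges. It is an illustrative baseline that re-solves the diameter for every candidate pair, not one of the paper's three algorithms, and all names are assumptions of this sketch:

```python
from collections import deque
from itertools import combinations

def eccentricity(adj, src):
    """Largest BFS distance from src; assumes a connected graph."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return max(dist.values())

def diameter(adj):
    return max(eccentricity(adj, v) for v in adj)

def greedy_shortcuts(adj, k, budget):
    """Add up to k shortcut edges to `adj` (dict: node -> set of neighbors),
    never letting any vertex gain more than `budget` new edges."""
    used = {v: 0 for v in adj}
    added = []
    for _ in range(k):
        best, best_d = None, diameter(adj)
        for u, v in combinations(adj, 2):
            if v in adj[u] or used[u] >= budget or used[v] >= budget:
                continue
            adj[u].add(v); adj[v].add(u)      # tentatively add the shortcut
            d = diameter(adj)
            adj[u].remove(v); adj[v].remove(u)
            if d < best_d:
                best, best_d = (u, v), d
        if best is None:
            break
        u, v = best
        adj[u].add(v); adj[v].add(u)
        used[u] += 1; used[v] += 1
        added.append(best)
    return added
```

With budget = 1 this enforces the matching-augmentation special case discussed in the abstract; the quadratic sweep over candidate pairs is what the paper's algorithms are designed to avoid.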