Streaming Similarity Self-Join
We introduce and study the problem of computing the similarity self-join in a
streaming context (SSSJ), where the input is an unbounded stream of items
arriving continuously. The goal is to find all pairs of items in the stream
whose similarity is greater than a given threshold. The simplest formulation of
the problem requires unbounded memory, and thus, it is intractable. To make the
problem feasible, we introduce the notion of time-dependent similarity: the
similarity of two items decreases with the difference in their arrival time. By
leveraging the properties of this time-dependent similarity function, we design
two algorithmic frameworks to solve the SSSJ problem. The first one, MiniBatch
(MB), uses existing index-based filtering techniques for the static version of
the problem, and combines them in a pipeline. The second framework, Streaming
(STR), adds time filtering to the existing indexes, and integrates new
time-based bounds deeply in the working of the algorithms. We also introduce a
new indexing technique (L2), which is based on an existing state-of-the-art
indexing technique (L2AP), but is optimized for the streaming case. Extensive
experiments show that the STR algorithm, when instantiated with the L2 index,
is the most scalable option across a wide array of datasets and parameters.
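
To illustrate why time-dependent similarity makes the problem feasible, here is a minimal brute-force sketch. It is not the MB or STR framework and uses no index. It assumes, beyond what the abstract states, that similarity is a dot product of unit-normalized sparse vectors multiplied by an exponential decay exp(-lam * dt), and that items arrive in non-decreasing timestamp order. The names `time_horizon`, `streaming_self_join`, `theta`, and `lam` are hypothetical.

```python
import math
from collections import deque


def time_horizon(theta, lam):
    """Arrival-time gap beyond which no pair can exceed theta.

    Assumes unit-norm vectors (dot product <= 1) and decay exp(-lam * dt).
    """
    return math.log(1.0 / theta) / lam


def dot(x, y):
    """Dot product of two sparse vectors stored as dicts."""
    if len(x) > len(y):
        x, y = y, x
    return sum(w * y.get(k, 0.0) for k, w in x.items())


def streaming_self_join(stream, theta, lam):
    """Yield (earlier_id, later_id, sim) for pairs with decayed similarity > theta.

    stream yields (item_id, timestamp, vector) tuples in timestamp order.
    """
    tau = time_horizon(theta, lam)
    window = deque()  # items that are still recent enough to match
    for item_id, t, vec in stream:
        # Evict items that are too old to match anything arriving from now on.
        while window and t - window[0][1] > tau:
            window.popleft()
        for other_id, other_t, other_vec in window:
            sim = dot(vec, other_vec) * math.exp(-lam * (t - other_t))
            if sim > theta:
                yield other_id, item_id, sim
        window.append((item_id, t, vec))
```

Under these assumptions the window holds only the items that arrived within the time horizon, so memory depends on the arrival rate rather than on the total length of the stream.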
Absorbing random-walk centrality: Theory and algorithms
We study a new notion of graph centrality based on absorbing random walks.
Given a graph G = (V, E) and a set of query nodes Q ⊆ V, we aim to
identify the k most central nodes in G with respect to Q. Specifically,
we consider central nodes to be absorbing for random walks that start at the
query nodes Q. The goal is to find the set of k central nodes that
minimizes the expected length of a random walk until absorption. The proposed
measure, which we call absorbing random-walk centrality, favors diverse
sets, as it is beneficial to place the absorbing nodes in different parts
of the graph so as to "intercept" random walks that start from different query
nodes.
Although similar problem definitions have been considered in the literature,
e.g., in information-retrieval settings where the goal is to diversify
web-search results, in this paper we study the problem formally and prove some
of its properties. We show that the problem is NP-hard, while the objective
function is monotone and supermodular, implying that a greedy algorithm
provides solutions with an approximation guarantee. On the other hand, the
greedy algorithm involves expensive matrix operations that make it prohibitive
to employ on large datasets. To confront this challenge, we develop more
efficient algorithms based on spectral clustering and on personalized PageRank.
Comment: 11 pages, 11 figures, short paper to appear at ICDM 201
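
The objective and the greedy strategy described in this abstract can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation. It assumes a simple random walk on a connected undirected graph, a walk that starts at a uniformly random query node, and an iterative solver in place of the matrix operations the abstract mentions. The function names and the `adj`, `query`, and `k` parameters are hypothetical.

```python
def expected_absorption_steps(adj, absorbing, query, sweeps=10_000, tol=1e-12):
    """Mean number of steps until a walk from a random query node is absorbed.

    adj maps each node to a list of neighbours; absorbing must be non-empty.
    Solves h(v) = 1 + average of h over neighbours, with h = 0 on absorbing
    nodes, by repeated in-place sweeps (Gauss-Seidel iteration).
    """
    h = {v: 0.0 for v in adj}
    for _ in range(sweeps):
        delta = 0.0
        for v in adj:
            if v in absorbing:
                continue
            new = 1.0 + sum(h[u] for u in adj[v]) / len(adj[v])
            delta = max(delta, abs(new - h[v]))
            h[v] = new
        if delta < tol:
            break
    return sum(h[q] for q in query) / len(query)


def greedy_central_nodes(adj, query, k):
    """Greedily pick k absorbing nodes, each minimizing the objective so far."""
    chosen = set()
    for _ in range(k):
        best = min(
            (v for v in adj if v not in chosen),
            key=lambda v: expected_absorption_steps(adj, chosen | {v}, query),
        )
        chosen.add(best)
    return chosen
```

Each greedy step re-solves the system once per candidate node, which shows on a small scale why the abstract calls the greedy algorithm prohibitive on large datasets.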
Community-aware network sparsification
Network sparsification aims to reduce the number of edges of a network while maintaining its structural properties; such properties include shortest paths, cuts, spectral measures, or network modularity. Sparsification has multiple applications, such as speeding up graph-mining algorithms, graph visualization, and identifying the important network edges.
In this paper we consider a novel formulation of the network-sparsification problem. In addition to the network, we also consider as input a set of communities. The goal is to sparsify the network so as to preserve the network structure with respect to the given communities. We introduce two variants of the community-aware sparsification problem, leading to sparsifiers that satisfy different community-connectedness properties. From the technical point of view, we prove hardness results and devise effective approximation algorithms. Our experimental results on a large collection of datasets demonstrate the effectiveness of our algorithms.
https://epubs.siam.org/doi/10.1137/1.9781611974973.48
Accepted manuscrip
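
To make the problem input and output concrete, the sketch below implements a naive baseline, not either variant from the paper and not one of its approximation algorithms. It assumes the property to preserve is that each community stays as connected as it was in the original network, and it keeps a spanning forest of every community's induced subgraph. It makes no attempt to minimize the number of kept edges. The function name and argument layout are hypothetical.

```python
from collections import deque


def community_spanning_sparsifier(edges, communities):
    """Return a subset of edges that keeps each community's components intact.

    edges is an iterable of undirected (u, v) pairs; communities is a list of
    node collections. Naive baseline: the union of per-community BFS forests.
    """
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    kept = set()
    for community in communities:
        members = set(community)
        seen = set()
        for root in community:
            if root in seen:
                continue
            seen.add(root)
            queue = deque([root])
            while queue:
                u = queue.popleft()
                # Only follow edges that stay inside the community.
                for v in sorted(adj.get(u, ())):
                    if v in members and v not in seen:
                        seen.add(v)
                        kept.add(tuple(sorted((u, v))))
                        queue.append(v)
    return kept
```

Because overlapping communities could share edges, a method that chooses edges jointly across communities could keep fewer edges than this per-community baseline.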
A Motif-based Approach for Identifying Controversy
Among the topics discussed in social media, some lead to controversy. A
number of recent studies have focused on the problem of identifying controversy
in social media, relying mostly on the analysis of textual content or on
global network structure. Such approaches have strong limitations due to the
difficulty of understanding natural language, and of investigating the global
network structure. In this work we show that it is possible to detect
controversy in social media by exploiting network motifs, i.e., local patterns
of user interaction. The proposed approach allows for a language-independent,
fine-grained, and efficient-to-compute analysis of user discussions and
their evolution over time. The supervised model exploiting motif patterns can
achieve 85% accuracy, with an improvement of 7% compared to baseline
structural, propagation-based and temporal network features.
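
As a rough illustration of what local interaction patterns can look like as features, the sketch below counts three simple patterns in a directed interaction graph: mutual dyads, one-way dyads, and triangles with direction ignored. These particular motifs, the function name, and the input format are assumptions chosen for illustration. The abstract does not specify the motif set or the supervised model.

```python
from itertools import combinations


def motif_counts(interactions):
    """Count simple local patterns in a list of directed (source, target) pairs."""
    directed = {(u, v) for u, v in interactions if u != v}

    # Undirected neighbourhoods, used for direction-agnostic triangles.
    und = {}
    for u, v in directed:
        und.setdefault(u, set()).add(v)
        und.setdefault(v, set()).add(u)

    # Each mutual dyad appears twice in `directed`, once per direction.
    mutual = sum(1 for u, v in directed if (v, u) in directed) // 2
    one_way = len(directed) - 2 * mutual

    # Each triangle is found once from each of its three corners.
    triangles = 0
    for u, nbrs in und.items():
        for v, w in combinations(nbrs, 2):
            if w in und[v]:
                triangles += 1
    triangles //= 3

    return {"mutual_dyads": mutual, "one_way_dyads": one_way, "triangles": triangles}
```

Counts like these depend only on who interacts with whom, so they can be computed without inspecting the text of the posts.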