Streaming Similarity Self-Join
We introduce and study the problem of computing the similarity self-join in a
streaming context (SSSJ), where the input is an unbounded stream of items
arriving continuously. The goal is to find all pairs of items in the stream
whose similarity is greater than a given threshold. The simplest formulation of
the problem requires unbounded memory, and thus, it is intractable. To make the
problem feasible, we introduce the notion of time-dependent similarity: the
similarity of two items decreases with the difference in their arrival time. By
leveraging the properties of this time-dependent similarity function, we design
two algorithmic frameworks to solve the SSSJ problem. The first one, MiniBatch
(MB), uses existing index-based filtering techniques for the static version of
the problem, and combines them in a pipeline. The second framework, Streaming
(STR), adds time filtering to the existing indexes, and integrates new
time-based bounds deeply in the working of the algorithms. We also introduce a
new indexing technique (L2), which is based on an existing state-of-the-art
indexing technique (L2AP), but is optimized for the streaming case. Extensive
experiments show that the STR algorithm, when instantiated with the L2 index,
is the most scalable option across a wide array of datasets and parameters.
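The time-dependent similarity described above can be illustrated with a minimal sketch. It assumes an exponential decay of cosine similarity with the arrival-time gap; the abstract only states that similarity decreases with that gap, so the decay form, the parameter names `theta` and `lam`, and the brute-force window scan are all assumptions made for illustration. This is not the MB or STR framework and uses no index.

```python
import math
from collections import deque


def cosine(x, y):
    """Cosine similarity of two sparse vectors stored as {dimension: weight} dicts."""
    dot = sum(w * y.get(d, 0.0) for d, w in x.items())
    norm_x = math.sqrt(sum(w * w for w in x.values()))
    norm_y = math.sqrt(sum(w * w for w in y.values()))
    return dot / (norm_x * norm_y) if norm_x and norm_y else 0.0


def streaming_self_join(stream, theta, lam):
    """Yield (older_id, newer_id, score) for pairs whose decayed similarity exceeds theta.

    stream: iterable of (item_id, timestamp, vector) in non-decreasing timestamp order.
    Assumed decay (hypothetical choice): score = cosine * exp(-lam * time_gap),
    with 0 < theta < 1 and lam > 0.
    """
    # Cosine is at most 1, so any pair further apart than this gap scores below theta.
    horizon = math.log(1.0 / theta) / lam
    window = deque()  # items still young enough to match a new arrival
    for item_id, t, vec in stream:
        # Forget items that can no longer reach the threshold: memory stays bounded.
        while window and t - window[0][1] > horizon:
            window.popleft()
        for old_id, old_t, old_vec in window:
            score = cosine(vec, old_vec) * math.exp(-lam * (t - old_t))
            if score > theta:
                yield old_id, item_id, score
        window.append((item_id, t, vec))
```

Under this assumed decay, the threshold and the decay rate together imply a finite time horizon, which is what lets old items be discarded.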
Absorbing random-walk centrality: Theory and algorithms
We study a new notion of graph centrality based on absorbing random walks.
Given a graph and a set of query nodes, we aim to
identify the most central nodes in the graph with respect to the query nodes. Specifically,
we consider central nodes to be absorbing for random walks that start at the
query nodes. The goal is to find the set of central nodes that
minimizes the expected length of a random walk until absorption. The proposed
measure, which we call absorbing random-walk centrality, favors diverse
sets, as it is beneficial to place the absorbing nodes in different parts
of the graph so as to "intercept" random walks that start from different query
nodes.
Although similar problem definitions have been considered in the literature,
e.g., in information-retrieval settings where the goal is to diversify
web-search results, in this paper we study the problem formally and prove some
of its properties. We show that the problem is NP-hard, while the objective
function is monotone and supermodular, implying that a greedy algorithm
provides solutions with an approximation guarantee. On the other hand, the
greedy algorithm involves expensive matrix operations that make it prohibitive
to employ on large datasets. To confront this challenge, we develop more
efficient algorithms based on spectral clustering and on personalized PageRank.
Comment: 11 pages, 11 figures, short paper to appear at ICDM 201
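The objective and the greedy strategy mentioned in this abstract can be sketched as follows. The sketch assumes a simple random walk on a connected, undirected graph and computes expected absorption times by fixed-point iteration; the function names, the iteration scheme, and the absence of any restart mechanism are simplifying assumptions, not the paper's exact formulation.

```python
def expected_absorption_time(graph, absorbing, queries, tol=1e-12, max_iter=100_000):
    """Average expected number of steps until a walk from a query node is absorbed.

    graph: {node: [neighbors]} for a connected undirected graph (assumption).
    absorbing: set of absorbing nodes; queries: list of start nodes.
    """
    h = {v: 0.0 for v in graph}
    for _ in range(max_iter):
        new_h = {}
        delta = 0.0
        for v, nbrs in graph.items():
            if v in absorbing:
                new_h[v] = 0.0  # already absorbed
            else:
                # One step, then continue from a uniformly chosen neighbor.
                new_h[v] = 1.0 + sum(h[u] for u in nbrs) / len(nbrs)
            delta = max(delta, abs(new_h[v] - h[v]))
        h = new_h
        if delta < tol:
            break
    return sum(h[q] for q in queries) / len(queries)


def greedy_absorbing_centrality(graph, queries, k):
    """Greedily pick k absorbing nodes, each time minimizing the objective."""
    chosen = []
    for _ in range(k):
        candidates = [v for v in graph if v not in chosen]
        best = min(
            candidates,
            key=lambda v: expected_absorption_time(graph, set(chosen) | {v}, queries),
        )
        chosen.append(best)
    return chosen


# Path graph 0-1-2-3-4 with query nodes at both ends.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
print(greedy_absorbing_centrality(path, [0, 4], 1))
```

Each greedy step re-evaluates the objective for every candidate node, which hints at why the abstract calls the greedy approach expensive on large graphs.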
Community-aware network sparsification
Network sparsification aims to reduce the number of edges of a network while maintaining its structural properties; such properties include shortest paths, cuts, spectral measures, or network modularity. Sparsification has multiple applications, such as speeding up graph-mining algorithms, visualizing graphs, and identifying the important network edges.
In this paper we consider a novel formulation of the network-sparsification problem. In addition to the network, we also consider as input a set of communities. The goal is to sparsify the network so as to preserve the network structure with respect to the given communities. We introduce two variants of the community-aware sparsification problem, leading to sparsifiers that satisfy different community-connectedness properties. From the technical point of view, we prove hardness results and devise effective approximation algorithms. Our experimental results on a large collection of datasets demonstrate the effectiveness of our algorithms.
https://epubs.siam.org/doi/10.1137/1.9781611974973.48
Accepted manuscript
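One way to picture a connectedness-preserving sparsifier is the heuristic below. It keeps, for each community, a spanning forest of the subgraph the community induces, reusing edges already kept for earlier communities. This is an illustrative baseline under assumed inputs (an undirected edge list with each edge listed once, and communities given as node lists); it is not one of the paper's two problem variants or its approximation algorithms.

```python
def find(parent, v):
    """Union-find root lookup with path halving."""
    while parent[v] != v:
        parent[v] = parent[parent[v]]
        v = parent[v]
    return v


def sparsify_for_communities(edges, communities):
    """Return a subset of edges that keeps each community as connected as it was.

    edges: list of (u, v) tuples, undirected, each edge listed once (assumption).
    communities: list of node collections, possibly overlapping.
    """
    kept, kept_set = [], set()
    for community in communities:
        members = set(community)
        parent = {v: v for v in members}
        inside = [(u, v) for u, v in edges if u in members and v in members]
        # First merge along edges already kept for earlier communities.
        for u, v in inside:
            if (u, v) in kept_set:
                parent[find(parent, u)] = find(parent, v)
        # Then add only the edges that join two still-separate components.
        for u, v in inside:
            root_u, root_v = find(parent, u), find(parent, v)
            if root_u != root_v:
                parent[root_u] = root_v
                kept_set.add((u, v))
                kept.append((u, v))
    return kept
```

For two triangles sharing a node, this heuristic keeps two edges per triangle and drops the third, since each community stays connected without it.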
Injecting Uncertainty in Graphs for Identity Obfuscation
Data collected nowadays by social-networking applications create fascinating
opportunities for building novel services, as well as expanding our
understanding about social structures and their dynamics. Unfortunately,
publishing social-network graphs is considered an ill-advised practice due to
privacy concerns. To alleviate this problem, several anonymization methods have
been proposed, aiming at reducing the risk of a privacy breach on the published
data, while still allowing analysts to study it and draw relevant conclusions. In
this paper we introduce a new anonymization approach that is based on injecting
uncertainty in social graphs and publishing the resulting uncertain graphs.
While existing approaches obfuscate graph data by adding or removing edges
entirely, we propose using a finer-grained perturbation that adds or removes
edges partially: this way we can achieve the same desired level of obfuscation
with smaller changes in the data, thus maintaining higher utility. Our
experiments on real-world networks confirm that at the same level of identity
obfuscation our method provides higher usefulness than existing randomized
methods that publish standard graphs.
Comment: VLDB201
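The idea of adding or removing edges partially can be sketched by attaching a probability to each node pair. In the sketch below, an existing edge keeps probability 1 - r and a sampled non-edge receives probability r, with r drawn uniformly from [0, max_noise]. The uniform noise, the random choice of non-edges, and all parameter names are assumptions for illustration; the abstract does not specify how the perturbation is drawn or how the obfuscation level is calibrated.

```python
import random
from itertools import combinations


def inject_uncertainty(nodes, edges, max_noise, extra_pairs, seed=0):
    """Return {(u, v): probability} describing an uncertain graph.

    nodes: sortable node ids; edges: undirected (u, v) pairs.
    max_noise: largest perturbation applied to any pair (hypothetical knob).
    extra_pairs: number of non-edges that receive a small positive probability.
    """
    rng = random.Random(seed)
    edge_set = {tuple(sorted(e)) for e in edges}
    non_edges = [p for p in combinations(sorted(nodes), 2) if p not in edge_set]
    sampled = rng.sample(non_edges, min(extra_pairs, len(non_edges)))

    uncertain = {}
    for e in sorted(edge_set):
        # Partially remove an existing edge.
        uncertain[e] = 1.0 - rng.uniform(0.0, max_noise)
    for p in sampled:
        # Partially add an edge that was absent.
        uncertain[p] = rng.uniform(0.0, max_noise)
    return uncertain
```

A published uncertain graph of this kind stays close to the original when `max_noise` is small, which matches the intuition that smaller changes preserve more utility.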
Maximizing the Diversity of Exposure in a Social Network
Social-media platforms have created new ways for citizens to stay informed
and participate in public debates. However, to enable a healthy environment for
information sharing, social deliberation, and opinion formation, citizens need
to be exposed to sufficiently diverse viewpoints that challenge their
assumptions, instead of being trapped inside filter bubbles. In this paper, we
take a step in this direction and propose a novel approach to maximize the
diversity of exposure in a social network. We formulate the problem in the
context of information propagation, as a task of recommending a small number of
news articles to selected users. We propose a realistic setting where we take
into account content and user leanings, and the probability of further sharing
an article. This setting allows us to capture the balance between maximizing
the spread of information and ensuring the exposure of users to diverse
viewpoints.
The resulting problem can be cast as maximizing a monotone and submodular
function subject to a matroid constraint on the allocation of articles to
users. It is a challenging generalization of the influence maximization
problem. Yet, we are able to devise scalable approximation algorithms by
introducing a novel extension to the notion of random reverse-reachable sets.
We experimentally demonstrate the efficiency and scalability of our algorithm
on several real-world datasets.
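For a monotone submodular objective under a matroid constraint, the classical greedy algorithm is known to achieve a 1/2 approximation. The sketch below shows that greedy for a partition matroid, assuming each user may receive at most `capacity` articles. The coverage-style objective (distinct reader and viewpoint pairs reached through one deterministic hop) is a toy stand-in, not the paper's diversity function, and the sketch does not use reverse-reachable sets.

```python
def greedy_partition_matroid(candidates, part_of, capacity, gain):
    """Greedy selection under a partition matroid.

    candidates: list of elements; part_of(c): the part an element belongs to.
    capacity: maximum number of chosen elements per part (assumed constraint).
    gain(chosen, c): marginal gain of adding c to the chosen list.
    """
    chosen, used = [], {}
    remaining = list(candidates)
    while remaining:
        feasible = [c for c in remaining if used.get(part_of(c), 0) < capacity]
        if not feasible:
            break
        best = max(feasible, key=lambda c: gain(chosen, c))
        if gain(chosen, best) <= 0:
            break  # nothing left improves the objective
        chosen.append(best)
        used[part_of(best)] = used.get(part_of(best), 0) + 1
        remaining.remove(best)
    return chosen


def make_coverage_gain(followers, viewpoint):
    """Toy objective: number of distinct (reader, viewpoint) pairs exposed."""
    def covered(assignments):
        pairs = set()
        for user, article in assignments:
            for reader in [user] + followers.get(user, []):
                pairs.add((reader, viewpoint[article]))
        return pairs

    def gain(chosen, candidate):
        return len(covered(chosen + [candidate])) - len(covered(chosen))

    return gain


# Hypothetical data: who follows whom, and each article's leaning.
followers = {"u1": ["u2", "u3"], "u2": ["u3"], "u3": []}
viewpoint = {"a": "left", "b": "right"}
candidates = [(u, art) for u in ["u1", "u2", "u3"] for art in ["a", "b"]]
picked = greedy_partition_matroid(
    candidates, lambda c: c[0], 1, make_coverage_gain(followers, viewpoint)
)
print(picked)
```

Here each candidate is a (user, article) pair and the part of a candidate is its user, so the capacity limits how many articles one user is recommended.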
A Motif-based Approach for Identifying Controversy
Among the topics discussed in social media, some lead to controversy. A
number of recent studies have focused on the problem of identifying controversy
in social media, mostly by analyzing textual content or by relying on
global network structure. Such approaches have strong limitations due to the
difficulty of understanding natural language, and of investigating the global
network structure. In this work we show that it is possible to detect
controversy in social media by exploiting network motifs, i.e., local patterns
of user interaction. The proposed approach allows for a language-independent,
fine-grained, and efficient-to-compute analysis of user discussions and
their evolution over time. The supervised model exploiting motif patterns can
achieve 85% accuracy, with an improvement of 7% compared to baseline
structural, propagation-based and temporal network features.
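As a small illustration of what a local pattern of user interaction can look like, the sketch below counts two dyadic motifs in a directed interaction graph: mutual pairs (both users address each other) and asymmetric pairs (only one direction). The choice of these two motifs and the input format are assumptions; the abstract does not list the motifs or features the supervised model actually uses.

```python
from collections import Counter


def dyad_motif_counts(interactions):
    """Count mutual and asymmetric dyads among directed user interactions.

    interactions: iterable of (source_user, target_user) pairs, e.g. replies.
    Repeated interactions and self-loops are ignored in this sketch.
    """
    directed = {(u, v) for u, v in interactions if u != v}
    counts = Counter()
    for u, v in directed:
        if (v, u) in directed:
            if u < v:  # count each mutual pair once
                counts["mutual"] += 1
        else:
            counts["asymmetric"] += 1
    return counts
```

Counts like these depend only on who interacts with whom, which is why motif-based features need no language processing.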