91 research outputs found
EsPRESSo: Efficient Privacy-Preserving Evaluation of Sample Set Similarity
Electronic information is increasingly often shared among entities without
complete mutual trust. To address related security and privacy issues, a few
cryptographic techniques have emerged that support privacy-preserving
information sharing and retrieval. One interesting open problem in this context
involves two parties that need to assess the similarity of their datasets, but
are reluctant to disclose their actual content. This paper presents an
efficient and provably-secure construction supporting the privacy-preserving
evaluation of sample set similarity, where similarity is measured as the
Jaccard index. We present two protocols: the first securely computes the
(Jaccard) similarity of two sets, and the second approximates it, using MinHash
techniques, with lower complexities. We show that our novel protocols are
attractive in many compelling applications, including document/multimedia
similarity, biometric authentication, and genetic tests. In the process, we
demonstrate that our constructions are appreciably more efficient than prior
work.Comment: A preliminary version of this paper was published in the Proceedings
of the 7th ESORICS International Workshop on Digital Privacy Management (DPM
2012). This is the full version, appearing in the Journal of Computer
Securit
LASAGNE: Locality And Structure Aware Graph Node Embedding
In this work we propose Lasagne, a methodology to learn locality and
structure aware graph node embeddings in an unsupervised way. In particular, we
show that the performance of existing random-walk based approaches depends
strongly on the structural properties of the graph, e.g., the size of the
graph, whether the graph has a flat or upward-sloping Network Community Profile
(NCP), whether the graph is expander-like, whether the classes of interest are
more k-core-like or more peripheral, etc. For larger graphs with flat NCPs that
are strongly expander-like, existing methods lead to random walks that expand
rapidly, touching many dissimilar nodes, thereby leading to lower-quality
vector representations that are less useful for downstream tasks. Rather than
relying on global random walks or neighbors within fixed hop distances, Lasagne
exploits strongly local Approximate Personalized PageRank stationary
distributions to more precisely engineer local information into node
embeddings. This leads, in particular, to more meaningful and more useful
vector representations of nodes in poorly-structured graphs. We show that
Lasagne leads to significant improvement in downstream multi-label
classification for larger graphs with flat NCPs, that it is comparable for
smaller graphs with upward-sloping NCPs, and that is comparable to existing
methods for link prediction tasks
Mining Marked Nodes in Large Graphs
abstract: With the rise of the Big Data Era, an exponential amount of network data is being generated at an unprecedented rate across a wide-range of high impact micro and macro areas of research---from protein interaction to social networks. The critical challenge is translating this large scale network data into actionable information.
A key task in the data translation is the analysis of network connectivity via marked nodes---the primary focus of our research. We have developed a framework for analyzing network connectivity via marked nodes in large scale graphs, utilizing novel algorithms in three interrelated areas: (1) analysis of a single seed node via it’s ego-centric network (AttriPart algorithm); (2) pathway identification between two seed nodes (K-Simple Shortest Paths Multithreaded and Search Reduced (KSSPR) algorithm); and (3) tree detection, defining the interaction between three or more seed nodes (Shortest Path MST algorithm).
In an effort to address both fundamental and applied research issues, we have developed the LocalForcasting algorithm to explore how network connectivity analysis can be applied to local community evolution and recommender systems. The goal is to apply the LocalForecasting algorithm to various domains---e.g., friend suggestions in social networks or future collaboration in co-authorship networks. This algorithm utilizes link prediction in combination with the AttriPart algorithm to predict future connections in local graph partitions.
Results show that our proposed AttriPart algorithm finds up to 1.6x denser local partitions, while running approximately 43x faster than traditional local partitioning techniques (PageRank-Nibble). In addition, our LocalForecasting algorithm demonstrates a significant improvement in the number of nodes and edges correctly predicted over baseline methods. Furthermore, results for the KSSPR algorithm demonstrate a speed-up of up to 2.5x the standard k-simple shortest paths algorithm.Dissertation/ThesisMasters Thesis Computer Science 201
Communication-Efficient Jaccard Similarity for High-Performance Distributed Genome Comparisons
The Jaccard similarity index is an important measure of the overlap of two
sets, widely used in machine learning, computational genomics, information
retrieval, and many other areas. We design and implement SimilarityAtScale, the
first communication-efficient distributed algorithm for computing the Jaccard
similarity among pairs of large datasets. Our algorithm provides an efficient
encoding of this problem into a multiplication of sparse matrices. Both the
encoding and sparse matrix product are performed in a way that minimizes data
movement in terms of communication and synchronization costs. We apply our
algorithm to obtain similarity among all pairs of a set of large samples of
genomes. This task is a key part of modern metagenomics analysis and an
evergrowing need due to the increasing availability of high-throughput DNA
sequencing data. The resulting scheme is the first to enable accurate Jaccard
distance derivations for massive datasets, using largescale distributed-memory
systems. We package our routines in a tool, called GenomeAtScale, that combines
the proposed algorithm with tools for processing input sequences. Our
evaluation on real data illustrates that one can use GenomeAtScale to
effectively employ tens of thousands of processors to reach new frontiers in
large-scale genomic and metagenomic analysis. While GenomeAtScale can be used
to foster DNA research, the more general underlying SimilarityAtScale algorithm
may be used for high-performance distributed similarity computations in other
data analytics application domains
Topological data analysis of contagion maps for examining spreading processes on networks
Social and biological contagions are influenced by the spatial embeddedness
of networks. Historically, many epidemics spread as a wave across part of the
Earth's surface; however, in modern contagions long-range edges -- for example,
due to airline transportation or communication media -- allow clusters of a
contagion to appear in distant locations. Here we study the spread of
contagions on networks through a methodology grounded in topological data
analysis and nonlinear dimension reduction. We construct "contagion maps" that
use multiple contagions on a network to map the nodes as a point cloud. By
analyzing the topology, geometry, and dimensionality of manifold structure in
such point clouds, we reveal insights to aid in the modeling, forecast, and
control of spreading processes. Our approach highlights contagion maps also as
a viable tool for inferring low-dimensional structure in networks.Comment: Main Text and Supplementary Informatio
Parallel Index-Based Structural Graph Clustering and Its Approximation
SCAN (Structural Clustering Algorithm for Networks) is a well-studied, widely
used graph clustering algorithm. For large graphs, however, sequential SCAN
variants are prohibitively slow, and parallel SCAN variants do not effectively
share work among queries with different SCAN parameter settings. Since users of
SCAN often explore many parameter settings to find good clusterings, it is
worthwhile to precompute an index that speeds up queries.
This paper presents a practical and provably efficient parallel index-based
SCAN algorithm based on GS*-Index, a recent sequential algorithm. Our parallel
algorithm improves upon the asymptotic work of the sequential algorithm by
using integer sorting. It is also highly parallel, achieving logarithmic span
(parallel time) for both index construction and clustering queries.
Furthermore, we apply locality-sensitive hashing (LSH) to design a novel
approximate SCAN algorithm and prove guarantees for its clustering behavior.
We present an experimental evaluation of our algorithms on large real-world
graphs. On a 48-core machine with two-way hyper-threading, our parallel index
construction achieves 50--151 speedup over the construction of
GS*-Index. In fact, even on a single thread, our index construction algorithm
is faster than GS*-Index. Our parallel index query implementation achieves
5--32 speedup over GS*-Index queries across a range of SCAN parameter
values, and our implementation is always faster than ppSCAN, a state-of-the-art
parallel SCAN algorithm. Moreover, our experiments show that applying LSH
results in faster index construction while maintaining good clustering quality
End-to-End Entity Resolution for Big Data: A Survey
One of the most important tasks for improving data quality and the
reliability of data analytics results is Entity Resolution (ER). ER aims to
identify different descriptions that refer to the same real-world entity, and
remains a challenging problem. While previous works have studied specific
aspects of ER (and mostly in traditional settings), in this survey, we provide
for the first time an end-to-end view of modern ER workflows, and of the novel
aspects of entity indexing and matching methods in order to cope with more than
one of the Big Data characteristics simultaneously. We present the basic
concepts, processing steps and execution strategies that have been proposed by
different communities, i.e., database, semantic Web and machine learning, in
order to cope with the loose structuredness, extreme diversity, high speed and
large scale of entity descriptions used by real-world applications. Finally, we
provide a synthetic discussion of the existing approaches, and conclude with a
detailed presentation of open research directions
- …