Efficient Triangle Counting in Large Graphs via Degree-based Vertex Partitioning
The number of triangles is a computationally expensive graph statistic which
is frequently used in complex network analysis (e.g., transitivity ratio), in
various random graph models (e.g., exponential random graph model) and in
important real world applications such as spam detection, uncovering of the
hidden thematic structure of the Web and link recommendation. Counting
triangles in graphs with millions or billions of edges requires algorithms that
run fast, use a small amount of space, provide accurate estimates of the number
of triangles and, preferably, are parallelizable.
In this paper we present an efficient triangle counting algorithm which can
be adapted to the semi-streaming model. The key idea of our algorithm is to
combine the sampling algorithm of Tsourakakis et al. with the partitioning of
the vertex set into a high-degree and a low-degree subset, as in the work of
Alon, Yuster and Zwick, treating each subset appropriately. We obtain a
running time of $O\big(m + \frac{m^{3/2}\Delta\log n}{t\epsilon^2}\big)$
and an $\epsilon$-approximation (multiplicative error), where $n$ is the number
of vertices, $m$ the number of edges, $t$ the number of triangles and $\Delta$
the maximum number of triangles an edge is contained in.
Furthermore, we show how this algorithm can be adapted to the semi-streaming
model with space usage $O\big(\sqrt{m}\log n + \frac{m^{3/2}\Delta\log n}{t\epsilon^2}\big)$
and a constant number of passes (three) over the graph
stream.
stream. We apply our methods in various networks with several millions of edges
and we obtain excellent results. Finally, we propose a random projection based
method for triangle counting and provide a sufficient condition to obtain an
estimate with low variance.
Comment: 1) 12 pages 2) To appear in the 7th Workshop on Algorithms and Models
for the Web Graph (WAW 2010).
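The sparsify-and-count idea behind the sampling component can be sketched in a few lines. This is an illustrative toy version of Doulion-style edge sampling (Tsourakakis et al.), without the degree-based partitioning or the streaming adaptation described in the abstract; all function names are ours:

```python
import random
from itertools import combinations

def count_triangles(adj):
    """Exact count: each triangle {u, v, w} is seen once from each of its
    three vertices, so divide the raw closed-pair count by 3."""
    count = 0
    for v, nbrs in adj.items():
        for a, b in combinations(sorted(nbrs), 2):
            if b in adj.get(a, set()):
                count += 1
    return count // 3

def doulion_estimate(edges, p, seed=0):
    """Keep each edge independently with probability p and count triangles
    in the sparsified graph; scaling by 1/p^3 makes the estimate unbiased,
    since a triangle survives only if all three of its edges do."""
    rng = random.Random(seed)
    adj = {}
    for u, v in edges:
        if rng.random() < p:
            adj.setdefault(u, set()).add(v)
            adj.setdefault(v, set()).add(u)
    return count_triangles(adj) / p ** 3
```

With p = 1 no edge is dropped and the estimate is exact; smaller p trades accuracy for speed and space.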
MapReduce and Streaming Algorithms for Diversity Maximization in Metric Spaces of Bounded Doubling Dimension
Given a dataset of points in a metric space and an integer $k$, a diversity
maximization problem requires determining a subset of $k$ points maximizing
some diversity objective measure, e.g., the minimum or the average distance
between two points in the subset. Diversity maximization is computationally
hard, hence only approximate solutions can be hoped for. Although its
applications are mainly in massive data analysis, most of the past research on
diversity maximization focused on the sequential setting. In this work we
present space and pass/round-efficient diversity maximization algorithms for
the Streaming and MapReduce models and analyze their approximation guarantees
for the relevant class of metric spaces of bounded doubling dimension. Like
other approaches in the literature, our algorithms rely on the determination of
high-quality core-sets, i.e., (much) smaller subsets of the input which contain
good approximations to the optimal solution for the whole input. For a variety
of diversity objective functions, our algorithms attain an
$(\alpha+\epsilon)$-approximation ratio, for any constant $\epsilon > 0$, where
$\alpha$ is the best approximation ratio achieved by a polynomial-time,
linear-space sequential algorithm for the same diversity objective. This
improves substantially over the approximation ratios attainable in Streaming
and MapReduce by state-of-the-art algorithms for general metric spaces. We
provide extensive experimental evidence of the effectiveness of our algorithms
on both real-world and synthetic datasets, scaling up to over a billion points.
Comment: Extended version of
http://www.vldb.org/pvldb/vol10/p469-ceccarello.pdf, PVLDB Volume 10, No. 5,
January 2017.
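The core-set idea can be illustrated with the classic greedy farthest-first (Gonzalez-style) selection, a standard building block for diversity core-sets. This is a sequential toy sketch under our own naming, not the paper's Streaming/MapReduce algorithms:

```python
from itertools import combinations

def gmm_coreset(points, k, dist):
    """Greedy farthest-first (Gonzalez) selection: start from the first
    point, then repeatedly add the point farthest from the chosen set."""
    core = [points[0]]
    while len(core) < k:
        core.append(max(points, key=lambda p: min(dist(p, c) for c in core)))
    return core

def remote_edge(subset, dist):
    """The 'remote-edge' diversity objective: minimum pairwise distance."""
    return min(dist(a, b) for a, b in combinations(subset, 2))
```

On the line metric {0, 1, 9, 10} with k = 2, the selection picks the two extreme points 0 and 10, which also maximize the remote-edge objective.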
Online Row Sampling
Finding a small spectral approximation for a tall $n \times d$ matrix $A$ is
a fundamental numerical primitive. For a number of reasons, one often seeks an
approximation whose rows are sampled from those of $A$. Row sampling improves
interpretability, saves space when $A$ is sparse, and preserves row structure,
which is especially important, for example, when $A$ represents a graph.
However, correctly sampling rows from $A$ can be costly when the matrix is
large and cannot be stored and processed in memory. Hence, a number of recent
publications focus on row sampling in the streaming setting, using little more
space than what is required to store the outputted approximation [KL13,
KLM+14].
Inspired by a growing body of work on online algorithms for machine learning
and data analysis, we extend this work to a more restrictive online setting: we
read the rows of $A$ one by one and immediately decide whether each row should
be kept in the spectral approximation or discarded, without ever retracting these
decisions. We present an extremely simple algorithm that approximates $A$ up to
multiplicative error $\epsilon$ and additive error $\delta$ using
$O(d \log d \, \log(\epsilon\|A\|_2^2/\delta)/\epsilon^2)$ online samples, with
memory overhead proportional to the cost of storing the spectral approximation.
We also present an algorithm that uses $O(d^2)$ memory but only requires
$O(d \log d/\epsilon^2)$ samples, which we show is
optimal.
Our methods are clean and intuitive, allow for lower memory usage than prior
work, and expose new theoretical properties of leverage-score-based matrix
approximation.
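The online setting can be illustrated with a toy ridge-leverage-score sampler. This is a generic sketch, not the paper's algorithm: the oversampling constant `c`, the ridge term `delta`, and the choice to score each row against the exact Gram matrix seen so far are our illustrative assumptions:

```python
import numpy as np

def online_row_sample(rows, eps, delta, seed=0):
    """Process rows one at a time; keep row a with probability proportional
    to its online ridge leverage score tau = a^T (A^T A + delta*I)^{-1} a,
    computed against the rows seen so far, and rescale kept rows by
    1/sqrt(p) so the sample's Gram matrix is unbiased. Decisions are final:
    a discarded row is never revisited."""
    rng = np.random.default_rng(seed)
    d = len(rows[0])
    M = delta * np.eye(d)            # running A^T A + delta * I
    c = 8.0 * np.log(d) / eps ** 2   # oversampling constant (illustrative)
    kept = []
    for a in rows:
        a = np.asarray(a, dtype=float)
        M += np.outer(a, a)
        tau = float(a @ np.linalg.solve(M, a))  # online leverage score
        p = min(1.0, c * tau)
        if rng.random() < p:
            kept.append(a / np.sqrt(p))
    return np.array(kept)
```

For well-conditioned inputs and small `eps`, every leverage score is large, each keep-probability saturates at 1, and the sample reproduces the input exactly; the interesting regime is skewed inputs, where low-leverage rows are dropped.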
On Counting Triangles through Edge Sampling in Large Dynamic Graphs
Traditional frameworks for dynamic graphs have relied on processing only the
stream of edges added into or deleted from an evolving graph, but not any
additional related information such as the degrees or neighbor lists of nodes
incident to the edges. In this paper, we propose a new edge sampling framework
for big-graph analytics in dynamic graphs which enhances the traditional model
by enabling the use of additional related information. To demonstrate the
advantages of this framework, we present a new sampling algorithm, called Edge
Sample and Discard (ESD). It generates an unbiased estimate of the total number
of triangles, which can be continuously updated in response to both edge
additions and deletions. We provide a comparative analysis of the performance
of ESD against two current state-of-the-art algorithms in terms of accuracy and
complexity. The results of the experiments performed on real graphs show that,
with the help of the neighborhood information of the sampled edges, the
accuracy achieved by our algorithm is substantially better. We also
characterize the impact of properties of the graph on the performance of our
algorithm by testing on several Barabasi-Albert graphs.
Comment: A short version of this article appeared in Proceedings of the 2017
IEEE/ACM International Conference on Advances in Social Networks Analysis and
Mining (ASONAM 2017).
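The unbiasedness argument — a uniformly sampled edge (u, v), combined with the neighbor lists of its endpoints, yields an unbiased triangle estimate — can be sketched as follows. This is a static, simplified illustration of edge sampling with neighborhood information, not the full ESD algorithm with its handling of edge additions and deletions:

```python
import random

def triangle_estimate(edges, adj, samples, seed=0):
    """Sample edges uniformly with replacement; an edge (u, v) lies in
    |N(u) ∩ N(v)| triangles, and summing that quantity over all m edges
    counts every triangle exactly 3 times (once per edge), so
    m * common / 3 is an unbiased estimate of the triangle count."""
    rng = random.Random(seed)
    m = len(edges)
    total = 0
    for _ in range(samples):
        u, v = rng.choice(edges)
        total += len(adj[u] & adj[v])  # neighborhood info of the sampled edge
    return m * total / (3 * samples)
```

Because the common-neighbor count is queried exactly for each sampled edge, the per-sample variance is far lower than for schemes that must see both wedge edges land in the sample.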
FLEET: Butterfly Estimation from a Bipartite Graph Stream
We consider space-efficient single-pass estimation of the number of
butterflies, a fundamental bipartite graph motif, from a massive bipartite
graph stream where each edge represents a connection between entities in two
different partitions. We present a space lower bound for any streaming
algorithm that can estimate the number of butterflies accurately, as well as
FLEET, a suite of algorithms for accurately estimating the number of
butterflies in the graph stream. Estimates returned by the algorithms come with
provable guarantees on the approximation error, and experiments show good
tradeoffs between the space used and the accuracy of approximation. We also
present space-efficient algorithms for estimating the number of butterflies
within a sliding window of the most recent elements in the stream. While there
is a significant body of work on counting subgraphs such as triangles in a
unipartite graph stream, our work seems to be one of the few to tackle the case
of bipartite graph streams.
Comment: This is the author's version of the work. It is posted here by
permission of ACM for your personal use. Not for redistribution. The
definitive version was published in Seyed-Vahid Sanei-Mehri, Yu Zhang, Ahmet
Erdem Sariyuce and Srikanta Tirthapura, "FLEET: Butterfly Estimation from a
Bipartite Graph Stream", The 28th ACM International Conference on Information
and Knowledge Management.
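A minimal sparsify-and-scale sketch conveys the estimation principle (this is not the FLEET algorithms themselves, which use adaptive reservoir sampling over the stream): a butterfly is a (2,2)-biclique with 4 edges, so if each edge survives independently with probability p, a butterfly survives with probability p^4:

```python
import random
from itertools import combinations
from math import comb

def count_butterflies(edges):
    """Exact butterfly ((2,2)-biclique) count in a bipartite edge list
    of (left, right) pairs: each pair of right-side vertices sharing c
    left-side neighbors closes comb(c, 2) butterflies."""
    nbrs = {}                      # right vertex -> set of left neighbors
    for u, v in edges:
        nbrs.setdefault(v, set()).add(u)
    return sum(comb(len(nbrs[a] & nbrs[b]), 2)
               for a, b in combinations(nbrs, 2))

def sampled_butterfly_estimate(edges, p, seed=0):
    """Keep each edge independently with probability p; a butterfly
    survives only if all 4 of its edges do, so scale by 1/p^4."""
    rng = random.Random(seed)
    kept = [e for e in edges if rng.random() < p]
    return count_butterflies(kept) / p ** 4
```

On the complete bipartite graph K_{2,2} the exact count is 1, and with p = 1 the estimator recovers it exactly.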
On Graph Stream Clustering with Side Information
Graph clustering has become an important problem due to emerging applications
involving the web, social networks and bio-informatics. Recently, many such
applications have begun to generate data in the form of streams. Clustering
massive, dynamic graph streams is significantly challenging because of the
complex structures of graphs and the computational difficulties of continuous
data. Meanwhile, a large volume of side information is associated with graphs,
and it can be of various types. Examples include the properties of users in
social network activities, the meta attributes associated with web click graph
streams and the location information in mobile communication networks. Such
attributes contain extremely useful information and have the potential to
improve the clustering process, but are neglected by most recent graph stream
mining techniques. In
this paper, we define a unified distance measure on both link structures and
side attributes for clustering. In addition, we propose a novel optimization
framework DMO, which can dynamically optimize the distance metric and make it
adapt to the newly received stream data. We further introduce a carefully
designed statistic, SGS(C), which consumes constant storage space as the
stream progresses. We demonstrate that the maintained statistics are
sufficient for both the clustering process and the distance optimization, and
scale to massive graphs with side attributes. We present experimental results
showing the advantages of the approach to graph stream clustering with both
links and side information over the baselines.
Comment: Full version of SIAM SDM 2013 paper.
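One simple way to combine link structure and side attributes into a single distance, in the spirit of (but much simpler than) the dynamically optimized metric of the DMO framework. The Jaccard/Hamming mix and the fixed weight `alpha` are our illustrative choices, not the paper's learned metric:

```python
def combined_distance(nbrs_u, nbrs_v, attrs_u, attrs_v, alpha=0.5):
    """Mix a structural distance (Jaccard distance on neighbor sets) with
    an attribute distance (normalized Hamming distance over equal-length
    side-attribute lists); alpha weights the two components."""
    union = nbrs_u | nbrs_v
    structural = 1.0 - len(nbrs_u & nbrs_v) / len(union) if union else 0.0
    attribute = sum(a != b for a, b in zip(attrs_u, attrs_v)) / len(attrs_u)
    return alpha * structural + (1.0 - alpha) * attribute
```

Two vertices with identical neighborhoods and attributes are at distance 0; fully disjoint neighborhoods with all attributes differing give distance 1. In the paper's setting the weighting itself adapts to the stream rather than staying fixed.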