10 research outputs found

Enumerating Maximal Bicliques from a Large Graph using MapReduce

Author: Mukherjee Arko Provo
Tirthapura Srikanta
Publication date: 01/01/2014

We consider the enumeration of maximal bipartite cliques (bicliques) from a large graph, a task central to many practical data mining problems in social network analysis and bioinformatics. We present novel parallel algorithms for the MapReduce platform, and an experimental evaluation using Hadoop MapReduce. Our algorithm is based on clustering the input graph into smaller subgraphs, followed by processing different subgraphs in parallel. Our algorithm uses two ideas that enable it to scale to large graphs: (1) the redundancy in work between different subgraph explorations is minimized through a careful pruning of the search space, and (2) the load on different reducers is balanced through the use of an appropriate total order among the vertices. Our evaluation shows that the algorithm scales to large graphs with millions of edges and tens of millions of maximal bicliques. To our knowledge, this is the first work on maximal biclique enumeration for graphs of this scale. Comment: A preliminary version of the paper was accepted at the Proceedings of the 3rd IEEE International Congress on Big Data 201
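To make the object being enumerated concrete, the following is a minimal brute-force sketch of maximal biclique enumeration on a tiny graph. It is not the MapReduce algorithm from the abstract: it examines every vertex subset and is only usable on toy inputs. The function name `maximal_bicliques`, the adjacency-dictionary input, and the choice to allow edges inside either side of a biclique (non-induced bicliques) are assumptions made for illustration.

```python
from itertools import combinations


def maximal_bicliques(adj):
    """Brute-force enumeration of maximal bicliques (illustrative only).

    adj: dict mapping each vertex to the set of its neighbours
         (undirected graph, no self-loops). Hypothetical input format.
    Returns a set of unordered pairs {L, R}, each side a frozenset.
    """
    def common_neighbours(vertices):
        # Vertices adjacent to every member of `vertices`.
        return set.intersection(*(adj[v] for v in vertices))

    found = set()
    vertices = sorted(adj)
    for size in range(1, len(vertices) + 1):
        for left in combinations(vertices, size):
            right = common_neighbours(left)
            if not right:
                continue  # no vertex is adjacent to all of `left`
            # (left, right) is maximal exactly when `left` is the full
            # set of common neighbours of `right`.
            if common_neighbours(right) == set(left):
                # Store as an unordered pair so (L, R) and (R, L) coincide.
                found.add(frozenset((frozenset(left), frozenset(right))))
    return found


# A 4-cycle 1-3-2-4-1 with an extra pendant vertex 5 attached to 2.
example = {1: {3, 4}, 2: {3, 4, 5}, 3: {1, 2}, 4: {1, 2}, 5: {2}}
bicliques = maximal_bicliques(example)
```

On this example graph the sketch reports two maximal bicliques: ({1, 2}, {3, 4}) and ({2}, {3, 4, 5}). The exponential subset scan is the cost that the pruning and load-balancing ideas in the abstract are meant to avoid at scale.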

arXiv.org e-Print Archive

Digital Repository @ Iowa State University (ISU)

Crossref

Mining dense substructures from large deterministic and probabilistic graphs

Author: Mukherjee Arko Provo
Publication venue: Iowa State University Digital Repository
Publication date: 01/01/2015

Graphs represent relationships. Some relationships can be represented as a deterministic graph, while others can only be represented using probabilities. Mining dense structures from graphs helps us find useful patterns in these relationships, with applications in areas such as social network analysis and bioinformatics. Arguably the two most fundamental dense substructures are Maximal Cliques and Maximal Bicliques. The enumeration of both these structures is central to many data mining problems. With the advent of “big data”, real world graphs have become massive. Recently, systems like MapReduce have evolved to process such large data. However, using these systems to mine dense substructures in massive graphs is an open question. In this thesis, we present novel parallel algorithms using MapReduce for the enumeration of Maximal Cliques / Bicliques in large graphs. We show that our algorithms are work optimal and load balanced. Further, we present a detailed evaluation which shows that the algorithm scales to large graphs with millions of edges and tens of millions of output structures. Finally, we consider the problem of Maximal Clique Enumeration in an Uncertain Graph, which is a probability distribution on a set of deterministic graphs. We define the notion of a maximal clique for an uncertain graph, give matching upper and lower bounds on the number of such structures, and present a near optimal algorithm to mine all maximal cliques.

Digital Repository @ Iowa State University (ISU)

Identifying Correlated Heavy-Hitters in a Two-Dimensional Data Stream

Author: Lahiri Bibudh
Mukherjee Arko Provo
Tirthapura Srikanta
Publication date: 03/10/2013

We consider online mining of correlated heavy-hitters from a data stream. Given a stream of two-dimensional data, a correlated aggregate query first extracts a substream by applying a predicate along a primary dimension, and then computes an aggregate along a secondary dimension. Prior work on identifying heavy-hitters in streams has almost exclusively focused on identifying heavy-hitters on a single-dimensional stream, and these yield little insight into the properties of heavy-hitters along other dimensions. In typical applications, however, an analyst is interested not only in identifying heavy-hitters, but also in understanding further properties such as: what other items appear frequently along with a heavy-hitter, or what is the frequency distribution of items that appear along with the heavy-hitters. We consider queries of the following form: In a stream S of (x, y) tuples, on the substream H of all x values that are heavy-hitters, maintain those y values that occur frequently with the x values in H. We call this problem Correlated Heavy-Hitters (CHH). We give an approximate formulation of CHH identification, and present an algorithm for tracking CHHs on a data stream. The algorithm is easy to implement and uses workspace which is orders of magnitude smaller than the stream itself. We present provable guarantees on the maximum error, as well as detailed experimental results that demonstrate the space-accuracy trade-off.
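The query form in this abstract can be illustrated with an exact, non-streaming baseline. The sketch below is not the small-space approximate algorithm the abstract describes: it keeps every counter in memory, which is precisely what a streaming algorithm avoids. The function name and the two thresholds `phi1` (fraction of the stream an x value must reach to count as a heavy-hitter) and `phi2` (fraction of that x value's occurrences a y value must reach) are assumptions introduced for illustration.

```python
from collections import Counter, defaultdict


def exact_correlated_heavy_hitters(stream, phi1, phi2):
    """Exact CHH baseline over a finite list of (x, y) tuples.

    phi1, phi2 are hypothetical thresholds in (0, 1]; the abstract does
    not state the thresholds, so these are assumptions of this sketch.
    Returns a dict: heavy x value -> set of y values frequent with it.
    """
    x_counts = Counter()                # occurrences of each x
    pair_counts = defaultdict(Counter)  # occurrences of each y, per x
    n = 0
    for x, y in stream:
        n += 1
        x_counts[x] += 1
        pair_counts[x][y] += 1

    result = {}
    for x, fx in x_counts.items():
        if fx >= phi1 * n:  # x is a heavy-hitter along the primary dimension
            result[x] = {
                y for y, fxy in pair_counts[x].items()
                if fxy >= phi2 * fx  # y is frequent within x's substream
            }
    return result


stream = [("a", 1)] * 5 + [("a", 2)] + [("b", 1)] * 2 + [("c", 3)] * 2
chh = exact_correlated_heavy_hitters(stream, phi1=0.5, phi2=0.5)
```

In this toy stream of 10 tuples, only "a" reaches half of the stream, and among its 6 occurrences only y = 1 reaches half, so `chh` is `{"a": {1}}`.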

arXiv.org e-Print Archive

CiteSeerX

Mining Maximal Cliques from an Uncertain Graph

Author: Mukherjee Arko Provo
Tirthapura Srikanta
Xu Pan
Publication date: 22/10/2014

We consider mining dense substructures (maximal cliques) from an uncertain graph, which is a probability distribution on a set of deterministic graphs. For parameter 0 < α < 1, we present a precise definition of an α-maximal clique in an uncertain graph. We present matching upper and lower bounds on the number of α-maximal cliques possible within an uncertain graph. We present an algorithm to enumerate α-maximal cliques in an uncertain graph whose worst-case runtime is near-optimal, and an experimental evaluation showing the practical utility of the algorithm. Comment: ICDE 201
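The abstract does not spell out the definition of an α-maximal clique, so the sketch below works under stated assumptions: each edge exists independently with a given probability, the clique probability of a vertex set is the product of the probabilities of all its vertex pairs (zero if any pair has no edge), a set is an α-clique when that product is at least α, and it is α-maximal when no single vertex can be added without dropping below α. All names here are hypothetical, and the exhaustive scan is for illustration only, not the near-optimal algorithm the abstract refers to.

```python
from itertools import combinations
from math import prod


def clique_probability(vertices, edge_prob):
    """Probability that `vertices` form a clique, assuming independent edges.

    edge_prob maps frozenset({u, v}) -> probability; a missing pair counts as 0.
    A single vertex has probability 1 (empty product).
    """
    return prod(edge_prob.get(frozenset(pair), 0.0)
                for pair in combinations(vertices, 2))


def alpha_maximal_cliques(vertices, edge_prob, alpha):
    """Brute-force listing of alpha-maximal cliques (illustrative only)."""
    vertices = sorted(vertices)
    result = []
    for size in range(1, len(vertices) + 1):
        for cand in combinations(vertices, size):
            if clique_probability(cand, edge_prob) < alpha:
                continue
            # Adding vertices never increases the product, so testing
            # single-vertex extensions is enough to establish maximality.
            extendable = any(
                clique_probability(cand + (v,), edge_prob) >= alpha
                for v in vertices if v not in cand
            )
            if not extendable:
                result.append(set(cand))
    return result


edge_prob = {
    frozenset({1, 2}): 0.9,
    frozenset({1, 3}): 0.9,
    frozenset({2, 3}): 0.9,
    frozenset({3, 4}): 0.6,
}
cliques_half = alpha_maximal_cliques([1, 2, 3, 4], edge_prob, alpha=0.5)
```

With α = 0.5 the triangle {1, 2, 3} (probability about 0.729) and the edge {3, 4} (probability 0.6) are the α-maximal cliques. Raising α to 0.8 breaks the triangle into its three edges and leaves vertex 4 on its own, which shows how the parameter controls the size of the reported structures.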

arXiv.org e-Print Archive

Digital Repository @ Iowa State University (ISU)

CiteSeerX

Enumerating Maximal Bicliques from a Large Graph Using MapReduce

Author: Mukherjee Arko Provo
Tirthapura Srikanta
Publication venue: Iowa State University Digital Repository
Publication date: 01/01/2017

We consider the enumeration of maximal bipartite cliques (bicliques) from a large graph, a task central to many data mining problems arising in social network analysis and bioinformatics. We present novel parallel algorithms for the MapReduce framework, and an experimental evaluation using Hadoop MapReduce. Our algorithm is based on clustering the input graph into smaller subgraphs, followed by processing different subgraphs in parallel. Our algorithm uses two ideas that enable it to scale to large graphs: (1) the redundancy in work between different subgraph explorations is minimized through a careful pruning of the search space, and (2) the load on different reducers is balanced through a task assignment that is based on an appropriate total order among the vertices. We show theoretically that our algorithm is work optimal, i.e., it performs the same total work as its sequential counterpart. We present a detailed evaluation which shows that the algorithm scales to large graphs with millions of edges and tens of millions of maximal bicliques. To our knowledge, this is the first work on maximal biclique enumeration for graphs of this scale.

Digital Repository @ Iowa State University (ISU)

Crossref

Enumerating Maximal Bicliques from a Large Graph Using MapReduce

Author: Arko Provo Mukherjee
Srikanta Tirthapura
Publication venue: Institute of Electrical and Electronics Engineers (IEEE)

Crossref

Mining maximal cliques from a large graph using MapReduce: Tackling highly uneven subproblem sizes

Author: Arko Provo Mukherjee
Michael Svendsen
Srikanta Tirthapura
Publication venue: Elsevier BV

Crossref