295,900 research outputs found
Sampling Large Data on Graphs
We consider the problem of sampling from data defined on the nodes of a
weighted graph, where the edge weights capture the data correlation structure.
As shown recently, using spectral graph theory one can define a cut-off
frequency for the bandlimited graph signals that can be reconstructed from a
given set of samples (i.e., graph nodes). In this work, we show how this
cut-off frequency can be computed exactly. Using this characterization, we
provide efficient algorithms for finding the subset of nodes of a given size
with the largest cut-off frequency and for finding the smallest subset of nodes
with a given cut-off frequency. In addition, we study the performance of random
uniform sampling when compared to the centralized optimal sampling provided by
the proposed algorithms.Comment: To be presented at GlobalSIP 201
Efficient Subgraph Similarity Search on Large Probabilistic Graph Databases
Many studies have been conducted on seeking the efficient solution for
subgraph similarity search over certain (deterministic) graphs due to its wide
application in many fields, including bioinformatics, social network analysis,
and Resource Description Framework (RDF) data management. All these works
assume that the underlying data are certain. However, in reality, graphs are
often noisy and uncertain due to various factors, such as errors in data
extraction, inconsistencies in data integration, and privacy preserving
purposes. Therefore, in this paper, we study subgraph similarity search on
large probabilistic graph databases. Different from previous works assuming
that edges in an uncertain graph are independent of each other, we study the
uncertain graphs where edges' occurrences are correlated. We formally prove
that subgraph similarity search over probabilistic graphs is #P-complete, thus,
we employ a filter-and-verify framework to speed up the search. In the
filtering phase,we develop tight lower and upper bounds of subgraph similarity
probability based on a probabilistic matrix index, PMI. PMI is composed of
discriminative subgraph features associated with tight lower and upper bounds
of subgraph isomorphism probability. Based on PMI, we can sort out a large
number of probabilistic graphs and maximize the pruning capability. During the
verification phase, we develop an efficient sampling algorithm to validate the
remaining candidates. The efficiency of our proposed solutions has been
verified through extensive experiments.Comment: VLDB201
Parallel Sampling-Pipeline for Indefinite Stream of Heterogeneous Graphs using OpenCL for FPGAs
In the field of data science, a huge amount of data, generally represented as graphs, needs to be processed and analyzed. It is of utmost importance that this data be processed swiftly and efficiently to save time and energy. The volume and velocity of data, along with irregular access patterns in graph data structures, pose challenges in terms of analysis and processing. Further, a big chunk of time and energy is spent on analyzing these graphs on large compute clusters and/or data-centers. Filtering and refining of data using graph sampling techniques are one of the most effective ways to speed up the analysis. Efficient accelerators, such as FPGAs, have proven to significantly lower the energy cost of running an algorithm. To this end, we present the design and implementation of a parallel graph sampling technique, for a large number of input graphs streaming into a FPGA. A parallel approach using OpenCL for FPGAs was adopted to come up with a solution that is both time- and energyefficient. We introduce a novel graph data structure, suitable for streaming graphs on FPGAs, that allows time- and memory-efficient representation of graphs. Our experiments show that our proposed technique is 3x faster and 2x more energy efficient as compared to serial CPU version of the algorithm
FS^3: A Sampling based method for top-k Frequent Subgraph Mining
Mining labeled subgraph is a popular research task in data mining because of
its potential application in many different scientific domains. All the
existing methods for this task explicitly or implicitly solve the subgraph
isomorphism task which is computationally expensive, so they suffer from the
lack of scalability problem when the graphs in the input database are large. In
this work, we propose FS^3, which is a sampling based method. It mines a small
collection of subgraphs that are most frequent in the probabilistic sense. FS^3
performs a Markov Chain Monte Carlo (MCMC) sampling over the space of a
fixed-size subgraphs such that the potentially frequent subgraphs are sampled
more often. Besides, FS^3 is equipped with an innovative queue manager. It
stores the sampled subgraph in a finite queue over the course of mining in such
a manner that the top-k positions in the queue contain the most frequent
subgraphs. Our experiments on database of large graphs show that FS^3 is
efficient, and it obtains subgraphs that are the most frequent amongst the
subgraphs of a given size
- …