Efficient Compression Technique for Sparse Sets
Recent technological advancements have led to the generation of huge amounts
of data over the web, such as text, image, audio and video. Most of this data
is high-dimensional and sparse, e.g., the bag-of-words representation used
for representing text. Often, an efficient search for similar data points needs
to be performed in many applications like clustering, nearest neighbour search,
ranking and indexing. Even though there have been significant increases in
computational power, a simple brute-force similarity-search on such datasets is
inefficient and at times impossible. Thus, it is desirable to get a compressed
representation which preserves the similarity between data points. In this
work, we consider the data points as sets and use Jaccard similarity as the
similarity measure. Compression techniques are generally evaluated on the
following parameters: 1) Randomness required for compression, 2) Time required
for compression, 3) Dimension of the data after compression, and 4) Space
required to store the compressed data. Ideally, the compressed representation
of the data should be such that the similarity between each pair of data
points is preserved, while keeping the time and the randomness required for
compression as low as possible.
We show that the compression technique suggested by Pratap and Kulkarni also
works well for Jaccard similarity. We present a theoretical proof of the same
and complement it with rigorous experiments on synthetic as well as
real-world datasets. We also compare our results with the state-of-the-art
"min-wise independent permutation", and show that our compression algorithm
achieves almost equal accuracy while significantly reducing the compression
time and the randomness
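
As an illustration of the baseline named in this abstract, the following is a minimal sketch of Jaccard similarity and its estimation with min-wise hashing. It is not the compression scheme of Pratap and Kulkarni, which the abstract does not describe. The function names, the use of salted BLAKE2b as a stand-in for random permutations, and the signature length are all assumptions made for this example.

```python
import hashlib


def jaccard(a, b):
    """Exact Jaccard similarity |A & B| / |A | B| of two sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def _hash(item, salt):
    # Assumption: a salted 64-bit BLAKE2b digest stands in for one random
    # permutation of the universe; the salt selects the "permutation".
    digest = hashlib.blake2b(
        str(item).encode("utf-8"),
        digest_size=8,
        salt=salt.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "big")


def minhash_signature(items, num_hashes=256):
    """For each of num_hashes salted hashes, keep the minimum over the set."""
    if not items:
        raise ValueError("cannot sketch an empty set")
    return [min(_hash(x, salt) for x in items) for salt in range(num_hashes)]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions on which the two sets agree."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


if __name__ == "__main__":
    a = set(range(0, 100))
    b = set(range(50, 150))
    print("exact:   ", jaccard(a, b))
    print("estimate:", estimate_jaccard(minhash_signature(a), minhash_signature(b)))
```

Each signature position costs one hash function over the whole set, which is the time and randomness overhead that the abstract says its method reduces.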
Efficiently Clustering Very Large Attributed Graphs
Attributed graphs model real networks by enriching their nodes with
attributes accounting for properties. Several techniques have been proposed for
partitioning these graphs into clusters that are homogeneous with respect to
both the semantic attributes and the structure of the graph. However, the time and
space complexities of state-of-the-art algorithms limit their scalability to
medium-sized graphs. We propose SToC (for Semantic-Topological Clustering), a
fast and scalable algorithm for partitioning large attributed graphs. The
approach is robust, being compatible both with categorical and with
quantitative attributes, and it is tailorable, allowing the user to weight the
semantic and topological components. Further, the approach does not require the
user to guess in advance the number of clusters. SToC relies on well-known
approximation techniques such as bottom-k sketches, traditional graph-theoretic
concepts, and a new perspective on the composition of heterogeneous distance
measures. Experimental results demonstrate its ability to efficiently compute
high-quality partitions of large scale attributed graphs.
Comment: This work has been published in ASONAM 2017. This version includes an
appendix with validation of our attribute model and distance function,
omitted in the conference version for lack of space. Please refer to the
published versio
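
The abstract above mentions bottom-k sketches as one of the approximation techniques SToC builds on. The following is a minimal sketch of that general technique only, not SToC's implementation: it keeps the k smallest hash values of a set and estimates the Jaccard similarity of two sets from their sketches. The function names, the BLAKE2b hash, and the choice of k are assumptions made for this example.

```python
import hashlib
import heapq


def _hash(item):
    # Assumption: a 64-bit BLAKE2b digest is used as the shared hash function.
    digest = hashlib.blake2b(str(item).encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def bottom_k(items, k):
    """Return the k smallest distinct hash values of the set, sorted."""
    return heapq.nsmallest(k, {_hash(x) for x in items})


def estimate_jaccard(sketch_a, sketch_b, k):
    """Estimate Jaccard similarity from two bottom-k sketches.

    The k smallest values of the merged sketches form a bottom-k sketch of
    the union; the fraction of them present in both inputs estimates
    |A & B| / |A | B|.
    """
    in_a, in_b = set(sketch_a), set(sketch_b)
    union_sketch = heapq.nsmallest(k, in_a | in_b)
    if not union_sketch:
        return 1.0  # both sets empty
    shared = sum(1 for v in union_sketch if v in in_a and v in in_b)
    return shared / len(union_sketch)


if __name__ == "__main__":
    k = 200
    a = set(range(0, 1000))
    b = set(range(500, 1500))
    print(estimate_jaccard(bottom_k(a, k), bottom_k(b, k), k))
```

Unlike a signature built from many hash functions, a bottom-k sketch needs a single hash pass over each set, which is one reason such sketches suit very large inputs.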
Early Accurate Results for Advanced Analytics on MapReduce
Approximate results based on samples often provide the only way in which
advanced analytical applications on massive data sets can satisfy their
time and resource constraints. Unfortunately, methods and tools for the
computation of accurate early results are currently not supported in
MapReduce-oriented systems although these are intended for 'big data'.
Therefore, we proposed and implemented a non-parametric extension of Hadoop
which allows the incremental computation of early results for arbitrary
work-flows, along with reliable on-line estimates of the degree of accuracy
achieved so far in the computation. These estimates are based on a technique
called bootstrapping that has been widely employed in statistics and can be
applied to arbitrary functions and data distributions. In this paper, we
describe our Early Accurate Result Library (EARL) for Hadoop that was designed
to minimize the changes required to the MapReduce framework. Various tests of
EARL for Hadoop are presented to characterize the frequent situations where EARL
can provide major speed-ups over the current version of Hadoop.
Comment: VLDB201
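
The idea in the abstract above, an early result computed on a sample together with a bootstrap estimate of its accuracy, can be sketched in plain Python. This is not EARL's API and does not involve Hadoop. The function names, the doubling schedule, and the relative-error stopping rule are all assumptions made for this example.

```python
import random
import statistics


def bootstrap_error(sample, stat, n_resamples=200, seed=0):
    """Return (point estimate, bootstrap standard error) of stat on sample."""
    rng = random.Random(seed)
    n = len(sample)
    # Resample with replacement and recompute the statistic each time.
    estimates = [stat(rng.choices(sample, k=n)) for _ in range(n_resamples)]
    return stat(sample), statistics.stdev(estimates)


def early_result(data, stat, target_rel_err=0.05, start=100, seed=0):
    """Grow a random sample until the bootstrap relative error is small.

    Returns (estimate, standard error, sample size used). The doubling
    schedule and the stopping rule are illustrative choices.
    """
    rng = random.Random(seed)
    shuffled = list(data)
    rng.shuffle(shuffled)
    size = min(start, len(shuffled))
    while True:
        point, std_err = bootstrap_error(shuffled[:size], stat, seed=seed)
        accurate = point != 0 and std_err / abs(point) <= target_rel_err
        if accurate or size == len(shuffled):
            return point, std_err, size
        size = min(2 * size, len(shuffled))


if __name__ == "__main__":
    gen = random.Random(42)
    data = [gen.gauss(100.0, 10.0) for _ in range(10_000)]
    estimate, std_err, used = early_result(data, statistics.mean)
    print(f"mean ~ {estimate:.2f} +/- {std_err:.2f} using {used} of {len(data)} rows")
```

Because the bootstrap only needs to re-evaluate the statistic on resampled data, the same loop works for any function passed as `stat`, which matches the abstract's point that the technique applies to arbitrary functions and data distributions.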