74 research outputs found
Streaming Similarity Self-Join
We introduce and study the problem of computing the similarity self-join in a
streaming context (SSSJ), where the input is an unbounded stream of items
arriving continuously. The goal is to find all pairs of items in the stream
whose similarity is greater than a given threshold. The simplest formulation of
the problem requires unbounded memory, and thus, it is intractable. To make the
problem feasible, we introduce the notion of time-dependent similarity: the
similarity of two items decreases with the difference in their arrival time. By
leveraging the properties of this time-dependent similarity function, we design
two algorithmic frameworks to solve the SSSJ problem. The first one, MiniBatch
(MB), uses existing index-based filtering techniques for the static version of
the problem, and combines them in a pipeline. The second framework, Streaming
(STR), adds time filtering to the existing indexes, and integrates new
time-based bounds deeply in the working of the algorithms. We also introduce a
new indexing technique (L2), which is based on an existing state-of-the-art
indexing technique (L2AP), but is optimized for the streaming case. Extensive
experiments show that the STR algorithm, when instantiated with the L2 index,
is the most scalable option across a wide array of datasets and parameters.
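To illustrate the time-dependent similarity idea, here is a minimal sketch. It assumes cosine similarity on unit-normalized sparse vectors, damped by an exponential factor exp(-lam * |t_x - t_y|). The decay form, the names `lam` and `theta`, and the brute-force window scan are assumptions made for illustration; this is not the MB or STR framework and it uses no index. Under the assumed form, two items whose arrival times differ by more than ln(1/theta)/lam cannot exceed the threshold, so older items can be discarded and memory stays bounded.

```python
import math
from collections import deque

def dot(x, y):
    """Dot product of two sparse vectors stored as {dimension: weight} dicts."""
    if len(x) > len(y):
        x, y = y, x
    return sum(w * y.get(d, 0.0) for d, w in x.items())

def time_dependent_similarity(x, tx, y, ty, lam):
    """Assumed form: cosine similarity of unit vectors damped by exp(-lam * |tx - ty|)."""
    return dot(x, y) * math.exp(-lam * abs(tx - ty))

def horizon(theta, lam):
    """Largest time gap at which a pair can still reach theta (the dot product is at most 1)."""
    return math.log(1.0 / theta) / lam

def streaming_self_join(stream, theta, lam):
    """Brute-force join over a sliding window.

    `stream` yields (timestamp, unit_vector) pairs in non-decreasing time order.
    Returns a list of (earlier_time, later_time, similarity) for pairs above theta.
    """
    tau = horizon(theta, lam)
    window = deque()
    results = []
    for t, x in stream:
        # Items older than the horizon can never match again, so forget them.
        while window and t - window[0][0] > tau:
            window.popleft()
        for s, y in window:
            sim = time_dependent_similarity(x, t, y, s, lam)
            if sim > theta:
                results.append((s, t, sim))
        window.append((t, x))
    return results
```

The window size depends only on `theta`, `lam`, and the arrival rate, which is what makes the streaming formulation feasible in this sketch.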
Sequential Hypothesis Tests for Adaptive Locality Sensitive Hashing
All pairs similarity search is a problem where a set of data objects is given
and the task is to find all pairs of objects that have similarity above a
certain threshold for a given similarity measure-of-interest. When the number
of points or dimensionality is high, standard solutions fail to scale
gracefully. Approximate solutions such as Locality Sensitive Hashing (LSH) and
its Bayesian variants (BayesLSH and BayesLSHLite) alleviate the problem to some
extent and provide substantial speedup over traditional index-based
approaches. BayesLSH is used for pruning the candidate space and computation of
approximate similarity, whereas BayesLSHLite can only prune the candidates, but
similarity needs to be computed exactly on the original data. Thus wherever
the explicit data representation is available and exact similarity computation
is not too expensive, BayesLSHLite can be used to aggressively prune candidates
and provide substantial speedup without losing too much on quality. However,
the loss in quality is higher in the BayesLSH variant, where explicit data
representation is not available, rather only a hash sketch is available and
similarity has to be estimated approximately. In this work we revisit the LSH
problem from a Frequentist setting and formulate sequential tests for composite
hypotheses (similarity greater than or less than the threshold) that can be
leveraged by such LSH algorithms for adaptively pruning candidates
aggressively. We propose a vanilla sequential probability ratio test (SPRT)
approach based on this idea and two novel variants. We extend these variants to
the case where approximate similarity needs to be computed using a fixed-width
sequential confidence interval generation technique.
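As a rough illustration of sequential pruning, the sketch below runs Wald's classical SPRT on a stream of hash-match indicators, where each hash comparison is treated as a Bernoulli trial whose success probability reflects the pair's similarity. It simplifies the composite hypotheses to two simple ones, `p0` just below the threshold and `p1` just above it (an indifference region). That simplification, the function name `sprt_prune`, and the default error rates are assumptions for illustration, not the variants proposed in the paper.

```python
import math

def sprt_prune(matches, p0, p1, alpha=0.05, beta=0.05):
    """Wald SPRT over an iterable of hash-match indicators (True/False).

    H0: match probability is p0 (pair is below the similarity threshold).
    H1: match probability is p1 (pair is above the similarity threshold).
    alpha and beta are the target type I and type II error rates.
    Returns (decision, hashes_examined), decision in {"keep", "prune", "undecided"}.
    """
    if not 0.0 < p0 < p1 < 1.0:
        raise ValueError("require 0 < p0 < p1 < 1")
    upper = math.log((1.0 - beta) / alpha)   # cross this: accept H1
    lower = math.log(beta / (1.0 - alpha))   # cross this: accept H0
    step_match = math.log(p1 / p0)                   # log-likelihood ratio of a match
    step_miss = math.log((1.0 - p1) / (1.0 - p0))    # log-likelihood ratio of a mismatch
    llr = 0.0
    examined = 0
    for is_match in matches:
        examined += 1
        llr += step_match if is_match else step_miss
        if llr >= upper:
            return "keep", examined
        if llr <= lower:
            return "prune", examined
    # Ran out of hashes before either boundary was crossed.
    return "undecided", examined
```

Pairs that are clearly dissimilar cross the lower boundary after only a few hash comparisons, which is where the speedup from adaptive pruning would come from; pairs near the threshold consume more hashes.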
Finding Associations and Computing Similarity via Biased Pair Sampling
This version is ***superseded*** by a full version that can be found at
http://www.itu.dk/people/pagh/papers/mining-jour.pdf, which contains stronger
theoretical results and fixes a mistake in the reporting of experiments.
Abstract: Sampling-based methods have previously been proposed for the
problem of finding interesting associations in data, even for low-support
items. While these methods do not guarantee precise results, they can be vastly
more efficient than approaches that rely on exact counting. However, for many
similarity measures no such methods have been known. In this paper we show how
a wide variety of measures can be supported by a simple biased sampling method.
The method also extends to find high-confidence association rules. We
demonstrate theoretically that our method is superior to exact methods when the
threshold for "interesting similarity/confidence" is above the average pairwise
similarity/confidence, and the average support is not too low. Our method is
particularly good when transactions contain many items. We confirm in
experiments on standard association mining benchmarks that this gives a
significant speedup on real data sets (sometimes much larger than the
theoretical guarantees). Reductions in computation time of over an order of
magnitude, and significant savings in space, are observed.
Comment: This is an extended version of a paper that appeared at the IEEE
International Conference on Data Mining, 2009. The conference version is (c)
2009 IEE
An Efficient Approach for Finding Near Duplicate Web pages using Minimum Weight Overlapping Method
The existence of billions of web documents has severely affected the performance and reliability of web search. Near duplicate web pages play an important role in this performance degradation when data from heterogeneous sources are integrated, and they cause serious problems for web mining. These pages increase the index storage space and thereby increase the serving cost. Introducing efficient methods to detect and remove such documents from the Web not only decreases the computation time but also increases the relevancy of search results. We present a novel idea for finding near duplicate web pages that can be incorporated into plagiarism detection, spam detection and focused web crawling scenarios. We propose an efficient method for finding near duplicates of an input web page in a huge repository. The proposed TDW matrix-based algorithm has three phases: rendering, filtering and verification. It receives an input web page and a threshold in the first phase, applies prefix filtering and positional filtering to reduce the size of the record set in the second phase, and returns an optimal set of near duplicate web pages in the verification phase by using the Minimum Weight Overlapping (MWO) method. The experimental results show that our algorithm improves on two benchmark measures, precision and recall, and reduces the size of the competing record set.
DOI: http://dx.doi.org/10.11591/ijece.v1i2.7
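The filter-then-verify structure described above can be sketched with generic prefix filtering. The example assumes pages are reduced to token sets and compared with Jaccard similarity; it does not implement the TDW matrix, positional filtering, or the MWO verification step, whose details are not given here. The function names and the rarest-token-first ordering are illustrative assumptions.

```python
import math
from collections import Counter, defaultdict

def jaccard(a, b):
    """Exact Jaccard similarity of two token collections."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a or b) else 1.0

def near_duplicates(query, repository, threshold):
    """Return sorted (doc_id, similarity) pairs with Jaccard >= threshold.

    `repository` maps doc_id -> list of tokens; `query` is a list of tokens.
    """
    # Global token order: rarest tokens first, ties broken alphabetically.
    freq = Counter(tok for doc in repository.values() for tok in set(doc))

    def prefix(tokens):
        ordered = sorted(set(tokens), key=lambda tok: (freq.get(tok, 0), tok))
        n = len(ordered)
        # Two sets with Jaccard >= threshold must share a token within their
        # first n - ceil(threshold * n) + 1 tokens; the epsilon guards against
        # floating-point round-up.
        k = n - math.ceil(threshold * n - 1e-9) + 1
        return ordered[:k]

    # Filtering phase: inverted index over prefix tokens only.
    index = defaultdict(set)
    for doc_id, doc in repository.items():
        for tok in prefix(doc):
            index[tok].add(doc_id)
    candidates = set()
    for tok in prefix(query):
        candidates |= index.get(tok, set())

    # Verification phase: exact similarity on the surviving candidates.
    results = []
    for doc_id in candidates:
        sim = jaccard(query, repository[doc_id])
        if sim >= threshold:
            results.append((doc_id, sim))
    return sorted(results)
```

Only documents that share a rare prefix token with the query reach the verification phase, which is how prefix filtering shrinks the competing record set before any exact comparison is made.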