    Interpretations of Association Rules by Granular Computing

    We present interpretations for association rules. We first introduce Pawlak's method and the corresponding algorithm for finding decision rules (a kind of association rule). We then use extended random sets to present a new algorithm for finding interesting rules, and we prove that the new algorithm is faster than Pawlak's. Extended random sets can easily incorporate more than one criterion for determining interesting rules. We also provide two measures for dealing with uncertainties in association rules.
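
    To ground the notion of "interesting rules", here is a minimal sketch (not the paper's granular-computing algorithm; the transaction table and rule are illustrative) of the two standard interestingness measures, support and confidence, for a candidate rule X => Y:

    ```python
    # Support and confidence of an association rule over a toy transaction table.
    transactions = [
        {"a", "b", "c"},
        {"a", "b"},
        {"a", "c"},
        {"b", "c"},
    ]

    def support(itemset, transactions):
        """Fraction of transactions containing every item of `itemset`."""
        return sum(itemset <= t for t in transactions) / len(transactions)

    def confidence(lhs, rhs, transactions):
        """Estimated Pr[rhs holds | lhs holds]."""
        return support(lhs | rhs, transactions) / support(lhs, transactions)

    print(support({"a", "b"}, transactions))       # 0.5
    print(confidence({"a"}, {"b"}, transactions))  # ~0.667
    ```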

    Approximately Minwise Independence with Twisted Tabulation

    A random hash function $h$ is $\varepsilon$-minwise if for any set $S$ with $|S|=n$ and any element $x\in S$, $\Pr[h(x)=\min h(S)]=(1\pm\varepsilon)/n$. Minwise hash functions with low bias $\varepsilon$ have widespread applications within similarity estimation. Hashing from a universe $[u]$, the twisted tabulation hashing of Pătrașcu and Thorup [SODA'13] makes $c=O(1)$ lookups in tables of size $u^{1/c}$. Twisted tabulation was invented to get good concentration for hashing-based sampling. Here we show that twisted tabulation yields $\tilde O(1/u^{1/c})$-minwise hashing. In the classic independence paradigm of Wegman and Carter [FOCS'79], $\tilde O(1/u^{1/c})$-minwise hashing requires $\Omega(\log u)$-independence [Indyk SODA'99]. Pătrașcu and Thorup [STOC'11] had shown that simple tabulation, using the same space and lookups, yields $\tilde O(1/n^{1/c})$-minwise independence, which is good for large sets but useless for small sets. Our analysis uses some of the same methods, but is much cleaner, bypassing a complicated induction argument. Comment: To appear in Proceedings of SWAT 201
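
    To make the $\varepsilon$-minwise definition concrete, the following experiment estimates $\Pr[h(x)=\min h(S)]$ empirically and checks that it is close to $1/n$. This is a sketch only: a salted BLAKE2b hash stands in for the hash family, where the paper would use twisted tabulation.

    ```python
    # Empirically estimate Pr[h(x) = min h(S)] for a fixed x in S.
    import hashlib

    def h(x, seed):
        # Stand-in hash family: salted BLAKE2b, truncated to 64 bits.
        data = f"{seed}:{x}".encode()
        return int.from_bytes(hashlib.blake2b(data).digest()[:8], "big")

    S = list(range(100))  # |S| = n = 100
    x = 7
    trials = 5000
    hits = sum(h(x, t) == min(h(y, t) for y in S) for t in range(trials))
    print(hits / trials)  # should be close to 1/n = 0.01 for a low-bias family
    ```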

    Optimal lower bounds for locality sensitive hashing (except when q is tiny)

    We study lower bounds for Locality Sensitive Hashing (LSH) in the strongest setting: point sets in {0,1}^d under the Hamming distance. Recall that here H is said to be an (r, cr, p, q)-sensitive hash family if all pairs x, y in {0,1}^d with dist(x,y) at most r have probability at least p of collision under a randomly chosen h in H, whereas all pairs x, y in {0,1}^d with dist(x,y) at least cr have probability at most q of collision. Typically, one considers d tending to infinity, with c fixed and q bounded away from 0. For its applications to approximate nearest neighbor search in high dimensions, the quality of an LSH family H is governed by how small its "rho parameter" rho = ln(1/p)/ln(1/q) is as a function of the parameter c. The seminal paper of Indyk and Motwani showed that for each c, the extremely simple family H = {x -> x_i : i in [d]} achieves rho at most 1/c. The only known lower bound, due to Motwani, Naor, and Panigrahy, is that rho must be at least 0.46/c (minus o_d(1)). In this paper we show an optimal lower bound: rho must be at least 1/c (minus o_d(1)). This lower bound for Hamming space yields a lower bound of 1/c^2 for Euclidean space (or the unit sphere) and 1/c for the Jaccard distance on sets; both of these match known upper bounds. Our proof is simple; the essence is that the noise stability of a boolean function at e^{-t} is a log-convex function of t. Comment: 9 pages + abstract and references
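
    As a worked example of the rho parameter, the sketch below (with illustrative values of d, r, and c) evaluates the bit-sampling family H = {x -> x_i : i in [d]}, for which a pair at Hamming distance dist collides with probability exactly 1 - dist/d, and confirms rho <= 1/c:

    ```python
    # rho = ln(1/p)/ln(1/q) for the Indyk-Motwani bit-sampling family.
    import math

    d, r, c = 1000, 10, 2.0
    p = 1 - r / d      # collision probability at distance <= r
    q = 1 - c * r / d  # collision probability at distance >= cr
    rho = math.log(1 / p) / math.log(1 / q)
    print(rho, 1 / c)  # ~0.497 <= 0.5, matching the rho <= 1/c upper bound
    ```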

    A Memory-Efficient Sketch Method for Estimating High Similarities in Streaming Sets

    Estimating set similarity and detecting highly similar sets are fundamental problems in areas such as databases, machine learning, and information retrieval. MinHash is a well-known technique for approximating the Jaccard similarity of sets and has been successfully used for many applications such as similarity search and large-scale learning. Its two compressed versions, b-bit MinHash and Odd Sketch, can significantly reduce the memory usage of the original MinHash method, especially for estimating high similarities (i.e., similarities around 1). Although MinHash can be applied to static sets as well as streaming sets, whose elements arrive in a streaming fashion and whose cardinality is unknown or even infinite, b-bit MinHash and Odd Sketch unfortunately fail to deal with streaming data. To solve this problem, we design a memory-efficient sketch method, MaxLogHash, to accurately estimate Jaccard similarities in streaming sets. Compared to MinHash, our method uses smaller registers (fewer than 7 bits each) to build a compact sketch for each set. We also provide a simple yet accurate estimator for inferring the Jaccard similarity from MaxLogHash sketches. In addition, we derive formulas for bounding the estimation error and determine the smallest necessary memory usage (i.e., the number of registers used for a MaxLogHash sketch) for the desired accuracy. We conduct experiments on a variety of datasets, and the results show that MaxLogHash is about 5 times more memory-efficient than MinHash with the same accuracy and computational cost for estimating high similarities.
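
    For orientation, here is a minimal sketch of the uncompressed MinHash baseline that MaxLogHash improves on (this is not the MaxLogHash estimator; the register count K and the salted hash are illustrative). Each register keeps the minimum hash value seen in one pass over the stream, and the fraction of agreeing registers between two sketches estimates the Jaccard similarity:

    ```python
    # One-pass MinHash sketches; fraction of matching registers ~ Jaccard.
    import hashlib

    K = 128  # registers; MaxLogHash would shrink each to fewer than 7 bits

    def h(x, seed):
        data = f"{seed}:{x}".encode()
        return int.from_bytes(hashlib.blake2b(data).digest()[:8], "big")

    def sketch(stream):
        regs = [float("inf")] * K
        for e in stream:  # single pass; cardinality need not be known
            for i in range(K):
                regs[i] = min(regs[i], h(e, i))
        return regs

    def jaccard_estimate(r1, r2):
        return sum(a == b for a, b in zip(r1, r2)) / K

    A = sketch(range(0, 1000))
    B = sketch(range(100, 1100))
    print(jaccard_estimate(A, B))  # true Jaccard = 900/1100 ~ 0.818
    ```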

    Finding Associations and Computing Similarity via Biased Pair Sampling

    This version is ***superseded*** by a full version that can be found at http://www.itu.dk/people/pagh/papers/mining-jour.pdf, which contains stronger theoretical results and fixes a mistake in the reporting of experiments. Abstract: Sampling-based methods have previously been proposed for the problem of finding interesting associations in data, even for low-support items. While these methods do not guarantee precise results, they can be vastly more efficient than approaches that rely on exact counting. However, for many similarity measures no such methods have been known. In this paper we show how a wide variety of measures can be supported by a simple biased sampling method. The method also extends to finding high-confidence association rules. We demonstrate theoretically that our method is superior to exact methods when the threshold for "interesting similarity/confidence" is above the average pairwise similarity/confidence, and the average support is not too low. Our method is particularly good when transactions contain many items. We confirm in experiments on standard association mining benchmarks that this gives a significant speedup on real data sets (sometimes much larger than the theoretical guarantees). Reductions in computation time of over an order of magnitude, and significant savings in space, are observed. Comment: This is an extended version of a paper that appeared at the IEEE International Conference on Data Mining, 2009. The conference version is (c) 2009 IEEE.
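
    To illustrate the flavor of biased pair sampling (a toy scheme for intuition, not the paper's method): drawing a random transaction and then a random pair of items inside it skews the sample toward pairs that co-occur often, so high-similarity candidates surface without exact counting over all pairs.

    ```python
    # Toy biased pair sampler over a small transaction database.
    import random
    from collections import Counter

    transactions = [
        {"a", "b", "c"}, {"a", "b"}, {"a", "c", "d"}, {"b", "c"}, {"a", "b", "d"},
    ]

    def sample_pairs(transactions, k):
        counts = Counter()
        for _ in range(k):
            t = random.choice(transactions)                     # random transaction
            pair = tuple(sorted(random.sample(sorted(t), 2)))   # random pair in it
            counts[pair] += 1
        return counts

    print(sample_pairs(transactions, 10000).most_common(3))
    # frequently co-occurring pairs such as ('a', 'b') dominate the sample
    ```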

    Interval Selection in the Streaming Model

    A set of intervals is independent when the intervals are pairwise disjoint. In the interval selection problem we are given a set $\mathbb{I}$ of intervals and we want to find an independent subset of intervals of largest cardinality. Let $\alpha(\mathbb{I})$ denote the cardinality of an optimal solution. We discuss the estimation of $\alpha(\mathbb{I})$ in the streaming model, where we only have one-time, sequential access to the input intervals, the endpoints of the intervals lie in $\{1,\dots,n\}$, and the amount of memory is constrained. For intervals of different sizes, we provide an algorithm in the data stream model that computes an estimate $\hat\alpha$ of $\alpha(\mathbb{I})$ that, with probability at least $2/3$, satisfies $\tfrac{1}{2}(1-\varepsilon)\alpha(\mathbb{I}) \le \hat\alpha \le \alpha(\mathbb{I})$. For same-length intervals, we provide another algorithm in the data stream model that computes an estimate $\hat\alpha$ of $\alpha(\mathbb{I})$ that, with probability at least $2/3$, satisfies $\tfrac{2}{3}(1-\varepsilon)\alpha(\mathbb{I}) \le \hat\alpha \le \alpha(\mathbb{I})$. The space used by our algorithms is bounded by a polynomial in $\varepsilon^{-1}$ and $\log n$. We also show that no better estimations can be achieved using $o(n)$ bits of storage. We also develop new, approximate solutions to the interval selection problem, where we want to report a feasible solution, that use $O(\alpha(\mathbb{I}))$ space. Our algorithms for the interval selection problem match the optimal results by Emek, Halldórsson and Rosén [Space-Constrained Interval Selection, ICALP 2012], but are much simpler. Comment: Minor correction
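
    For contrast with these streaming estimators, the offline version of interval selection is solved exactly by the classic greedy that scans intervals by right endpoint; a minimal sketch with an illustrative input:

    ```python
    # Greedy maximum independent set of intervals (offline, exact).
    def max_independent(intervals):
        best, last_end = [], float("-inf")
        for lo, hi in sorted(intervals, key=lambda iv: iv[1]):
            if lo > last_end:  # disjoint from every interval kept so far
                best.append((lo, hi))
                last_end = hi
        return best

    print(len(max_independent([(1, 3), (2, 5), (4, 7), (6, 9), (8, 10)])))  # 3
    ```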