    A Tight Upper Bound on the Number of Candidate Patterns

    In the context of mining for frequent patterns using the standard levelwise algorithm, the following question arises: given the current level and the current set of frequent patterns, what is the maximal number of candidate patterns that can be generated on the next level? We answer this question by providing a tight upper bound, derived from a combinatorial result from the sixties by Kruskal and Katona. Our result is useful to reduce the number of database scans.
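
    As a rough illustration of the kind of bound involved (a sketch under assumptions, not necessarily the paper's exact formula), the following Python snippet computes a Kruskal-Katona-style bound: the number of frequent k-patterns is written in its canonical k-cascade (binomial) representation, and the corresponding binomials one level up are summed to bound the number of candidate (k+1)-patterns. The function names and the example are hypothetical.

        from math import comb

        def cascade_representation(m, k):
            # Greedy canonical k-cascade representation of m:
            # m = C(a_k, k) + C(a_{k-1}, k-1) + ... with a_k > a_{k-1} > ...
            rep = []
            while m > 0 and k > 0:
                a = k
                while comb(a + 1, k) <= m:
                    a += 1
                rep.append((a, k))
                m -= comb(a, k)
                k -= 1
            return rep

        def candidate_upper_bound(num_frequent, k):
            # Assumed Kruskal-Katona-style bound on the number of candidate
            # (k+1)-patterns, given num_frequent frequent k-patterns.
            return sum(comb(a, j + 1) for a, j in cascade_representation(num_frequent, k))

        # Example: 10 frequent 2-itemsets allow at most C(5, 3) = 10 candidate 3-itemsets.
        print(candidate_upper_bound(10, 2))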

    On the Minimum/Stopping Distance of Array Low-Density Parity-Check Codes

    In this work, we study the minimum/stopping distance of array low-density parity-check (LDPC) codes. An array LDPC code is a quasi-cyclic LDPC code specified by two integers q and m, where q is an odd prime and m <= q. In the literature, the minimum/stopping distance of these codes (denoted by d(q,m) and h(q,m), respectively) has been thoroughly studied for m <= 5. Both exact results, for small values of q and m, and general (i.e., independent of q) bounds have been established. For m=6, the best known minimum distance upper bound, derived by Mittelholzer (IEEE Int. Symp. Inf. Theory, Jun./Jul. 2002), is d(q,6) <= 32. In this work, we derive an improved upper bound of d(q,6) <= 20 and a new upper bound d(q,7) <= 24 by using the concept of a template support matrix of a codeword/stopping set. The bounds are tight with high probability in the sense that we have not been able to find codewords of strictly lower weight for several values of q using a minimum distance probabilistic algorithm. Finally, we provide new specific minimum/stopping distance results for m <= 7 and low-to-moderate values of q <= 79. Comment: To appear in IEEE Trans. Inf. Theory. The material in this paper was presented in part at the 2014 IEEE International Symposium on Information Theory, Honolulu, HI, June/July 2014.
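
    For context, the sketch below builds the parity-check matrix of an array LDPC code in the standard form assumed here, H(q, m) = [P^(i*j)] for i = 0, ..., m-1 and j = 0, ..., q-1, where P is the q x q cyclic shift matrix. It only constructs the code family under study and does not implement the paper's template-support-matrix bounding technique.

        import numpy as np

        def array_ldpc_parity_check(q, m):
            # Assumed standard construction: an m x q block matrix of q x q
            # circulant permutation matrices P^(i*j), with P a single cyclic shift.
            assert m <= q
            P = np.roll(np.eye(q, dtype=int), 1, axis=1)
            blocks = [[np.linalg.matrix_power(P, (i * j) % q) for j in range(q)]
                      for i in range(m)]
            return np.block(blocks)

        H = array_ldpc_parity_check(q=5, m=3)
        print(H.shape)  # (15, 25)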

    Efficient Subgraph Similarity Search on Large Probabilistic Graph Databases

    Many studies have been conducted on seeking efficient solutions for subgraph similarity search over certain (deterministic) graphs, due to its wide application in many fields, including bioinformatics, social network analysis, and Resource Description Framework (RDF) data management. All these works assume that the underlying data are certain. However, in reality, graphs are often noisy and uncertain due to various factors, such as errors in data extraction, inconsistencies in data integration, and privacy-preserving purposes. Therefore, in this paper, we study subgraph similarity search on large probabilistic graph databases. Different from previous works, which assume that edges in an uncertain graph are independent of each other, we study uncertain graphs where the edges' occurrences are correlated. We formally prove that subgraph similarity search over probabilistic graphs is #P-complete; thus, we employ a filter-and-verify framework to speed up the search. In the filtering phase, we develop tight lower and upper bounds of the subgraph similarity probability based on a probabilistic matrix index, PMI. PMI is composed of discriminative subgraph features associated with tight lower and upper bounds of the subgraph isomorphism probability. Based on PMI, we can prune a large number of probabilistic graphs and maximize the pruning capability. During the verification phase, we develop an efficient sampling algorithm to validate the remaining candidates. The efficiency of our proposed solutions has been verified through extensive experiments. Comment: VLDB201
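
    As a toy illustration of verification by sampling (not the paper's algorithm, which handles correlated edges and similarity rather than exact containment), the sketch below estimates, under an independent-edge assumption, the probability that a possible world of an uncertain graph contains a subgraph isomorphic to a query; the networkx-based matcher and all names are illustrative choices.

        import random
        import networkx as nx
        from networkx.algorithms.isomorphism import GraphMatcher

        def estimate_containment_probability(uncertain_edges, query, trials=1000, seed=0):
            # Monte Carlo estimate: sample possible worlds by keeping each edge
            # independently with its probability, then test subgraph isomorphism.
            rng = random.Random(seed)
            hits = 0
            for _ in range(trials):
                world = nx.Graph()
                world.add_nodes_from({v for e in uncertain_edges for v in e})
                world.add_edges_from(e for e, p in uncertain_edges.items() if rng.random() < p)
                if GraphMatcher(world, query).subgraph_is_isomorphic():
                    hits += 1
            return hits / trials

        # Toy usage: a triangle query against a small uncertain graph.
        query = nx.complete_graph(3)
        uncertain = {(0, 1): 0.9, (1, 2): 0.8, (0, 2): 0.5, (2, 3): 0.7}
        print(estimate_containment_probability(uncertain, query))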

    Tight Bounds for the Cover Times of Random Walks with Heterogeneous Step Lengths

    Search patterns of randomly oriented steps of different lengths have been observed on all scales of the biological world, ranging from the microscopic to the ecological, including in protein motors, bacteria, T-cells, honeybees, marine predators, and more. Through different models, it has been demonstrated that adopting a variety in the magnitude of the step lengths can greatly improve the search efficiency. However, the precise connection between the search efficiency and the number of step lengths in the repertoire of the searcher has not been identified. Motivated by biological examples in one-dimensional terrains, a recent paper studied the best cover time on an n-node cycle that can be achieved by a random walk process that uses k step lengths. By tuning the lengths and corresponding probabilities, the authors showed that the best cover time is roughly n^(1+Θ(1/k)). While this bound is useful for large values of k, it is hardly informative for small k values, which are of interest in biology. In this paper, we provide a tight bound for the cover time of such a walk, for every integer k > 1. Specifically, up to lower-order polylogarithmic factors, the upper bound on the cover time is a polynomial in n of exponent 1 + 1/(2k−1). For k = 2, 3, 4, and 5 the exponent is thus 4/3, 6/5, 8/7, and 10/9, respectively. Informally, our result implies that, as long as the number of step lengths k is not too large, incorporating an additional step length into the repertoire of the process improves the cover time by a polynomial factor, but the extent of the improvement gradually decreases with k.
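
    A minimal simulation sketch of the process being analyzed: a random walk on an n-node cycle that mixes k step lengths. The particular lengths and probabilities below are illustrative and are not the optimal tuning derived in the paper.

        import random

        def cover_time(n, steps, probs, seed=0, max_moves=10_000_000):
            # Walk on an n-node cycle; each move picks a step length from `steps`
            # with the given probabilities and a uniformly random direction.
            # Returns the number of moves until every node has been visited.
            rng = random.Random(seed)
            pos, visited, moves = 0, {0}, 0
            while len(visited) < n and moves < max_moves:
                step = rng.choices(steps, probs)[0] * rng.choice((-1, 1))
                pos = (pos + step) % n
                visited.add(pos)
                moves += 1
            return moves

        # k = 2 step lengths on a 1000-node cycle (lengths and weights are arbitrary).
        print(cover_time(1000, steps=[1, 10], probs=[0.5, 0.5]))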

    Efficient Discovery of Association Rules and Frequent Itemsets through Sampling with Tight Performance Guarantees

    The tasks of extracting (top-K) Frequent Itemsets (FI's) and Association Rules (AR's) are fundamental primitives in data mining and database applications. Exact algorithms for these problems exist and are widely used, but their running time is hindered by the need of scanning the entire dataset, possibly multiple times. High-quality approximations of FI's and AR's are sufficient for most practical uses, and a number of recent works explored the application of sampling for fast discovery of approximate solutions to the problems. However, these works do not provide satisfactory performance guarantees on the quality of the approximation, due to the difficulty of bounding the probability of under- or over-sampling any one of an unknown number of frequent itemsets. In this work we circumvent this issue by applying the statistical concept of Vapnik-Chervonenkis (VC) dimension to develop a novel technique for providing tight bounds on the sample size that guarantees approximation within user-specified parameters. Our technique applies both to absolute and to relative approximations of (top-K) FI's and AR's. The resulting sample size is linearly dependent on the VC-dimension of a range space associated with the dataset to be mined. The main theoretical contribution of this work is a proof that the VC-dimension of this range space is upper bounded by an easy-to-compute characteristic quantity of the dataset, which we call the d-index: the maximum integer d such that the dataset contains at least d transactions of length at least d such that no one of them is a superset of or equal to another. We show that this bound is strict for a large class of datasets. Comment: 19 pages, 7 figures. A shorter version of this paper appeared in the proceedings of ECML PKDD 201
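
    To make the d-index definition concrete, here is a small sketch that, for simplicity, only removes duplicate transactions and ignores the no-superset condition; it can therefore only overestimate (upper-bound) the exact d-index defined above and is not the computation used in the paper.

        def d_index_upper_bound(transactions):
            # Largest d such that at least d distinct transactions have length >= d,
            # ignoring the superset condition from the exact definition.
            distinct = {frozenset(t) for t in transactions}
            lengths = sorted((len(t) for t in distinct), reverse=True)
            d = 0
            for i, length in enumerate(lengths, start=1):
                if length >= i:
                    d = i
            return d

        # Example dataset of 5 transactions.
        print(d_index_upper_bound([{1, 2, 3}, {1, 2}, {2, 3, 4}, {4, 5}, {1, 2, 3, 4}]))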