34,228 research outputs found

    Entropy, dimension and the Elton-Pajor Theorem

    Full text link
    The Vapnik-Chervonenkis dimension of a set K in R^n is the maximal dimension of the coordinate cube of a given size, which can be found in coordinate projections of K. We show that the VC dimension of a convex body governs its entropy. This has a number of consequences, including the optimal Elton's theorem and a uniform central limit theorem in the real valued case

    Efficient Discovery of Association Rules and Frequent Itemsets through Sampling with Tight Performance Guarantees

    Full text link
    The tasks of extracting (top-KK) Frequent Itemsets (FI's) and Association Rules (AR's) are fundamental primitives in data mining and database applications. Exact algorithms for these problems exist and are widely used, but their running time is hindered by the need of scanning the entire dataset, possibly multiple times. High quality approximations of FI's and AR's are sufficient for most practical uses, and a number of recent works explored the application of sampling for fast discovery of approximate solutions to the problems. However, these works do not provide satisfactory performance guarantees on the quality of the approximation, due to the difficulty of bounding the probability of under- or over-sampling any one of an unknown number of frequent itemsets. In this work we circumvent this issue by applying the statistical concept of \emph{Vapnik-Chervonenkis (VC) dimension} to develop a novel technique for providing tight bounds on the sample size that guarantees approximation within user-specified parameters. Our technique applies both to absolute and to relative approximations of (top-KK) FI's and AR's. The resulting sample size is linearly dependent on the VC-dimension of a range space associated with the dataset to be mined. The main theoretical contribution of this work is a proof that the VC-dimension of this range space is upper bounded by an easy-to-compute characteristic quantity of the dataset which we call \emph{d-index}, and is the maximum integer dd such that the dataset contains at least dd transactions of length at least dd such that no one of them is a superset of or equal to another. We show that this bound is strict for a large class of datasets.Comment: 19 pages, 7 figures. A shorter version of this paper appeared in the proceedings of ECML PKDD 201

    Bounding Embeddings of VC Classes into Maximum Classes

    Full text link
    One of the earliest conjectures in computational learning theory-the Sample Compression conjecture-asserts that concept classes (equivalently set systems) admit compression schemes of size linear in their VC dimension. To-date this statement is known to be true for maximum classes---those that possess maximum cardinality for their VC dimension. The most promising approach to positively resolving the conjecture is by embedding general VC classes into maximum classes without super-linear increase to their VC dimensions, as such embeddings would extend the known compression schemes to all VC classes. We show that maximum classes can be characterised by a local-connectivity property of the graph obtained by viewing the class as a cubical complex. This geometric characterisation of maximum VC classes is applied to prove a negative embedding result which demonstrates VC-d classes that cannot be embedded in any maximum class of VC dimension lower than 2d. On the other hand, we show that every VC-d class C embeds in a VC-(d+D) maximum class where D is the deficiency of C, i.e., the difference between the cardinalities of a maximum VC-d class and of C. For VC-2 classes in binary n-cubes for 4 <= n <= 6, we give best possible results on embedding into maximum classes. For some special classes of Boolean functions, relationships with maximum classes are investigated. Finally we give a general recursive procedure for embedding VC-d classes into VC-(d+k) maximum classes for smallest k.Comment: 22 pages, 2 figure

    String Matching: Communication, Circuits, and Learning

    Get PDF
    String matching is the problem of deciding whether a given n-bit string contains a given k-bit pattern. We study the complexity of this problem in three settings. - Communication complexity. For small k, we provide near-optimal upper and lower bounds on the communication complexity of string matching. For large k, our bounds leave open an exponential gap; we exhibit some evidence for the existence of a better protocol. - Circuit complexity. We present several upper and lower bounds on the size of circuits with threshold and DeMorgan gates solving the string matching problem. Similarly to the above, our bounds are near-optimal for small k. - Learning. We consider the problem of learning a hidden pattern of length at most k relative to the classifier that assigns 1 to every string that contains the pattern. We prove optimal bounds on the VC dimension and sample complexity of this problem

    Space and Time Efficient Parallel Graph Decomposition, Clustering, and Diameter Approximation

    Full text link
    We develop a novel parallel decomposition strategy for unweighted, undirected graphs, based on growing disjoint connected clusters from batches of centers progressively selected from yet uncovered nodes. With respect to similar previous decompositions, our strategy exercises a tighter control on both the number of clusters and their maximum radius. We present two important applications of our parallel graph decomposition: (1) kk-center clustering approximation; and (2) diameter approximation. In both cases, we obtain algorithms which feature a polylogarithmic approximation factor and are amenable to a distributed implementation that is geared for massive (long-diameter) graphs. The total space needed for the computation is linear in the problem size, and the parallel depth is substantially sublinear in the diameter for graphs with low doubling dimension. To the best of our knowledge, ours are the first parallel approximations for these problems which achieve sub-diameter parallel time, for a relevant class of graphs, using only linear space. Besides the theoretical guarantees, our algorithms allow for a very simple implementation on clustered architectures: we report on extensive experiments which demonstrate their effectiveness and efficiency on large graphs as compared to alternative known approaches.Comment: 14 page
    • …
    corecore