Mining Top-K Frequent Itemsets Through Progressive Sampling
We study the use of sampling for efficiently mining the top-K frequent
itemsets of cardinality at most w. To this purpose, we define an approximation
to the top-K frequent itemsets to be a family of itemsets which includes
(resp., excludes) all very frequent (resp., very infrequent) itemsets, together
with an estimate of these itemsets' frequencies with a bounded error. Our first
result is an upper bound on the sample size which guarantees that the top-K
frequent itemsets mined from a random sample of that size approximate the
actual top-K frequent itemsets, with probability larger than a specified value.
We show that the upper bound is asymptotically tight when w is constant. Our
main algorithmic contribution is a progressive sampling approach, combined with
suitable stopping conditions, which on appropriate inputs is able to extract
approximate top-K frequent itemsets from samples whose sizes are smaller than
the general upper bound. In order to test the stopping conditions, this
approach maintains the frequency of all itemsets encountered, which is
practical only for small w. However, we show how this problem can be mitigated
by using a variation of Bloom filters. A number of experiments conducted on
both synthetic and real benchmark datasets show that using samples
substantially smaller than the original dataset (i.e., of size defined by the
upper bound or reached through the progressive sampling approach) makes it
possible to approximate the actual top-K frequent itemsets with accuracy much
higher than what is analytically proved.
Comment: 16 pages, 2 figures, accepted for presentation at ECML PKDD 2010 and
publication in the ECML PKDD 2010 special issue of the Data Mining and
Knowledge Discovery journal
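As a rough illustration of the sampling idea in the abstract above, the following Python sketch mines the top-K itemsets of cardinality at most w from a uniform random sample drawn with replacement. The function names, the brute-force counting, and the caller-supplied `sample_size` are assumptions made for this example; the paper's sample-size bound, stopping conditions, and Bloom-filter variant are not reproduced here.

```python
import random
from collections import Counter
from itertools import combinations


def top_k_itemsets(transactions, k, w):
    """Return the k most frequent itemsets of cardinality <= w with their frequencies.

    Brute-force counting: every subset of size 1..w of every transaction is counted,
    which is only practical for small w.
    """
    counts = Counter()
    for transaction in transactions:
        items = sorted(set(transaction))
        for size in range(1, w + 1):
            for itemset in combinations(items, size):
                counts[itemset] += 1
    n = len(transactions)
    # Sort by decreasing count; break ties by itemset so the output is deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(itemset, count / n) for itemset, count in ranked[:k]]


def sampled_top_k(transactions, k, w, sample_size, seed=0):
    """Estimate the top-k itemsets from a random sample (hypothetical helper).

    sample_size is chosen by the caller; it is NOT derived from the paper's bound.
    """
    rng = random.Random(seed)
    sample = [rng.choice(transactions) for _ in range(sample_size)]
    return top_k_itemsets(sample, k, w)


if __name__ == "__main__":
    data = [["a", "b"], ["a", "b"], ["a", "c"], ["b"]]
    print(top_k_itemsets(data, 3, 2))
    print(sampled_top_k(data, 3, 2, sample_size=100))
```

Comparing the two printed lists on a larger dataset gives a feel for how quickly the sampled frequencies approach the exact ones as `sample_size` grows.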
Optimal Reconstruction of a Sequence From its Probes
An important combinatorial problem, motivated by DNA sequencing in molecular biology, is the reconstruction of a sequence over a small finite alphabet from the collection of its probes (the sequence spectrum), obtained by sliding a fixed sampling pattern over the sequence. Such construction is required for Sequencing-by-Hybridization (SBH), a novel DNA sequencing technique based on an array (SBH chip) of short nucleotide sequences (probes). Once the sequence spectrum is biochemically obtained, a combinatorial method is used to reconstruct the DNA sequence from its spectrum. Since technology limits the number of probes on the SBH chip, a challenging combinatorial question is the design of a smallest set of probes that can sequence an arbitrary DNA string of a given length. We present in this work a novel probe design, crucially based on the use of universal bases (bases that bind to any nucleotide [LB94]) that drastically improves the performance of the SBH process and asymptotically appr..
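To make the notion of a sequence spectrum concrete, here is a minimal Python sketch that slides a fixed sampling pattern over a sequence and collects the resulting probes. The pattern encoding ('1' for a sampled position, '0' for a position covered by a universal base) and the '*' placeholder are assumptions chosen for this illustration, not the notation or probe design of the paper.

```python
def spectrum(sequence, pattern):
    """Return the set of probes obtained by sliding `pattern` over `sequence`.

    pattern is a string over {'1', '0'}: '1' keeps the nucleotide at that offset,
    '0' stands for a universal base and is recorded as '*'.
    """
    m = len(pattern)
    probes = set()
    for start in range(len(sequence) - m + 1):
        window = sequence[start:start + m]
        probe = "".join(base if keep == "1" else "*"
                        for base, keep in zip(window, pattern))
        probes.add(probe)
    return probes


if __name__ == "__main__":
    # Classical ungapped probes of length 3.
    print(sorted(spectrum("ACGTAC", "111")))   # ['ACG', 'CGT', 'GTA', 'TAC']
    # Gapped probes with one universal base at the third position.
    print(sorted(spectrum("ACGTAC", "1101")))  # ['AC*T', 'CG*A', 'GT*C']
```

The reconstruction step, which recovers the sequence from such a spectrum, is the hard part of the problem and is not attempted in this sketch.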
Dynamic Packet Routing on Arrays with Bounded Buffers
We study the performance of packet routing on arrays (or meshes) with bounded buffers in the routing switches, assuming that new packets are continuously inserted at all the nodes. We give the first routing algorithm on this topology that is stable under an injection rate within a constant factor of the hardware bandwidth. Unlike previous results, our algorithm does not require the global synchronization of the insertion times or the retraction and reinsertion of excessively delayed messages, and our analysis holds for a broad range of stochastic packet-generation distributions. This result represents a new application of a general technique for the design and analysis of dynamic algorithms that we first presented in [Broder et al., FOCS 96, pp. 390-399].
Distributed graph diameter approximation
We present an algorithm for approximating the diameter of massive weighted undirected graphs on distributed platforms supporting a MapReduce-like abstraction. In order to be efficient in terms of both time and space, our algorithm is based on a decomposition strategy which partitions the graph into disjoint clusters of bounded radius. Theoretically, our algorithm uses linear space and yields a polylogarithmic approximation guarantee; most importantly, for a large family of graphs, it features a round complexity asymptotically smaller than the one exhibited by a natural approximation algorithm based on the state-of-the-art Δ-stepping SSSP algorithm, which is its only practical, linear-space competitor in the distributed setting. We complement our theoretical findings with a proof-of-concept experimental analysis on large benchmark graphs, which suggests that our algorithm may attain substantial improvements in terms of running time compared to the aforementioned competitor, while featuring, in practice, a similar approximation ratio.
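The cluster-based idea can be sketched sequentially in Python as follows. This is only an illustration under stated assumptions: the greedy ball-growing decomposition, the function names, and the `radius` parameter are choices made for this example, the graph is assumed connected with comparable node labels, and nothing here reproduces the authors' MapReduce algorithm or its approximation guarantee. The sketch partitions the graph into disjoint clusters of radius at most `radius`, builds a quotient graph on the cluster centers, and returns the quotient diameter plus twice the largest cluster radius, which is an upper bound on the true diameter.

```python
import heapq


def dijkstra(adj, source, allowed=None, limit=float("inf")):
    """Distances from source, restricted to `allowed` nodes and to paths of length <= limit."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    done = set()
    while heap:
        d, u = heapq.heappop(heap)
        if u in done:
            continue
        done.add(u)
        for v, w in adj[u]:
            nd = d + w
            if nd > limit or (allowed is not None and v not in allowed):
                continue
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist


def cluster_diameter_bound(adj, radius):
    """Upper bound on the diameter of a connected weighted undirected graph.

    adj maps each node to a list of (neighbor, weight) pairs, with every edge
    listed in both directions.
    """
    uncovered = set(adj)
    center_of, to_center = {}, {}
    for node in sorted(adj):  # deterministic choice of cluster centers
        if node not in uncovered:
            continue
        ball = dijkstra(adj, node, allowed=uncovered, limit=radius)
        for v, d in ball.items():
            center_of[v], to_center[v] = node, d
        uncovered.difference_update(ball)
    # Quotient graph: one node per center; each inter-cluster edge becomes a
    # center-to-center edge whose weight is the length of a real path.
    quotient = {c: {} for c in set(center_of.values())}
    for u in adj:
        for v, w in adj[u]:
            cu, cv = center_of[u], center_of[v]
            if cu != cv:
                cost = to_center[u] + w + to_center[v]
                if cost < quotient[cu].get(cv, float("inf")):
                    quotient[cu][cv] = cost
    q_adj = {c: list(nbrs.items()) for c, nbrs in quotient.items()}
    q_diam = max(max(dijkstra(q_adj, c).values()) for c in q_adj)
    return q_diam + 2 * max(to_center.values())


if __name__ == "__main__":
    # Unit-weight path 0-1-2-3-4, whose true diameter is 4.
    path = {i: [] for i in range(5)}
    for i in range(4):
        path[i].append((i + 1, 1))
        path[i + 1].append((i, 1))
    print(cluster_diameter_bound(path, radius=0))  # 4.0 (every node is its own cluster)
    print(cluster_diameter_bound(path, radius=1))  # 6.0
```

Larger values of `radius` shrink the quotient graph, which is where a distributed implementation would save rounds, at the price of a looser estimate.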