63 research outputs found

    Mining Top-K Frequent Itemsets Through Progressive Sampling

    We study the use of sampling for efficiently mining the top-K frequent itemsets of cardinality at most w. To this end, we define an approximation to the top-K frequent itemsets to be a family of itemsets which includes (resp., excludes) all very frequent (resp., very infrequent) itemsets, together with an estimate of these itemsets' frequencies with a bounded error. Our first result is an upper bound on the sample size which guarantees that the top-K frequent itemsets mined from a random sample of that size approximate the actual top-K frequent itemsets, with probability larger than a specified value. We show that the upper bound is asymptotically tight when w is constant. Our main algorithmic contribution is a progressive sampling approach, combined with suitable stopping conditions, which on appropriate inputs is able to extract approximate top-K frequent itemsets from samples whose sizes are smaller than the general upper bound. In order to test the stopping conditions, this approach maintains the frequency of all itemsets encountered, which is practical only for small w. However, we show how this problem can be mitigated by using a variation of Bloom filters. A number of experiments conducted on both synthetic and real benchmark datasets show that samples substantially smaller than the original dataset (i.e., of the size given by the upper bound or reached through the progressive sampling approach) suffice to approximate the actual top-K frequent itemsets with accuracy much higher than what is analytically proven.
    Comment: 16 pages, 2 figures, accepted for presentation at ECML PKDD 2010 and publication in the ECML PKDD 2010 special issue of the Data Mining and Knowledge Discovery journal
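
    A minimal Python sketch of the progressive sampling loop described above, assuming transactions are given as a list of item collections; the geometric sample schedule, the eps-based stopping rule, and all function names are illustrative choices, not the paper's actual stopping conditions:

        import random
        from collections import Counter
        from itertools import combinations

        def topk_from_sample(sample, k, w):
            # Estimate itemset frequencies from a sample of transactions and
            # keep the K itemsets (of cardinality <= w) with highest frequency.
            counts = Counter()
            for txn in sample:
                for size in range(1, w + 1):
                    counts.update(combinations(sorted(txn), size))
            n = len(sample)
            return {s: c / n for s, c in counts.most_common(k)}

        def progressive_topk(dataset, k, w, eps=0.01, start=1000, growth=2):
            # Grow the sample geometrically and stop once two consecutive
            # samples agree on the top-K itemsets, with frequency estimates
            # within eps (illustrative rule; the paper's conditions differ).
            size, prev = start, None
            while size <= len(dataset):
                cur = topk_from_sample(random.sample(dataset, size), k, w)
                if prev is not None and cur.keys() == prev.keys() and \
                   all(abs(cur[s] - prev[s]) <= eps for s in cur):
                    return cur
                prev, size = cur, size * growth
            return topk_from_sample(dataset, k, w)  # fall back to full data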

    A review of hydrofluoric acid burn management

    No full text

    Optimal Reconstruction of a Sequence From its Probes

    No full text
    An important combinatorial problem, motivated by DNA sequencing in molecular biology, is the reconstruction of a sequence over a small finite alphabet from the collection of its probes (the sequence spectrum), obtained by sliding a fixed sampling pattern over the sequence. Such a reconstruction is required for Sequencing-by-Hybridization (SBH), a novel DNA sequencing technique based on an array (SBH chip) of short nucleotide sequences (probes). Once the sequence spectrum is biochemically obtained, a combinatorial method is used to reconstruct the DNA sequence from its spectrum. Since technology limits the number of probes on the SBH chip, a challenging combinatorial question is the design of a smallest set of probes that can sequence an arbitrary DNA string of a given length. We present in this work a novel probe design, crucially based on the use of universal bases (bases that bind to any nucleotide [LB94]), that drastically improves the performance of the SBH process and asymptotically appr..
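
    For intuition, the classical reconstruction step can be phrased as an Eulerian path in the de Bruijn graph of the probes. The following Python sketch uses plain solid probes (a fixed contiguous window) rather than the universal-base patterns of the paper, and assumes the spectrum admits a unique reconstruction:

        from collections import defaultdict

        def spectrum(seq, k):
            # Slide a solid length-k window over the sequence (classical
            # probes, not the universal-base patterns of the paper).
            return [seq[i:i + k] for i in range(len(seq) - k + 1)]

        def reconstruct(probes):
            # Rebuild the sequence as an Eulerian path in the de Bruijn graph
            # whose edges are the probes (Hierholzer's algorithm).
            graph = defaultdict(list)
            indeg = defaultdict(int)
            for p in probes:
                graph[p[:-1]].append(p[1:])
                indeg[p[1:]] += 1
            # The start node, if any, has out-degree exceeding in-degree.
            start = next((v for v in graph if len(graph[v]) > indeg[v]),
                         next(iter(graph)))
            stack, path = [start], []
            while stack:
                v = stack[-1]
                if graph[v]:
                    stack.append(graph[v].pop())
                else:
                    path.append(stack.pop())
            path.reverse()
            return path[0] + "".join(v[-1] for v in path[1:])

        # e.g. reconstruct(spectrum("ACGTACGGA", 4)) recovers the string
        # whenever the spectrum determines it uniquely.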

    Dynamic Packet Routing on Arrays with Bounded Buffers

    No full text
    We study the performance of packet routing on arrays (or meshes) with bounded buffers in the routing switches, assuming that new packets are continuously inserted at all the nodes. We give the first routing algorithm on this topology that is stable under an injection rate within a constant factor of the hardware bandwidth. Unlike previous results, our algorithm does not require the global synchronization of the insertion times or the retraction and reinsertion of excessively delayed messages, and our analysis holds for a broad range of stochastic packet-generation distributions. This result represents a new application of a general technique for the design and analysis of dynamic algorithms that we first presented in [Broder et al., FOCS 96, pp. 390-399].
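
    The following Python sketch is a toy synchronous simulation of the setting only (a 1-D array with bounded FIFO buffers and Bernoulli packet injection); the greedy forwarding rule and all parameters are illustrative assumptions and do not reproduce the paper's stable algorithm or its analysis:

        import random
        from collections import deque

        def simulate_line(n=32, buf=4, rate=0.05, steps=10000):
            # Each node keeps a bounded FIFO buffer and, in each step, tries
            # to forward its head packet one hop toward its destination.
            # This greedy rule only illustrates the model; it is NOT the
            # stable algorithm analyzed in the paper.
            queues = [deque() for _ in range(n)]
            delivered, total_delay = 0, 0
            for t in range(steps):
                moves = []
                for i in range(n):
                    if queues[i]:
                        dst, born = queues[i][0]
                        if dst == i:                  # packet has arrived
                            queues[i].popleft()
                            delivered += 1
                            total_delay += t - born
                        else:
                            moves.append((i, i + 1 if dst > i else i - 1))
                for i, nxt in moves:                  # apply hops in sync
                    if len(queues[nxt]) < buf:        # respect buffer bound
                        queues[nxt].append(queues[i].popleft())
                for i in range(n):                    # Bernoulli injection
                    if random.random() < rate and len(queues[i]) < buf:
                        queues[i].append((random.randrange(n), t))
            return total_delay / max(delivered, 1)    # average delivery delay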

    Distributed graph diameter approximation

    No full text
    We present an algorithm for approximating the diameter of massive weighted undirected graphs on distributed platforms supporting a MapReduce-like abstraction. In order to be efficient in terms of both time and space, our algorithm is based on a decomposition strategy which partitions the graph into disjoint clusters of bounded radius. Theoretically, our algorithm uses linear space and yields a polylogarithmic approximation guarantee; most importantly, for a large family of graphs, it features a round complexity asymptotically smaller than the one exhibited by a natural approximation algorithm based on the state-of-the-art Δ-stepping SSSP algorithm, which is its only practical, linear-space competitor in the distributed setting. We complement our theoretical findings with a proof-of-concept experimental analysis on large benchmark graphs, which suggests that our algorithm may attain substantial improvements in terms of running time compared to the aforementioned competitor, while featuring, in practice, a similar approximation ratio.
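
    A sequential Python sketch of the decomposition idea, assuming a connected unweighted graph given as an adjacency dict (the paper's algorithm handles weighted graphs and runs in MapReduce-like rounds); the scaling of the quotient diameter by the cluster radius is a coarse upper bound, not the paper's refined guarantee:

        import random
        from collections import deque

        def cluster_decomposition(adj, num_centers):
            # Partition the nodes by parallel BFS from random centers; return
            # each node's cluster and the maximum cluster radius. Sequential
            # sketch of the idea; assumes the graph is connected.
            centers = random.sample(list(adj), num_centers)
            cluster = {c: c for c in centers}
            dist = {c: 0 for c in centers}
            frontier = deque(centers)
            while frontier:
                u = frontier.popleft()
                for v in adj[u]:
                    if v not in cluster:
                        cluster[v] = cluster[u]
                        dist[v] = dist[u] + 1
                        frontier.append(v)
            return cluster, max(dist.values())

        def approx_diameter(adj, num_centers=16):
            # Contract each cluster to a node, compute the quotient diameter
            # by BFS, and add the radius slack: every quotient hop spans at
            # most 2*radius+1 original edges, so the result upper-bounds the
            # true diameter up to the decomposition's distortion.
            cluster, radius = cluster_decomposition(adj, num_centers)
            quotient = {c: set() for c in set(cluster.values())}
            for u in adj:
                for v in adj[u]:
                    if cluster[u] != cluster[v]:
                        quotient[cluster[u]].add(cluster[v])
            def ecc(s):
                d = {s: 0}
                q = deque([s])
                while q:
                    x = q.popleft()
                    for y in quotient[x]:
                        if y not in d:
                            d[y] = d[x] + 1
                            q.append(y)
                return max(d.values())
            dq = max(ecc(c) for c in quotient)
            return dq * (2 * radius + 1) + 2 * radius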