
9 research outputs found

Development of a Classification Rule Mining Framework by Using Temporal Pattern Extraction

Author: Hidenao Abe
Publication venue: 'IntechOpen'
Publication date: 21/01/2011

Mining Top-K Frequent Itemsets Through Progressive Sampling

Author: Andrea Pietracaprina, E Cohen, Eli Upfal, Fabio Vandin, J Wang, M Charikar, M Mitzenmacher, Matteo Riondato, RC-W Wong
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 27/06/2010

We study the use of sampling for efficiently mining the top-K frequent itemsets of cardinality at most w. To this end, we define an approximation to the top-K frequent itemsets to be a family of itemsets which includes (resp., excludes) all very frequent (resp., very infrequent) itemsets, together with an estimate of these itemsets' frequencies with a bounded error. Our first result is an upper bound on the sample size which guarantees that the top-K frequent itemsets mined from a random sample of that size approximate the actual top-K frequent itemsets, with probability larger than a specified value. We show that the upper bound is asymptotically tight when w is constant. Our main algorithmic contribution is a progressive sampling approach, combined with suitable stopping conditions, which on appropriate inputs is able to extract approximate top-K frequent itemsets from samples whose sizes are smaller than the general upper bound. In order to test the stopping conditions, this approach maintains the frequency of all itemsets encountered, which is practical only for small w. However, we show how this problem can be mitigated by using a variation of Bloom filters. A number of experiments conducted on both synthetic and real benchmark datasets show that using samples substantially smaller than the original dataset (i.e., of size defined by the upper bound or reached through the progressive sampling approach) makes it possible to approximate the actual top-K frequent itemsets with accuracy much higher than what is analytically proved. Comment: 16 pages, 2 figures, accepted for presentation at ECML PKDD 2010 and publication in the ECML PKDD 2010 special issue of the Data Mining and Knowledge Discovery journal
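The basic idea in the abstract above, mining the top-K itemsets from a random sample instead of the full dataset, can be illustrated with a minimal Python sketch. It is not the paper's algorithm: it uses a fixed, caller-chosen sample size and implements neither the sample-size bound, the progressive stopping conditions, nor the Bloom filter variation. The function names `top_k_itemsets` and `sampled_top_k` and the parameter `sample_size` are assumptions made for illustration.

```python
import random
from collections import Counter
from itertools import combinations


def top_k_itemsets(transactions, k, w):
    """Return the k most frequent itemsets of size <= w with relative frequencies."""
    counts = Counter()
    for t in transactions:
        items = sorted(set(t))  # canonical order so each itemset has one tuple form
        for size in range(1, w + 1):
            counts.update(combinations(items, size))
    n = len(transactions)
    # Rank by descending count; break ties lexicographically for determinism.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(itemset, c / n) for itemset, c in ranked[:k]]


def sampled_top_k(transactions, k, w, sample_size, seed=0):
    """Approximate the top-k itemsets from a uniform sample drawn with replacement.

    sample_size is chosen by the caller here (an assumption of this sketch);
    the paper derives it from a bound or reaches it progressively.
    """
    rng = random.Random(seed)
    sample = [rng.choice(transactions) for _ in range(sample_size)]
    return top_k_itemsets(sample, k, w)
```

Calling `sampled_top_k(transactions, k=10, w=2, sample_size=1000)` on a list of item lists returns estimated frequencies that can be compared against `top_k_itemsets` on the full data.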

arXiv.org e-Print Archive

Crossref

PRESS: A Novel Framework of Trajectory Compression in Road Networks

Author: Song Renchu, Sun Weiwei, Zheng Baihua, Zheng Yu
Publication date: 06/02/2014

Location data is becoming increasingly important. In this paper, we focus on trajectory data and propose a new framework, namely PRESS (Paralleled Road-Network-Based Trajectory Compression), to effectively compress trajectory data under road network constraints. Different from existing work, PRESS proposes a novel representation for trajectories that separates the spatial representation of a trajectory from the temporal representation, and proposes a Hybrid Spatial Compression (HSC) algorithm and an error Bounded Temporal Compression (BTC) algorithm to compress the spatial and temporal information of trajectories respectively. PRESS also supports common spatial-temporal queries without fully decompressing the data. In an extensive experimental study on a real trajectory dataset, PRESS significantly outperforms existing approaches in terms of saving storage cost of trajectory data with bounded errors. Comment: 27 pages, 17 figures

arXiv.org e-Print Archive

Crossref

Institutional Knowledge at Singapore Management University

A Synopsis Based Approach for Itemset Frequency Estimation over Massive Multi-Transaction Stream

Author: Cong G, Hai Z, Wang G, Ye J, Zhang Y
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 01/12/2021

Streams in which multiple transactions are associated with the same key are prevalent in practice, e.g., a customer has multiple shopping records arriving at different times. Itemset frequency estimation on such streams is very challenging since sampling based methods, such as the popularly used reservoir sampling, cannot be used. In this article, we propose a novel k-Minimum Value (KMV) synopsis based method to estimate the frequency of itemsets over multi-transaction streams. First, we extract the KMV synopses for each item from the stream. Then, we propose a novel estimator to estimate the frequency of an itemset over the KMV synopses. Compared with the existing estimator, our method is not only more accurate and efficient to calculate but also follows the downward-closure property. These properties enable the incorporation of our new estimator into existing frequent itemset mining (FIM) algorithms (e.g., FP-Growth) to mine frequent itemsets over multi-transaction streams. To demonstrate this, we implement a KMV synopsis based FIM algorithm by integrating our estimator into existing FIM algorithms, and we prove it is capable of guaranteeing the accuracy of FIM with a bounded size of KMV synopsis. Experimental results on massive streams show that our estimator can significantly improve the accuracy of both itemset frequency estimation and FIM compared to the existing estimators.
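The KMV synopsis mentioned in the abstract above can be illustrated with the classic k-Minimum Value distinct-count sketch. This is a minimal Python sketch of that standard building block only, not the itemset-frequency estimator the article proposes. The choice of SHA-256 as the hash and the names `hash01`, `kmv_synopsis`, and `estimate_distinct` are assumptions made for illustration.

```python
import hashlib
import heapq


def hash01(key):
    """Map a key to a pseudo-uniform value in [0, 1) (SHA-256 is an assumed choice)."""
    digest = hashlib.sha256(str(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def kmv_synopsis(keys, k):
    """Return the k smallest distinct hash values of the keys, in ascending order."""
    return heapq.nsmallest(k, {hash01(x) for x in keys})


def estimate_distinct(synopsis, k):
    """Estimate the number of distinct keys summarized by a KMV synopsis."""
    if len(synopsis) < k:
        # Fewer than k distinct keys were seen, so the synopsis holds all of them.
        return float(len(synopsis))
    # Standard KMV estimator: (k - 1) divided by the k-th smallest hash value.
    return (k - 1) / synopsis[-1]
```

In a per-item setting, one such synopsis would be kept for the keys (e.g., customers) whose transactions contain a given item; how synopses are combined into an itemset frequency is specific to the article and is not reproduced here.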

OPUS - University of Technology Sydney

Mining top-K frequent itemsets from data streams

Author: Ada Wai-Chee Fu, Raymond Chi-Wing Wong
Publication venue: 'Springer Science and Business Media LLC'

Crossref

Advanced pattern mining for complex data analysis

Author: Rong Jia
Publication venue: Deakin University, Faculty of Science and Technology, School of Information Technology
Publication date: 01/08/2012

The thesis researches a set of critical problems in data mining and proposes four advanced pattern mining algorithms to discover the most interesting and useful data patterns, highly relevant to the user’s application targets, from data represented in complex structures.

Deakin Research Online

New Fundamental Technologies in Data Mining

Publication venue: 'IntechOpen'
Publication date: 20/04/2021

The progress of data mining technology and its wide public popularity establish a need for a comprehensive text on the subject. The series of books entitled "Data Mining" addresses this need by presenting in-depth descriptions of novel mining algorithms and many useful applications. In addition to helping readers understand each section deeply, the two books present useful hints and strategies for solving problems in the following chapters. The contributing authors have highlighted many future research directions that will foster multi-disciplinary collaborations and hence will lead to significant development in the field of data mining.

Directory of Open Access Books (DOAB)

Mining top-K frequent itemsets from data streams (DOI 10.1007/s10618-006-0042-x)

Author: Raymond Chi-Wing Wong, Ada Wai-Chee Fu

Frequent pattern mining on data streams has recently attracted interest. However, it is not easy for users to determine a proper frequency threshold. It is more reasonable to ask users to set a bound on the result size. We study the problem of mining top K frequent itemsets in data streams. We introduce a method based on the Chernoff bound with a guarantee of the output quality and also a bound on the memory usage. We also propose an algorithm based on the Lossy Counting Algorithm. In most of the experiments on the two proposed algorithms, we obtain perfect solutions and the memory space occupied by our algorithms is very small. In addition, we propose adapted versions of these two algorithms to handle the case of mining the data in a sliding window. The experiments show that the results are accurate. Keywords: Data mining algorithm; Data stream; Top K frequent itemset mining; Sliding window; Chernoff bound; Probabilistic algorithm
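One of the two algorithms in the abstract above is described as based on the Lossy Counting Algorithm. As background, here is a minimal Python sketch of standard Lossy Counting for single items, with an error parameter `epsilon`. It is not the top-K itemset variant the paper proposes, and the function name `lossy_counting` and its return shape are assumptions made for illustration.

```python
import math


def lossy_counting(stream, epsilon):
    """Approximate item counts over a stream with error at most epsilon * n.

    Returns (counts, n), where counts maps each retained item to a stored
    count that undercounts its true count by at most epsilon * n.
    """
    width = math.ceil(1 / epsilon)  # number of stream elements per bucket
    entries = {}                    # item -> [stored count, maximum undercount]
    n = 0
    for item in stream:
        n += 1
        bucket = math.ceil(n / width)  # id of the current bucket
        if item in entries:
            entries[item][0] += 1
        else:
            # The item may have been seen and pruned in earlier buckets.
            entries[item] = [1, bucket - 1]
        if n % width == 0:
            # At a bucket boundary, drop entries that cannot be frequent.
            entries = {x: cd for x, cd in entries.items() if cd[0] + cd[1] > bucket}
    return {x: cd[0] for x, cd in entries.items()}, n
```

Sorting the returned counts in descending order and taking the first K entries gives a simple top-K report for single items; bounding the result size rather than fixing a threshold, as the abstract motivates, requires the additional machinery of the paper.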

CiteSeerX