
    Unified Framework for Data Mining using Frequent Model Tree

    Data mining is the science of discovering hidden patterns in data. Over the past years, a plethora of data mining algorithms has been developed to carry out various data mining tasks such as classification, clustering, association mining and regression. These methods are ad hoc in nature, and no unifying framework exists that unites all the data mining tasks. This study proposes such a framework, describing a data modelling technique that models data in a manner usable for all kinds of data mining tasks. The study proposes a novel algorithm, Frequent Model (FM)-Growth, based on the Frequent Pattern (FP)-Growth algorithm. The algorithm finds frequent patterns, or models, in the data; these models can then be used to carry out various data mining tasks such as classification and clustering. The advantage of these frequent models is that they can be used as is with any data mining task, irrespective of the nature of the task. The algorithm runs in two stages: in the first stage, the FM-tree is grown from the data, and in the second, the frequent models are extracted from the FM-tree. The accuracy of the proposed algorithm is high; however, searching for frequent models in high-volume, high-dimensional data is computationally expensive because every node of the tree must be traversed. The study suggests measures to improve the efficiency of the overall process using a dictionary data structure.
    Keywords: Data Mining, Frequent Pattern Recognition, Unified Framework, Classification, Clustering, FP-Growth tree
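
    The abstract's two-stage, FP-Growth-based process suggests a familiar tree construction. Below is a minimal sketch of that style of first stage, using the dictionary-backed child lookups the study recommends for efficiency; FM-Growth's model-extraction stage is not detailed in the abstract, so all names and structure here are illustrative assumptions rather than the paper's implementation.

```python
# Hypothetical sketch: growing an FP-style prefix tree over frequent items.
# Children are stored in a dict, so descending one level is a hash lookup
# rather than a scan over all of a node's children.
from collections import Counter

class Node:
    def __init__(self, item, parent):
        self.item = item
        self.parent = parent
        self.count = 1
        self.children = {}  # item -> Node

def build_tree(transactions, min_support):
    # Pass 1: count item frequencies and keep only frequent items.
    counts = Counter(item for t in transactions for item in t)
    frequent = {i for i, c in counts.items() if c >= min_support}
    root = Node(None, None)
    # Pass 2: insert each transaction's frequent items in a fixed order,
    # so shared prefixes share tree paths.
    for t in transactions:
        items = sorted((i for i in t if i in frequent),
                       key=lambda i: (-counts[i], i))
        node = root
        for item in items:
            child = node.children.get(item)
            if child is None:
                child = Node(item, node)
                node.children[item] = child
            else:
                child.count += 1
            node = child
    return root

transactions = [["a", "b", "c"], ["a", "b"], ["a", "c"], ["b", "c"]]
tree = build_tree(transactions, min_support=2)
print(sorted(tree.children))  # ['a', 'b'] for this toy data
```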

    Discovery of frequent episodes in event logs

    The lion's share of process mining research focuses on the discovery of end-to-end process models describing the characteristic behavior of observed cases. The notion of a process instance (i.e., the case) plays an important role in process mining. Pattern mining techniques (such as frequent itemset mining, association rule learning, sequence mining, and traditional episode mining) do not consider process instances. An episode is a collection of partially ordered events. In this paper, we present a new technique (and corresponding implementation) that discovers frequently occurring episodes in event logs, thereby exploiting the fact that events are associated with cases. Hence, the work can be positioned between process mining and pattern mining. Episode discovery has applications in, amongst others, discovering local patterns in complex processes and conformance checking based on partial orders. We also discover episode rules to predict behavior and to discover correlated behaviors in processes. We have developed a ProM plug-in that exploits efficient algorithms for the discovery of frequent episodes and episode rules. Experimental results based on real-life event logs demonstrate the feasibility and usefulness of the approach.
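
    As a rough illustration of the case-aware counting the paper exploits, the sketch below counts a length-2 serial episode ("A eventually followed by B") at most once per case rather than over sliding windows; the actual ProM plug-in handles arbitrary partially ordered episodes with more efficient algorithms, so this is an assumption-laden toy, not the authors' code.

```python
# Toy sketch (not the ProM plug-in): count the serial episode "A before B"
# once per case, exploiting the case identifier attached to each event.
from collections import defaultdict

def frequent_serial_pairs(log, min_support):
    """log: {case_id: [activity, ...] in time order} -> frequent (A, B) pairs."""
    support = defaultdict(int)
    for trace in log.values():
        seen = set()      # activities observed so far in this case
        counted = set()   # pairs already counted for this case
        for act in trace:
            for prior in seen:
                if (prior, act) not in counted:
                    support[(prior, act)] += 1
                    counted.add((prior, act))
            seen.add(act)
    return {pair: s for pair, s in support.items() if s >= min_support}

log = {
    "case1": ["register", "check", "pay"],
    "case2": ["register", "pay"],
    "case3": ["register", "check"],
}
print(frequent_serial_pairs(log, min_support=2))
# {('register', 'check'): 2, ('register', 'pay'): 2}
```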

    Proof Learning in PVS with Utility Pattern Mining

    Interactive theorem provers (ITPs) are software tools that allow human users to write and verify formal proofs. In recent years, an emerging research area in ITPs is proof mining, which consists of identifying interesting proof patterns that can be used to guide the interactive proof process in ITPs. In previous studies, data mining techniques such as frequent pattern mining have been used to analyze proofs and find frequent proof steps. Though useful, such models ignore the fact that not all proof steps are equally important. To address this issue, this paper proposes a novel proof mining approach that finds not only frequent patterns but also high utility patterns, in order to identify proof steps of high importance (utility). A proof process learning approach based on high utility itemset mining (HUIM) is proposed for the PVS (Prototype Verification System) proof assistant. Proofs in PVS theories are first abstracted into a computer-processable corpus, where each line represents a proof sequence and the proof commands in each sequence are associated with utilities representing their weight (importance). HUIM techniques are then applied to the corpus to discover frequent proof steps and high utility patterns, together with their relationships to each other. Experimental results suggest that combining frequent pattern mining techniques, such as sequential pattern mining and high utility itemset mining, with proof assistants, such as PVS, is useful for learning and guiding the proof development process.
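
    A naive sketch of the high-utility view of a proof corpus described above: each proof command carries a weight, and a set of commands is high utility when its summed utility over the sequences containing it reaches a threshold. The PVS-like command names and the weights are invented for illustration, and real HUIM algorithms prune this exhaustive enumeration.

```python
# Invented example: proof sequences as itemsets of commands, each command
# weighted by an assumed utility. An itemset's utility is summed over every
# sequence that contains all of its commands.
from itertools import combinations

proof_corpus = [
    ["skosimp*", "expand", "assert"],
    ["skosimp*", "assert"],
    ["expand", "inst?", "assert"],
]
utility = {"skosimp*": 1, "expand": 3, "inst?": 5, "assert": 1}  # assumed weights

def high_utility_itemsets(corpus, utility, min_util, max_size=2):
    items = sorted({c for seq in corpus for c in seq})
    results = {}
    for k in range(1, max_size + 1):
        for cand in combinations(items, k):
            u = sum(sum(utility[c] for c in cand)
                    for seq in corpus if set(cand) <= set(seq))
            if u >= min_util:
                results[cand] = u
    return results

print(high_utility_itemsets(proof_corpus, utility, min_util=6))
# e.g. ('expand', 'inst?') scores 8 despite appearing in only one sequence
```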

    New Approaches to Frequent and Incremental Frequent Pattern Mining

    Data Mining (DM) is a process for extracting interesting patterns from large volumes of data. It is one of the crucial steps in Knowledge Discovery in Databases (KDD). It involves various data mining methods that mainly fall into predictive and descriptive models. Descriptive models look for patterns, rules, relationships and associations within data. One of the descriptive methods is association rule analysis, which represents the co-occurrence of items or events. Association rules are commonly used in market basket analysis. An association rule has the form X → Y and shows that X and Y co-occur with a given level of support and confidence. Association rule mining is a common technique for discovering interesting frequent patterns in large datasets acquired in various application domains. With petabytes of data finding their way into data storage virtually every day, many researchers have sought efficient methods for analyzing these large datasets. Many algorithms have been proposed for finding frequent patterns, but the search space explodes combinatorially as the size of the source data increases, and simply using more powerful computers, or even supercomputers, to handle ever-increasing data sets is not sufficient. Hence, incremental algorithms have been developed and used to improve the efficiency of frequent pattern mining. One of the challenges of frequent itemset mining is the long running time of the algorithms, driven mainly by two costs: the number of database scans and the number of candidates generated. The latter requires memory, and the more candidates there are, the more memory is needed; when the candidates do not fit in memory, page swapping occurs and running times grow further. In this dissertation we propose a new implementation of the Apriori algorithm, NCLAT (Near Candidate-less Apriori with Tidlists), which scans the database only once and creates candidates only for level one (1-itemsets), whose number equals the number of unique items in the database. In addition, we examine the effects of the choice of data structures (probabilistic or not), of horizontal versus vertical dataset layouts, of how counting is done, and of whether the algorithms run single-threaded or in parallel. We implement and explore the incremental algorithm UWEP with both single-threaded and parallel computation. We have also fixed a minor bug in UWEP and created a more efficient version, UWEP2, which reduces the number of candidates created and the number of database scans. We have run all of our tests against three datasets with different features at different minimum support levels, and we report both frequent and incremental frequent itemset mining results and compare them to each other. While much work has been done on frequent itemset mining over structured data, very little has been done on unstructured data. We have therefore created a new hybrid pattern search algorithm, Double-Hash, which performed better in all of our test scenarios than the known pattern search algorithms and can potentially be used for frequent itemset mining on unstructured data in the future; we present our work and test results on this as well.
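
    The description of NCLAT (one database scan, candidates only at level one, tidlists) implies a vertical counting scheme along the following lines; this is a sketch of the general tidlist idea under stated assumptions, not the dissertation's NCLAT implementation.

```python
# Sketch of vertical, tidlist-based support counting: one scan builds the
# per-item tidlists; any itemset's support afterwards is a set intersection.
from collections import defaultdict
from functools import reduce

def build_tidlists(transactions):
    tidlists = defaultdict(set)
    for tid, items in enumerate(transactions):  # the single database scan
        for item in items:
            tidlists[item].add(tid)
    return tidlists

def support(itemset, tidlists):
    return len(reduce(set.intersection, (tidlists[i] for i in itemset)))

transactions = [["a", "b", "c"], ["a", "b"], ["b", "c"], ["a", "c"]]
tidlists = build_tidlists(transactions)
print(support(["a"], tidlists))       # 3
print(support(["a", "b"], tidlists))  # 2, from intersecting two tidlists
```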

    GPD: A Graph Pattern Diffusion Kernel for Accurate Graph Classification with Applications in Cheminformatics

    Graph data mining is an active research area. Graphs are general modeling tools for organizing information from heterogeneous sources, and they have been applied in many scientific, engineering, and business fields. With the fast accumulation of graph data, building highly accurate predictive models for graph data emerges as a new challenge that has not been fully explored in the data mining community. In this paper, we demonstrate a novel technique called the graph pattern diffusion (GPD) kernel. Our idea is to leverage existing frequent pattern discovery methods and to explore the application of kernel classifiers (e.g., support vector machines) to building highly accurate graph classification models. In our method, we first identify all frequent patterns in a graph database. We then map subgraphs to graphs in the graph database and use a process we call "pattern diffusion" to label nodes in the graphs. Finally, we design a graph alignment algorithm to compute the inner product of two graphs. We have tested our algorithm on a number of chemical structure datasets, and the experimental results demonstrate that our method is significantly better than competing methods such as kernel functions based on paths, cycles, and subgraphs.
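
    A loose sketch of a diffusion-style node labeling consistent with the abstract: nodes belonging to a frequent pattern start with mass 1, which is repeatedly spread to neighbors to produce a soft, topology-aware label for every node. The update rule and parameters are assumptions for illustration, and the subsequent graph alignment and inner-product computation are omitted.

```python
# Illustrative diffusion: spread pattern membership over graph structure.
def diffuse(adjacency, seed_nodes, steps=10, retain=0.5):
    """adjacency: {node: [neighbors]}; seed_nodes: nodes matching the pattern."""
    label = {v: 1.0 if v in seed_nodes else 0.0 for v in adjacency}
    for _ in range(steps):
        nxt = {}
        for v, neighbors in adjacency.items():
            # Each node keeps a fraction of its mass and receives the rest
            # from its neighbors, split evenly over each neighbor's degree.
            inflow = sum((1 - retain) * label[u] / len(adjacency[u])
                         for u in neighbors)
            nxt[v] = retain * label[v] + inflow
        label = nxt
    return label

graph = {"n1": ["n2"], "n2": ["n1", "n3"], "n3": ["n2", "n4"], "n4": ["n3"]}
print(diffuse(graph, seed_nodes={"n1", "n2"}))  # soft labels, highest near seeds
```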

    Effective pattern discovery for text mining

    Many data mining techniques have been proposed for mining useful patterns in text documents. However, how to effectively use and update discovered patterns is still an open research issue, especially in the domain of text mining. Since most existing text mining methods adopt term-based approaches, they all suffer from the problems of polysemy and synonymy. Over the years, people have often held the hypothesis that pattern-based (or phrase-based) approaches should perform better than term-based ones, but many experiments have not supported this hypothesis. This paper presents an innovative technique, effective pattern discovery, which includes the processes of pattern deploying and pattern evolving, to improve the effectiveness of using and updating discovered patterns for finding relevant and interesting information. Substantial experiments on the RCV1 data collection and TREC topics demonstrate that the proposed solution achieves encouraging performance.
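
    As a hypothetical illustration of the "pattern deploying" step named above, the sketch maps each discovered pattern (a term set with a support value) back onto individual terms so documents can be scored by term weights; the even split of support among a pattern's terms is an invented choice, not the paper's evaluation method.

```python
# Hypothetical "pattern deploying": map discovered patterns (term sets with
# support) onto individual term weights, then score documents by those terms.
from collections import defaultdict

def deploy_patterns(patterns):
    """patterns: [(set_of_terms, support), ...] -> {term: weight}."""
    weights = defaultdict(float)
    for terms, support in patterns:
        for t in terms:
            weights[t] += support / len(terms)  # even split: an assumption
    return dict(weights)

def score(document_terms, weights):
    # Rank a document by the deployed weights of the terms it contains.
    return sum(weights.get(t, 0.0) for t in document_terms)

patterns = [({"global", "economy"}, 3), ({"economy", "policy"}, 2)]
weights = deploy_patterns(patterns)
print(score({"economy", "policy"}, weights))  # 2.5 + 1.0 = 3.5
```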

    A pattern mining approach for information filtering systems

    It is a big challenge to clearly identify the boundary between positive and negative streams in information filtering systems. Several attempts have used negative feedback to address this challenge; however, two issues arise when using negative relevance feedback to improve the effectiveness of information filtering. The first is how to select constructive negative samples in order to reduce the space of negative documents. The second is how to decide which noisy extracted features should be updated based on the selected negative samples. This paper proposes a pattern mining based approach that selects offenders from the negative documents, where an offender can be used to reduce the side effects of noisy features. It also classifies extracted features (i.e., terms) into three categories: positive specific terms, general terms, and negative specific terms, so that multiple revising strategies can be used to update the extracted features. An iterative learning algorithm is also proposed to implement this approach on the RCV1 data collection, and substantial experiments show that the proposed approach achieves encouraging performance that is also consistent for adaptive filtering.
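
    A hypothetical sketch of the three-way term classification the abstract describes, comparing document frequencies in the positive documents against the selected negative offenders; the margin and the frequency-difference scoring are invented for illustration and stand in for the paper's actual revising strategies.

```python
# Invented three-way split of terms by where they occur: mostly in positive
# documents, mostly in negative offenders, or roughly everywhere (general).
def classify_terms(positive_docs, offender_docs, margin=0.25):
    terms = {t for d in positive_docs + offender_docs for t in d}
    categories = {}
    for t in terms:
        pos = sum(t in d for d in positive_docs) / len(positive_docs)
        neg = sum(t in d for d in offender_docs) / len(offender_docs)
        if pos - neg > margin:
            categories[t] = "positive specific"
        elif neg - pos > margin:
            categories[t] = "negative specific"
        else:
            categories[t] = "general"
    return categories

positive_docs = [{"mining", "pattern", "filter"}, {"mining", "filter"}]
offender_docs = [{"mining", "noise"}, {"mining", "spam"}]
print(classify_terms(positive_docs, offender_docs))
# 'filter'/'pattern' -> positive specific, 'noise'/'spam' -> negative
# specific, 'mining' -> general
```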