Mining XML documents with association rule algorithms
Thesis (Master)--Izmir Institute of Technology, Computer Engineering, Izmir, 2008. Includes bibliographical references (leaves: 59-63). Text in English; abstract in Turkish and English. x, 63 leaves.
With the increasing use of XML for data storage and data exchange between applications, mining XML documents has become an important research topic. In this study, we consider the problem of mining association rules between items in an XML document. The principal purpose of the study is to apply association rule algorithms directly to XML documents using XQuery, a functional expression language for querying and processing XML data. We use three algorithms: Apriori, AprioriTid and High Efficient AprioriTid. We compare the mining times of these three Apriori-like algorithms on XML documents across different support levels, datasets and dataset sizes.
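As an illustration of the kind of algorithm the abstract refers to, here is a minimal Apriori sketch. It is not the thesis's implementation: the thesis works in XQuery, whereas this sketch uses Python, and the XML layout (`transactions`, `transaction`, `item` elements) and the function names are assumptions made for the example.

```python
import xml.etree.ElementTree as ET
from itertools import combinations

# Hypothetical XML layout; the thesis's actual document structure is not given.
SAMPLE_XML = """
<transactions>
  <transaction><item>bread</item><item>milk</item></transaction>
  <transaction><item>bread</item><item>butter</item></transaction>
  <transaction><item>bread</item><item>milk</item><item>butter</item></transaction>
  <transaction><item>milk</item></transaction>
</transactions>
"""

def load_transactions(xml_text):
    """Parse the assumed XML layout into a list of item sets."""
    root = ET.fromstring(xml_text)
    return [frozenset(i.text for i in t.findall("item"))
            for t in root.findall("transaction")]

def apriori(transactions, min_support):
    """Return {itemset: support} for every itemset meeting min_support."""
    n = len(transactions)
    # Level 1 candidates: every single item seen in the data.
    current = {frozenset([i]) for t in transactions for i in t}
    frequent = {}
    k = 1
    while current:
        # Count each candidate with one scan over the transactions.
        level = {}
        for cand in current:
            sup = sum(1 for t in transactions if cand <= t) / n
            if sup >= min_support:
                level[cand] = sup
        frequent.update(level)
        # Join step: merge frequent k-itemsets into (k+1)-item candidates.
        k += 1
        joined = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: keep a candidate only if all its (k-1)-subsets are frequent.
        current = {c for c in joined
                   if all(frozenset(s) in level for s in combinations(c, k - 1))}
    return frequent

transactions = load_transactions(SAMPLE_XML)
result = apriori(transactions, min_support=0.5)
```

With a minimum support of 0.5, the pair {milk, butter} occurs in only one of four transactions and is dropped, so the three-item candidate is pruned without being counted.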
Mining fuzzy association rules in large databases with quantitative attributes.
by Kuok, Chan Man.
Thesis (M.Phil.)--Chinese University of Hong Kong, 1997.
Includes bibliographical references (leaves 74-77).
Contents:
Abstract
Acknowledgments
Chapter 1. Introduction
  1.1 Data Mining
  1.2 Association Rule Mining
Chapter 2. Background
  2.1 Framework of Association Rule Mining
    2.1.1 Large Itemsets
    2.1.2 Association Rules
  2.2 Association Rule Algorithms For Binary Attributes
    2.2.1 AIS
    2.2.2 SETM
    2.2.3 Apriori, AprioriTid and AprioriHybrid
    2.2.4 PARTITION
  2.3 Association Rule Algorithms For Numeric Attributes
    2.3.1 Quantitative Association Rules
    2.3.2 Optimized Association Rules
Chapter 3. Problem Definition
  3.1 Handling Quantitative Attributes
    3.1.1 Discrete intervals
    3.1.2 Overlapped intervals
    3.1.3 Fuzzy sets
  3.2 Fuzzy association rule
  3.3 Significance factor
  3.4 Certainty factor
    3.4.1 Using significance
    3.4.2 Using correlation
    3.4.3 Significance vs. Correlation
Chapter 4. Steps For Mining Fuzzy Association Rules
  4.1 Candidate itemsets generation
    4.1.1 Candidate 1-Itemsets
    4.1.2 Candidate k-Itemsets (k > 1)
  4.2 Large itemsets generation
  4.3 Fuzzy association rules generation
Chapter 5. Experimental Results
  5.1 Experiment One
  5.2 Experiment Two
  5.3 Experiment Three
  5.4 Experiment Four
  5.5 Experiment Five
    5.5.1 Number of Itemsets
    5.5.2 Number of Rules
  5.6 Experiment Six
    5.6.1 Varying Significance Threshold
    5.6.2 Varying Membership Threshold
    5.6.3 Varying Confidence Threshold
Chapter 6. Discussions
  6.1 User guidance
  6.2 Rule understanding
  6.3 Number of rules
Chapter 7. Conclusions and Future Works
Bibliography
Enhancing Fuzzy Associative Rule Mining Approaches for Improving Prediction Accuracy. Integration of Fuzzy Clustering, Apriori and Multiple Support Approaches to Develop an Associative Classification Rule Base
Building an accurate and reliable prediction model for different application domains is one of the most significant challenges in knowledge discovery and data mining. This thesis focuses on building and enhancing a generic predictive model for estimating a future value by extracting association rules (knowledge) from a quantitative database. This model is applied to several data sets obtained from different benchmark problems, and the results are evaluated through extensive experimental tests.
The thesis presents an incremental development process for the prediction model with three stages. Firstly, a Knowledge Discovery (KD) model is proposed by integrating Fuzzy C-Means (FCM) with the Apriori approach to extract Fuzzy Association Rules (FARs) from a database for building a Knowledge Base (KB) to predict a future value. The KD model has been tested with two road-traffic data sets.
Secondly, the initial model has been further developed by including a diversification method in order to improve the reliability of the FARs and to identify the best and most representative rules. The resulting Diverse Fuzzy Rule Base (DFRB) maintains high-quality and diverse FARs, offering a more reliable and generic model. The model uses FCM to transform quantitative data into fuzzy data, while a Multiple Support Apriori (MSapriori) algorithm is adapted to extract the FARs from the fuzzy data. The correlation values for these FARs are calculated, and the FARs are then filtered as a post-processing step. The diversity of the FARs is maintained by clustering them, based on the concept of the sharing function technique used in multi-objective optimization. The best and most diverse FARs are obtained as the DFRB, which is used within the Fuzzy Inference System (FIS) for prediction.
The third stage of development proposes a hybrid prediction model called the Fuzzy Associative Classification Rule Mining (FACRM) model. This model integrates the improved Gustafson-Kessel (G-K) algorithm, the proposed Fuzzy Associative Classification Rules (FACR) algorithm and the proposed diversification method. The improved G-K algorithm transforms quantitative data into fuzzy data, while the FACR algorithm generates significant rules (Fuzzy Classification Association Rules (FCARs)) by employing the improved multiple support threshold, associative classification and vertical scanning format approaches. These FCARs are then filtered by calculating the correlation value and the distance between them. The advantage of the proposed FACRM model is that it builds a generalized prediction model able to deal with different application domains. The FACRM model is validated using different benchmark data sets from the University of California, Irvine (UCI) machine learning repository and the KEEL (Knowledge Extraction based on Evolutionary Learning) repository, and its results are also compared with those of other existing prediction models. The experimental results show that the error rate and generalization performance of the proposed model are better than those of the commonly used models on the majority of data sets.
A new method for feature selection entitled Weighting Feature Selection (WFS) is also proposed. The WFS method aims to improve the performance of the FACRM model. The prediction performance is improved by minimizing the prediction error and reducing the number of generated rules. The prediction results of FACRM with WFS have been compared with those of FACRM alone and of Stepwise Regression (SR) models for different data sets. The performance analysis and comparative study show that the proposed prediction model provides an effective approach that can be used within a decision support system.
Applied Science University (ASU) of Jorda
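The fuzzification step described in this abstract (FCM memberships feeding an Apriori-style support count) can be sketched as follows. This is only an illustration under stated assumptions: the cluster centers are taken as already known rather than learned, the attribute name `speed` and its values are invented, and combining memberships by product is one common choice rather than the thesis's confirmed method. The membership formula itself is the standard FCM one.

```python
def fcm_memberships(x, centers, m=2.0):
    """Standard FCM membership of scalar x in each cluster.

    Assumes the centers are distinct and m > 1 (m is the fuzzifier).
    """
    dists = [abs(x - c) for c in centers]
    # A point lying exactly on a center belongs fully to that cluster.
    if 0.0 in dists:
        return [1.0 if d == 0.0 else 0.0 for d in dists]
    exponent = 2.0 / (m - 1.0)
    return [1.0 / sum((d_i / d_k) ** exponent for d_k in dists)
            for d_i in dists]

def fuzzy_support(records, fuzzy_items, centers, m=2.0):
    """Average membership degree of the records in a fuzzy itemset.

    fuzzy_items is a list of (attribute, cluster_index) pairs; the
    per-item memberships are combined by product (an assumption).
    """
    total = 0.0
    for rec in records:
        degree = 1.0
        for attr, idx in fuzzy_items:
            degree *= fcm_memberships(rec[attr], centers[attr], m)[idx]
        total += degree
    return total / len(records)

# Hypothetical data: two clusters ("low", "high") for a speed attribute.
centers = {"speed": [20.0, 80.0]}
records = [{"speed": 20.0}, {"speed": 50.0}]
low_speed_support = fuzzy_support(records, [("speed", 0)], centers)
```

A record at speed 50 sits midway between the two centers and so contributes 0.5 to the "low" cluster, which is what distinguishes fuzzy support from a crisp count.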
A Study on Data Filtering Techniques for Event-Driven Failure Analysis
Engineering & Systems Design
High-performance sensors and modern data logging technology with real-time telemetry allow system failures to be analyzed very precisely. Fault detection, isolation and identification are the typical steps used to analyze the root causes of failures. This systematic failure analysis provides not only useful clues for rectifying the abnormal behaviors of a system, but also key information for redesigning the current system for retrofit. The main barriers to effective failure analysis are: (i) the gathered sensor data logs, usually in the form of event logs containing massive datasets, are too large, and (ii) noise and redundant information in the gathered sensor data make precise analysis difficult. Therefore, the objective of this thesis is to develop an event-driven failure analysis method that takes into account both functional interactions between subsystems and diverse user behaviors. To do this, we first apply various data filtering techniques for data cleaning and reduction, and then convert the filtered data into a new format of event sequence information (called "eventization"). Four eventization strategies (equal-width binning, entropy, domain expert knowledge, and probability distribution estimation) are examined for data filtering, in order to extract only important information from the raw sensor data while minimizing information loss. By numerical simulation, we identify the optimal values of the eventization parameters. Finally, the event sequence information containing the time gap between event occurrences is decoded to investigate the correlation between specific event sequence patterns and various system failures. These extracted patterns are stored in a failure pattern library, which is then used as the main reference source to predict failures in real time during the failure prognosis phase. The efficiency of the developed procedure is examined with a terminal box data log of marine diesel engines.
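Of the four eventization strategies named in this abstract, equal-width binning is the simplest to illustrate. The sketch below is an assumption about how such a conversion could look, not the thesis's procedure: the function names, the `E0`, `E1`, ... labels, and the rule of emitting an event only when the bin changes are all invented for the example.

```python
def equal_width_labels(values, n_bins, prefix="E"):
    """Map each reading to the label of its equal-width bin."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    labels = []
    for v in values:
        if width == 0:
            idx = 0  # all readings identical: a single bin
        else:
            # Clamp so that the maximum reading falls in the last bin.
            idx = min(int((v - lo) / width), n_bins - 1)
        labels.append(f"{prefix}{idx}")
    return labels

def to_event_sequence(timestamps, values, n_bins):
    """Emit (event, time gap since the previous event) on each bin change."""
    sequence = []
    last_label = None
    last_time = None
    for t, label in zip(timestamps, equal_width_labels(values, n_bins)):
        if label != last_label:
            gap = 0 if last_time is None else t - last_time
            sequence.append((label, gap))
            last_label, last_time = label, t
    return sequence

# Hypothetical sensor log: five readings, one per time unit.
events = to_event_sequence([0, 1, 2, 3, 4], [0.0, 1.0, 5.0, 5.5, 10.0], n_bins=2)
```

Here five raw readings reduce to two events, which shows the data reduction that eventization is meant to achieve while keeping the time gap between occurrences.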
Fuzzy-Granular Based Data Mining for Effective Decision Support in Biomedical Applications
Due to the complexity of biomedical problems, adaptive and intelligent knowledge discovery and data mining systems are needed to help humans understand the inherent mechanisms of diseases. For biomedical classification problems, it is typically impossible to build a perfect classifier with 100% prediction accuracy. Hence a more realistic target is to build an effective Decision Support System (DSS). In this dissertation, a novel adaptive Fuzzy Association Rules (FARs) mining algorithm, named FARM-DS, is proposed to build such a DSS for binary classification problems in the biomedical domain. Empirical studies show that FARM-DS is competitive with state-of-the-art classifiers in terms of prediction accuracy. More importantly, FARs can provide strong decision support for disease diagnosis because they are easy to interpret. This dissertation also proposes a fuzzy-granular method to select informative and discriminative genes from huge microarray gene expression data. With fuzzy granulation, information loss in the process of gene selection is decreased. As a result, more informative genes for cancer classification are selected and more accurate classifiers can be modeled. Empirical studies show that the proposed method is more accurate than traditional algorithms for cancer classification. Hence we expect that the selected genes can be more helpful for further biological studies.
Extracção de regras de associação com itens raros e frequentes (Extraction of association rules with rare and frequent items)
Over the past few years, association rules have taken on an important role in extracting information and knowledge from databases, which supports the decision-making process.
Most research on association rules is based on the support and confidence model. This model yields association rules that mainly involve frequent itemsets.
However, in recent years, there has been growing interest in itemsets that occur less frequently, which give rise to so-called rare or infrequent association rules. Many rules based on these items are of particular interest to the user.
Recent research on association rules has tried to generate the largest possible number of interesting rules that combine rare and frequent items.
This study therefore begins with a survey of the main data mining algorithms that address association rules.
An association rule is considered rare when it is formed by both frequent and infrequent items, or by infrequent items only.
The purpose of this study is to examine existing techniques and algorithms for extracting association rules, to identify the main advantages and disadvantages of these algorithms, and finally to develop an algorithm that generates association rules involving both rare and frequent items.
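The support and confidence model mentioned in this abstract, and its definition of a rare rule, can be sketched in a few lines. This is an illustrative sketch only: the function names, the sample transactions and the `min_item_support` threshold are assumptions, and the code is not the algorithm developed in the study.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = frozenset(itemset)
    return sum(1 for t in transactions if itemset <= frozenset(t)) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """support(antecedent and consequent) / support(antecedent)."""
    s_antecedent = support(antecedent, transactions)
    if s_antecedent == 0:
        return 0.0
    return support(set(antecedent) | set(consequent), transactions) / s_antecedent

def is_rare_rule(antecedent, consequent, transactions, min_item_support):
    """A rule is rare if at least one of its items is infrequent.

    This covers both cases in the abstract: frequent mixed with
    infrequent items, and infrequent items only.
    """
    items = set(antecedent) | set(consequent)
    return any(support({i}, transactions) < min_item_support for i in items)

# Hypothetical market-basket data.
transactions = [
    {"bread", "milk"},
    {"bread", "milk"},
    {"bread", "caviar", "champagne"},
    {"bread"},
    {"milk"},
]
rare = is_rare_rule({"caviar"}, {"champagne"}, transactions, min_item_support=0.4)
common = is_rare_rule({"bread"}, {"milk"}, transactions, min_item_support=0.4)
```

The rule caviar -> champagne has confidence 1.0 but support of only 0.2, so a support threshold of 0.4 would discard it; this is the kind of rule that motivates mining rare items alongside frequent ones.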