97,222 research outputs found
Association Rule Based Classification
In this thesis, we focused on the construction of classification models based on association rules. Although association rules have been predominantly used for data exploration and description, the interest in using them for prediction has rapidly increased in the data mining community. In order to mine only rules that can be used for classification, we modified the well known association rule mining algorithm Apriori to handle user-defined input constraints. We considered constraints that require the presence/absence of particular items, or that limit the number of items, in the antecedents and/or the consequents of the rules. We developed a characterization of those itemsets that will potentially form rules that satisfy the given constraints. This characterization allows us to prune during itemset construction itemsets such that neither they nor any of their supersets will form valid rules. This improves the time performance of itemset construction. Using this characterization, we implemented a classification system based on association rules and compared the performance of several model construction methods, including CBA, and several model deployment modes to make predictions. Although the data mining community has dealt only with the classification of single-valued attributes, there are several domains in which the classification target is set-valued. Hence, we enhanced our classification system with a novel approach to handle the prediction of set-valued class attributes. Since the traditional classification accuracy measure is inappropriate in this context, we developed an evaluation method for set-valued classification based on the E-Measure. Furthermore, we enhanced our algorithm by not relying on the typical support/confidence framework, and instead mining for the best possible rules above a user-defined minimum confidence and within a desired range for the number of rules. This avoids long mining times that might produce large collections of rules with low predictive power. For this purpose, we developed a heuristic function to determine an initial minimum support and then adjusted it using a binary search strategy until a number of rules within the given range was obtained. We implemented all of our techniques described above in WEKA, an open source suite of machine learning algorithms. We used several datasets from the UCI Machine Learning Repository to test and evaluate our techniques
Too Trivial To Test? An Inverse View on Defect Prediction to Identify Methods with Low Fault Risk
Background. Test resources are usually limited and therefore it is often not
possible to completely test an application before a release. To cope with the
problem of scarce resources, development teams can apply defect prediction to
identify fault-prone code regions. However, defect prediction tends to low
precision in cross-project prediction scenarios.
Aims. We take an inverse view on defect prediction and aim to identify
methods that can be deferred when testing because they contain hardly any
faults due to their code being "trivial". We expect that characteristics of
such methods might be project-independent, so that our approach could improve
cross-project predictions.
Method. We compute code metrics and apply association rule mining to create
rules for identifying methods with low fault risk. We conduct an empirical
study to assess our approach with six Java open-source projects containing
precise fault data at the method level.
Results. Our results show that inverse defect prediction can identify approx.
32-44% of the methods of a project to have a low fault risk; on average, they
are about six times less likely to contain a fault than other methods. In
cross-project predictions with larger, more diversified training sets,
identified methods are even eleven times less likely to contain a fault.
Conclusions. Inverse defect prediction supports the efficient allocation of
test resources by identifying methods that can be treated with less priority in
testing activities and is well applicable in cross-project prediction
scenarios.Comment: Submitted to PeerJ C
LC an effective classification based association rule mining algorithm
Classification using association rules is a research field in data mining that primarily uses association rule discovery techniques in classification benchmarks. It has been confirmed by many research studies in the literature that classification using association tends to generate more predictive classification systems than traditional classification data mining techniques like probabilistic, statistical and decision tree. In this thesis, we introduce a novel data mining algorithm based on classification using association called “Looking at the Class” (LC), which can be used in for mining a range of classification data sets. Unlike known algorithms in classification using the association approach such as Classification based on Association rule (CBA) system and Classification based on Predictive Association (CPAR) system, which merge disjoint items in the rule learning step without anticipating the class label similarity, the proposed algorithm merges only items with identical class labels. This saves too many unnecessary items combining during the rule learning step, and consequently results in large saving in computational time and memory.
Furthermore, the LC algorithm uses a novel prediction procedure that employs multiple rules to make the prediction decision instead of a single rule. The proposed algorithm has been evaluated thoroughly on real world security data sets collected using an automated tool developed at Huddersfield University. The security application which we have considered in this thesis is about categorizing websites based on their features to legitimate or fake which is a typical binary classification problem. Also, experimental results on a number of UCI data sets have been conducted and the measures used for evaluation is the classification accuracy, memory usage, and others. The results show that LC algorithm outperformed traditional classification algorithms such as C4.5, PART and Naïve Bayes as well as known classification based association algorithms like CBA with respect to classification accuracy, memory usage, and execution time on most data sets we consider
Protein secondary structure prediction using BLAST and relaxed threshold rule induction from coverings
Protein structure prediction has always been an important research area in bioinformatics and biochemistry. Despite the recent breakthrough of combining multiple sequence alignment information and artificial intelligence algorithms to predict protein secondary structure, the Q₃ accuracy of various computational prediction methods rarely has exceeded 75%; this status has changed little since 2003 when Rost stated that the currently best methods reach a level around 77% three-state per-residue accuracy. The application of artificial neural network methods to this problem is revolutionary in the sense that those techniques employ the homologues of proteins for training and prediction. In this dissertation, a different approach, RT-RICO (Relaxed Threshold Rule Induction from Coverings), is presented that instead uses association rule mining. This approach still makes use of the fundamental principle that structure is more conserved than sequence. However, rules between each known secondary structure element and its neighboring amino acid residues are established to perform the predictions. This dissertation consists of five research articles that discuss different prediction techniques and detailed rule-generation algorithms. The most recent prediction approach, BLAST-RT-RICO, achieved a Q₃ accuracy score of 89.93% on the standard test dataset RS126 and a Q₃ score of 87.71% on the standard test dataset CB396, an improvement over comparable computational methods. Herein one research article also discusses the results of examining those RT-RICO rules using an existing association rule visualization tool, modified to account for the non-Boolean characterization of protein secondary structure --Abstract, page iv
Promoter Sequences Prediction Using Relational Association Rule Mining
In this paper we are approaching, from a computational perspective, the problem of promoter sequences prediction, an important problem within the field of bioinformatics. As the conditions for a DNA sequence to function as a promoter are not known, machine learning based classification models are still developed to approach the problem of promoter identification in the DNA. We are proposing a classification model based on relational association rules mining. Relational association rules are a particular type of association rules and describe numerical orderings between attributes that commonly occur over a data set. Our classifier is based on the discovery of relational association rules for predicting if a DNA sequence contains or not a promoter region. An experimental evaluation of the proposed model and comparison with similar existing approaches is provided. The obtained results show that our classifier overperforms the existing techniques for identifying promoter sequences, confirming the potential of our proposal
Knowledge discovery in biological databases : a neural network approach
Knowledge discovery, in databases, also known as data mining, is aimed to find significant information from a set of data. The knowledge to be mined from the dataset may refer to patterns, association rules, classification and clustering rules, and so forth. In this dissertation, we present a neural network approach to finding knowledge in biological databases. Specifically, we propose new methods to process biological sequences in two case studies: the classification of protein sequences and the prediction of E. Coli promoters in DNA sequences. Our proposed methods, based oil neural network architectures combine techniques ranging from Bayesian inference, coding theory, feature selection, dimensionality reduction, to dynamic programming and machine learning algorithms. Empirical studies show that the proposed methods outperform previously published methods and have excellent performance on the latest dataset. We have implemented the proposed algorithms into an infrastructure, called Genome Mining, developed for biosequence classification and recognition
A COLLABORATIVE FILTERING APPROACH TO PREDICT WEB PAGES OF INTEREST FROMNAVIGATION PATTERNS OF PAST USERS WITHIN AN ACADEMIC WEBSITE
This dissertation is a simulation study of factors and techniques involved in designing hyperlink recommender systems that recommend to users, web pages that past users with similar navigation behaviors found interesting. The methodology involves identification of pertinent factors or techniques, and for each one, addresses the following questions: (a) room for improvement; (b) better approach, if any; and (c) performance characteristics of the technique in environments that hyperlink recommender systems operate in. The following four problems are addressed:Web Page Classification. A new metric (PageRank × Inverse Links-to-Word count ratio) is proposed for classifying web pages as content or navigation, to help in the discovery of user navigation behaviors from web user access logs. Results of a small user study suggest that this metric leads to desirable results.Data Mining. A new apriori algorithm for mining association rules from large databases is proposed. The new algorithm addresses the problem of scaling of the classical apriori algorithm by eliminating an expensive joinstep, and applying the apriori property to every row of the database. In this study, association rules show the correlation relationships between user navigation behaviors and web pages they find interesting. The new algorithm has better space complexity than the classical one, and better time efficiency under some conditionsand comparable time efficiency under other conditions.Prediction Models for User Interests. We demonstrate that association rules that show the correlation relationships between user navigation patterns and web pages they find interesting can be transformed intocollaborative filtering data. We investigate collaborative filtering prediction models based on two approaches for computing prediction scores: using simple averages and weighted averages. Our findings suggest that theweighted averages scheme more accurately computes predictions of user interests than the simple averages scheme does.Clustering. Clustering techniques are frequently applied in the design of personalization systems. We studied the performance of the CLARANS clustering algorithm in high dimensional space in relation to the PAM and CLARA clustering algorithms. While CLARA had the best time performance, CLARANS resulted in clusterswith the lowest intra-cluster dissimilarities, and so was most effective in this regard
A Relative Tolerance Relation of Rough Set (RTRS) for potential fish yields in Indonesia
The sea is essential to life on earth, including regulating the climate, producing oxygen, providing medicines, providing habitats for marine animals, and feeding millions of people. It must be ensured that the sea continues to meet the needs of life without sacrificing the people of future generations. The sea regulates the planet’s climate and is a significant source of nutrients. The sea becomes an essential part of global commerce, while the contents of the ocean become the solution of human energy needs today and the future. The wealth and potential of the sea as a source of energy for humans today and the future needs to be mapped and described to provide a picture of marine potential to all concerned. As part of the government, the Ministry of Marine Affairs and Fisheries is responsible for the process of formulating, determining, and implementing policies in the field of marine and fisheries based on the results of mapping and extracting information from existing conditions. The results of this information can be used to predict the marine potential in a marine area. This prediction process can be developed using data-mining techniques such as applying the association rule by looking at the relationship between the quantity of fish based on the plankton abundance index. However, this association rules data-mining techniques that require complete data, which are data sets with no missing values to generate interesting rules for detection systems. The problem is often that required marine data are not available or that marine data are available, but they contain incomplete data. To address this problem, this paper introduces a Relative Tolerance Relation of Rough Set (RTRS). Novelty RTRS differs from previous rough approaches that use tolerance relationships, nonsymmetric equation relationships, and limited tolerance relationships. The RTRS approach is based on a limited tolerance relationship considering the relative precision between two objects; therefore, this is the first job to use relative precision. In addition, this paper presents the mathematical approach of the RTRS and compares it with other existing approaches using the marine real dataset to classify the marine potential level of the region. The results show that the proposed approach is better than the existing approach in terms of accuracy
Recommended from our members
Enhancing Fuzzy Associative Rule Mining Approaches for Improving Prediction Accuracy. Integration of Fuzzy Clustering, Apriori and Multiple Support Approaches to Develop an Associative Classification Rule Base
Building an accurate and reliable model for prediction for different application domains, is one of the most significant challenges in knowledge discovery and data mining. This thesis focuses on building and enhancing a generic predictive model for estimating a future value by extracting association rules (knowledge) from a quantitative database. This model is applied to several data sets obtained from different benchmark problems, and the results are evaluated through extensive experimental tests.
The thesis presents an incremental development process for the prediction model with three stages. Firstly, a Knowledge Discovery (KD) model is proposed by integrating Fuzzy C-Means (FCM) with Apriori approach to extract Fuzzy Association Rules (FARs) from a database for building a Knowledge Base (KB) to predict a future value. The KD model has been tested with two road-traffic data sets.
Secondly, the initial model has been further developed by including a diversification method in order to improve a reliable FARs to find out the best and representative rules. The resulting Diverse Fuzzy Rule Base (DFRB) maintains high quality and diverse FARs offering a more reliable and generic model. The model uses FCM to transform quantitative data into fuzzy ones, while a Multiple Support Apriori (MSapriori) algorithm is adapted to extract the FARs from fuzzy data. The correlation values for these FARs are calculated, and an efficient orientation for filtering FARs is performed as a post-processing method. The FARs diversity is maintained through the clustering of FARs, based on the concept of the sharing function technique used in multi-objectives optimization. The best and the most diverse FARs are obtained as the DFRB to utilise within the Fuzzy Inference System (FIS) for prediction.
The third stage of development proposes a hybrid prediction model called Fuzzy Associative Classification Rule Mining (FACRM) model. This model integrates the
ii
improved Gustafson-Kessel (G-K) algorithm, the proposed Fuzzy Associative Classification Rules (FACR) algorithm and the proposed diversification method. The improved G-K algorithm transforms quantitative data into fuzzy data, while the FACR generate significant rules (Fuzzy Classification Association Rules (FCARs)) by employing the improved multiple support threshold, associative classification and vertical scanning format approaches. These FCARs are then filtered by calculating the correlation value and the distance between them. The advantage of the proposed FACRM model is to build a generalized prediction model, able to deal with different application domains. The validation of the FACRM model is conducted using different benchmark data sets from the University of California, Irvine (UCI) of machine learning and KEEL (Knowledge Extraction based on Evolutionary Learning) repositories, and the results of the proposed FACRM are also compared with other existing prediction models. The experimental results show that the error rate and generalization performance of the proposed model is better in the majority of data sets with respect to the commonly used models.
A new method for feature selection entitled Weighting Feature Selection (WFS) is also proposed. The WFS method aims to improve the performance of FACRM model. The prediction performance is improved by minimizing the prediction error and reducing the number of generated rules. The prediction results of FACRM by employing WFS have been compared with that of FACRM and Stepwise Regression (SR) models for different data sets. The performance analysis and comparative study show that the proposed prediction model provides an effective approach that can be used within a decision support system.Applied Science University (ASU) of Jorda
- …