Search CORE

222 research outputs found

Dynamic Rule Covering Classification in Data Mining with Cyber Security Phishing Application

Author: Qabajeh Issa Mohammad
Publication venue: Centre for Computational Intelligence
Publication date: 01/05/2017
Field of study

Data mining is the process of discovering useful patterns from datasets using intelligent techniques to help users make certain decisions. A typical data mining task is classification, which involves predicting a target variable known as the class in previously unseen data based on models learnt from an input dataset. Covering is a well-known classification approach that derives models with If-Then rules. Covering methods, such as PRISM, have a competitive predictive performance to other classical classification techniques such as greedy, decision tree and associative classification. Therefore, Covering models are appropriate decision-making tools and users favour them carrying out decisions. Despite the use of Covering approach in data processing for different classification applications, it is also acknowledged that this approach suffers from the noticeable drawback of inducing massive numbers of rules making the resulting model large and unmanageable by users. This issue is attributed to the way Covering techniques induce the rules as they keep adding items to the rule’s body, despite the limited data coverage (number of training instances that the rule classifies), until the rule becomes with zero error. This excessive learning overfits the training dataset and also limits the applicability of Covering models in decision making, because managers normally prefer a summarised set of knowledge that they are able to control and comprehend rather a high maintenance models. In practice, there should be a trade-off between the number of rules offered by a classification model and its predictive performance. Another issue associated with the Covering models is the overlapping of training data among the rules, which happens when a rule’s classified data are discarded during the rule discovery phase. Unfortunately, the impact of a rule’s removed data on other potential rules is not considered by this approach. However, When removing training data linked with a rule, both frequency and rank of other rules’ items which have appeared in the removed data are updated. The impacted rules should maintain their true rank and frequency in a dynamic manner during the rule discovery phase rather just keeping the initial computed frequency from the original input dataset. In response to the aforementioned issues, a new dynamic learning technique based on Covering and rule induction, that we call Enhanced Dynamic Rule Induction (eDRI), is developed. eDRI has been implemented in Java and it has been embedded in WEKA machine learning tool. The developed algorithm incrementally discovers the rules using primarily frequency and rule strength thresholds. These thresholds in practice limit the search space for both items as well as potential rules by discarding any with insufficient data representation as early as possible resulting in an efficient training phase. More importantly, eDRI substantially cuts down the number of training examples scans by continuously updating potential rules’ frequency and strength parameters in a dynamic manner whenever a rule gets inserted into the classifier. In particular, and for each derived rule, eDRI adjusts on the fly the remaining potential rules’ items frequencies as well as ranks specifically for those that appeared within the deleted training instances of the derived rule. This gives a more realistic model with minimal rules redundancy, and makes the process of rule induction efficient and dynamic and not static. Moreover, the proposed technique minimises the classifier’s number of rules at preliminary stages by stopping learning when any rule does not meet the rule’s strength threshold therefore minimising overfitting and ensuring a manageable classifier. Lastly, eDRI prediction procedure not only priorities using the best ranked rule for class forecasting of test data but also restricts the use of the default class rule thus reduces the number of misclassifications. The aforementioned improvements guarantee classification models with smaller size that do not overfit the training dataset, while maintaining their predictive performance. The eDRI derived models particularly benefit greatly users taking key business decisions since they can provide a rich knowledge base to support their decision making. This is because these models’ predictive accuracies are high, easy to understand, and controllable as well as robust, i.e. flexible to be amended without drastic change. eDRI applicability has been evaluated on the hard problem of phishing detection. Phishing normally involves creating a fake well-designed website that has identical similarity to an existing business trustful website aiming to trick users and illegally obtain their credentials such as login information in order to access their financial assets. The experimental results against large phishing datasets revealed that eDRI is highly useful as an anti-phishing tool since it derived manageable size models when compared with other traditional techniques without hindering the classification performance. Further evaluation results using other several classification datasets from different domains obtained from University of California Data Repository have corroborated eDRI’s competitive performance with respect to accuracy, number of knowledge representation, training time and items space reduction. This makes the proposed technique not only efficient in inducing rules but also effective

De Montfort University Open Research Archive

LC an effective classification based association rule mining algorithm

Author: Mahmood Qazafi
Publication venue
Publication date
Field of study

Classification using association rules is a research field in data mining that primarily uses association rule discovery techniques in classification benchmarks. It has been confirmed by many research studies in the literature that classification using association tends to generate more predictive classification systems than traditional classification data mining techniques like probabilistic, statistical and decision tree. In this thesis, we introduce a novel data mining algorithm based on classification using association called “Looking at the Class” (LC), which can be used in for mining a range of classification data sets. Unlike known algorithms in classification using the association approach such as Classification based on Association rule (CBA) system and Classification based on Predictive Association (CPAR) system, which merge disjoint items in the rule learning step without anticipating the class label similarity, the proposed algorithm merges only items with identical class labels. This saves too many unnecessary items combining during the rule learning step, and consequently results in large saving in computational time and memory. Furthermore, the LC algorithm uses a novel prediction procedure that employs multiple rules to make the prediction decision instead of a single rule. The proposed algorithm has been evaluated thoroughly on real world security data sets collected using an automated tool developed at Huddersfield University. The security application which we have considered in this thesis is about categorizing websites based on their features to legitimate or fake which is a typical binary classification problem. Also, experimental results on a number of UCI data sets have been conducted and the measures used for evaluation is the classification accuracy, memory usage, and others. The results show that LC algorithm outperformed traditional classification algorithms such as C4.5, PART and Naïve Bayes as well as known classification based association algorithms like CBA with respect to classification accuracy, memory usage, and execution time on most data sets we consider

University of Huddersfield Repository

A Comprehensive Survey of Data Mining-based Fraud Detection Research

Author: Agrawal
Au
Berry
Brentnall
Chen
Chiang Wang
David C. Yen
Feelders
Han
Hayhoe
Kirkosa
Ku
Leonard
Mitchell
Ngai
Quah
Rothman
Shaw
Shing-Han Li
Song
Sudjianto
Titus
Wen-Hui Lu
White
Publication venue: 'Elsevier BV'
Publication date: 30/09/2010
Field of study

This survey paper categorises, compares, and summarises from almost all published technical and review articles in automated fraud detection within the last 10 years. It defines the professional fraudster, formalises the main types and subtypes of known fraud, and presents the nature of data evidence collected within affected industries. Within the business context of mining the data to achieve higher cost savings, this research presents methods and techniques together with their problems. Compared to all related reviews on fraud detection, this survey covers much more technical articles and is the only one, to the best of our knowledge, which proposes alternative data and solutions from related domains.Comment: 14 page

arXiv.org e-Print Archive

Crossref

Recommended from our members

Phishing website detection using intelligent data mining techniques. Design and development of an intelligent association classification mining fuzzy based scheme for phishing website detection with an emphasis on E-banking.

Author: Abur-rous Maher Ragheb Mohammed
Publication venue: Department of Computing
Publication date: 01/01/2010
Field of study

Phishing techniques have not only grown in number, but also in sophistication. Phishers might have a lot of approaches and tactics to conduct a well-designed phishing attack. The targets of the phishing attacks, which are mainly on-line banking consumers and payment service providers, are facing substantial financial loss and lack of trust in Internet-based services. In order to overcome these, there is an urgent need to find solutions to combat phishing attacks. Detecting phishing website is a complex task which requires significant expert knowledge and experience. So far, various solutions have been proposed and developed to address these problems. Most of these approaches are not able to make a decision dynamically on whether the site is in fact phished, giving rise to a large number of false positives. This is mainly due to limitation of the previously proposed approaches, for example depending only on fixed black and white listing database, missing of human intelligence and experts, poor scalability and their timeliness. In this research we investigated and developed the application of an intelligent fuzzy-based classification system for e-banking phishing website detection. The main aim of the proposed system is to provide protection to users from phishers deception tricks, giving them the ability to detect the legitimacy of the websites. The proposed intelligent phishing detection system employed Fuzzy Logic (FL) model with association classification mining algorithms. The approach combined the capabilities of fuzzy reasoning in measuring imprecise and dynamic phishing features, with the capability to classify the phishing fuzzy rules. Different phishing experiments which cover all phishing attacks, motivations and deception behaviour techniques have been conducted to cover all phishing concerns. A layered fuzzy structure has been constructed for all gathered and extracted phishing website features and patterns. These have been divided into 6 criteria and distributed to 3 layers, based on their attack type. To reduce human knowledge intervention, Different classification and association algorithms have been implemented to generate fuzzy phishing rules automatically, to be integrated inside the fuzzy inference engine for the final phishing detection. Experimental results demonstrated that the ability of the learning approach to identify all relevant fuzzy rules from the training data set. A comparative study and analysis showed that the proposed learning approach has a higher degree of predictive and detective capability than existing models. Experiments also showed significance of some important phishing criteria like URL & Domain Identity, Security & Encryption to the final phishing detection rate. Finally, our proposed intelligent phishing website detection system was developed, tested and validated by incorporating the scheme as a web based plug-ins phishing toolbar. The results obtained are promising and showed that our intelligent fuzzy based classification detection system can provide an effective help for real-time phishing website detection. The toolbar successfully recognized and detected approximately 92% of the phishing websites selected from our test data set, avoiding many miss-classified websites and false phishing alarms

Bradford Scholars

Recommended from our members

A novel knowledge discovery based approach for supplier risk scoring with application in the HVAC industry

Author: Chuddher Bilal Akbar
Publication venue: Brunel University London
Publication date: 01/01/2015
Field of study

This thesis was submitted for the degree of Doctor of Philosophy and awarded by Brunel University LondonThis research has led to a novel methodology for assessment and quantification of supply risks in the supply chain. The research has built on advanced Knowledge Discovery techniques and has resulted to a software implementation to be able to do so. The methodology developed and presented here resembles the well-known consumer credit scoring methods as it leads to a similar metric, or score, for assessing a supplier’s reliability and risk of conducting business with that supplier. However, the focus is on a wide range of operational metrics rather than just financial, which credit scoring techniques typically focus on. The core of the methodology comprises the application of Knowledge Discovery techniques to extract the likelihood of possible risks from within a range of available datasets. In combination with cross-impact analysis, those datasets are examined for establish the inter-relationships and mutual connections among several factors that are likely contribute to risks associated with particular suppliers. This approach is called conjugation analysis. The resulting parameters become the inputs into a logistic regression which leads to a risk scoring model the outcome of the process is the standardized risk score which is analogous to the well-known consumer risk scoring model, better known as FICO score. The proposed methodology has been applied to an Air Conditioning manufacturing company. Two models have been developed. The first identifies the supply risks based on the data about purchase orders and selected risk factors. With this model the likelihoods of delivery failures, quality failures and cost failures are obtained. The second model built on the first one but also used the actual data about the performance of supplier to identify risks of conducting business with particular suppliers. Its target was to provide quantitative measures of an individual supplier’s risk level. The supplier risk scoring model is tested on the data acquired from the company for its performance analysis. The supplier risk scoring model achieved 86.2% accuracy, while the area under curve (AUC) was 0.863. The AUC curve is much higher than required model’s validity threshold value of 0.5. It represents developed model’s validity and reliability for future data. The numerical studies conducted with real-life datasets have demonstrated the effectiveness of the proposed methodology and system as well as its future potential for industrial adoption

Brunel University Research Archive

Utility-Based Data Mining Workshop Chairs:

Author: Bianca Zadrozny
Gary Weiss
Maytal Saar-tsechansky
Publication venue
Publication date
Field of study

CiteSeerX

Deriving Classifiers with Single and Multi-Label Rules using New Associative Classification Methods

Author: Abdelhamid Neda
Publication venue: Faculty of Technology
Publication date: 01/11/2013
Field of study

Associative Classification (AC) in data mining is a rule based approach that uses association rule techniques to construct accurate classification systems (classifiers). The majority of existing AC algorithms extract one class per rule and ignore other class labels even when they have large data representation. Thus, extending current AC algorithms to find and extract multi-label rules is promising research direction since new hidden knowledge is revealed for decision makers. Furthermore, the exponential growth of rules in AC has been investigated in this thesis aiming to minimise the number of candidate rules, and therefore reducing the classifier size so end-user can easily exploit and maintain it. Moreover, an investigation to both rule ranking and test data classification steps have been conducted in order to improve the performance of AC algorithms in regards to predictive accuracy. Overall, this thesis investigates different problems related to AC not limited to the ones listed above, and the results are new AC algorithms that devise single and multi-label rules from different applications data sets, together with comprehensive experimental results. To be exact, the first algorithm proposed named Multi-class Associative Classifier (MAC): This algorithm derives classifiers where each rule is connected with a single class from a training data set. MAC enhanced the rule discovery, rule ranking, rule filtering and classification of test data in AC. The second algorithm proposed is called Multi-label Classifier based Associative Classification (MCAC) that adds on MAC a novel rule discovery method which discovers multi-label rules from single label data without learning from parts of the training data set. These rules denote vital information ignored by most current AC algorithms which benefit both the end-user and the classifier’s predictive accuracy. Lastly, the vital problem related to web threats called “website phishing detection” was deeply investigated where a technical solution based on AC has been introduced in Chapter 6. Particularly, we were able to detect new type of knowledge and enhance the detection rate with respect to error rate using our proposed algorithms and against a large collected phishing data set. Thorough experimental tests utilising large numbers of University of California Irvine (UCI) data sets and a variety of real application data collections related to website classification and trainer timetabling problems reveal that MAC and MCAC generates better quality classifiers if compared with other AC and rule based algorithms with respect to various evaluation measures, i.e. error rate, Label-Weight, Any-Label, number of rules, etc. This is mainly due to the different improvements related to rule discovery, rule filtering, rule sorting, classification step, and more importantly the new type of knowledge associated with the proposed algorithms. Most chapters in this thesis have been disseminated or under review in journals and refereed conference proceedings

De Montfort University Open Research Archive