10 research outputs found
The Minimum Description Length Principle for Pattern Mining: A Survey
This is about the Minimum Description Length (MDL) principle applied to
pattern mining. The length of this description is kept to the minimum.
Mining patterns is a core task in data analysis and, beyond issues of
efficient enumeration, the selection of patterns constitutes a major challenge.
The MDL principle, a model selection method grounded in information theory, has
been applied to pattern mining with the aim to obtain compact high-quality sets
of patterns. After giving an outline of relevant concepts from information
theory and coding, as well as of work on the theory behind the MDL and similar
principles, we review MDL-based methods for mining various types of data and
patterns. Finally, we open a discussion on some issues regarding these methods,
and highlight currently active related data analysis problems
LC an effective classification based association rule mining algorithm
Classification using association rules is a research field in data mining that primarily uses association rule discovery techniques in classification benchmarks. It has been confirmed by many research studies in the literature that classification using association tends to generate more predictive classification systems than traditional classification data mining techniques like probabilistic, statistical and decision tree. In this thesis, we introduce a novel data mining algorithm based on classification using association called “Looking at the Class” (LC), which can be used in for mining a range of classification data sets. Unlike known algorithms in classification using the association approach such as Classification based on Association rule (CBA) system and Classification based on Predictive Association (CPAR) system, which merge disjoint items in the rule learning step without anticipating the class label similarity, the proposed algorithm merges only items with identical class labels. This saves too many unnecessary items combining during the rule learning step, and consequently results in large saving in computational time and memory.
Furthermore, the LC algorithm uses a novel prediction procedure that employs multiple rules to make the prediction decision instead of a single rule. The proposed algorithm has been evaluated thoroughly on real world security data sets collected using an automated tool developed at Huddersfield University. The security application which we have considered in this thesis is about categorizing websites based on their features to legitimate or fake which is a typical binary classification problem. Also, experimental results on a number of UCI data sets have been conducted and the measures used for evaluation is the classification accuracy, memory usage, and others. The results show that LC algorithm outperformed traditional classification algorithms such as C4.5, PART and Naïve Bayes as well as known classification based association algorithms like CBA with respect to classification accuracy, memory usage, and execution time on most data sets we consider
Recommended from our members
Phishing website detection using intelligent data mining techniques. Design and development of an intelligent association classification mining fuzzy based scheme for phishing website detection with an emphasis on E-banking.
Phishing techniques have not only grown in number, but also in sophistication. Phishers might
have a lot of approaches and tactics to conduct a well-designed phishing attack. The targets of
the phishing attacks, which are mainly on-line banking consumers and payment service
providers, are facing substantial financial loss and lack of trust in Internet-based services. In
order to overcome these, there is an urgent need to find solutions to combat phishing attacks.
Detecting phishing website is a complex task which requires significant expert knowledge and
experience. So far, various solutions have been proposed and developed to address these
problems. Most of these approaches are not able to make a decision dynamically on whether the
site is in fact phished, giving rise to a large number of false positives. This is mainly due to
limitation of the previously proposed approaches, for example depending only on fixed black
and white listing database, missing of human intelligence and experts, poor scalability and their
timeliness.
In this research we investigated and developed the application of an intelligent fuzzy-based
classification system for e-banking phishing website detection. The main aim of the proposed
system is to provide protection to users from phishers deception tricks, giving them the ability
to detect the legitimacy of the websites. The proposed intelligent phishing detection system
employed Fuzzy Logic (FL) model with association classification mining algorithms. The
approach combined the capabilities of fuzzy reasoning in measuring imprecise and dynamic
phishing features, with the capability to classify the phishing fuzzy rules. Different phishing experiments which cover all phishing attacks, motivations and deception
behaviour techniques have been conducted to cover all phishing concerns. A layered fuzzy
structure has been constructed for all gathered and extracted phishing website features and
patterns. These have been divided into 6 criteria and distributed to 3 layers, based on their attack
type. To reduce human knowledge intervention, Different classification and association
algorithms have been implemented to generate fuzzy phishing rules automatically, to be
integrated inside the fuzzy inference engine for the final phishing detection.
Experimental results demonstrated that the ability of the learning approach to identify all
relevant fuzzy rules from the training data set. A comparative study and analysis showed that
the proposed learning approach has a higher degree of predictive and detective capability than
existing models. Experiments also showed significance of some important phishing criteria like
URL & Domain Identity, Security & Encryption to the final phishing detection rate.
Finally, our proposed intelligent phishing website detection system was developed, tested and
validated by incorporating the scheme as a web based plug-ins phishing toolbar. The results
obtained are promising and showed that our intelligent fuzzy based classification detection
system can provide an effective help for real-time phishing website detection. The toolbar
successfully recognized and detected approximately 92% of the phishing websites selected from
our test data set, avoiding many miss-classified websites and false phishing alarms
Deriving Classifiers with Single and Multi-Label Rules using New Associative Classification Methods
Associative Classification (AC) in data mining is a rule based approach that uses association rule techniques to construct accurate classification systems (classifiers). The majority of existing AC algorithms extract one class per rule and ignore other class labels even when they have large data representation. Thus, extending current AC algorithms to find and extract multi-label rules is promising research direction since new hidden knowledge is revealed for decision makers. Furthermore, the exponential growth of rules in AC has been investigated in this thesis aiming to minimise the number of candidate rules, and therefore reducing the classifier size so end-user can easily exploit and maintain it. Moreover, an investigation to both rule ranking and test data classification steps have been conducted in order to improve the performance of AC algorithms in regards to predictive accuracy.
Overall, this thesis investigates different problems related to AC not limited to the ones listed above, and the results are new AC algorithms that devise single and multi-label rules from different applications data sets, together with comprehensive experimental results. To be exact, the first algorithm proposed named Multi-class Associative Classifier (MAC): This algorithm derives classifiers where each rule is connected with a single class from a training data set. MAC enhanced the rule discovery, rule ranking, rule filtering and classification of test data in AC. The second algorithm proposed is called Multi-label Classifier based Associative Classification (MCAC) that adds on MAC a novel rule discovery method which discovers multi-label rules from single label data without learning from parts of the training data set. These rules denote vital information ignored by most current AC algorithms which benefit both the end-user and the classifier’s predictive accuracy. Lastly, the vital problem related to web threats called “website phishing detection” was deeply investigated where a technical solution based on AC has been introduced in Chapter 6. Particularly, we were able to detect new type of knowledge and enhance the detection rate with respect to error rate using our proposed algorithms and against a large collected phishing data set.
Thorough experimental tests utilising large numbers of University of California Irvine (UCI) data sets and a variety of real application data collections related to website classification and trainer timetabling problems reveal that MAC and MCAC generates better quality classifiers if compared with other AC and rule based algorithms with respect to various evaluation measures, i.e. error rate, Label-Weight, Any-Label, number of rules, etc. This is mainly due to the different improvements related to rule discovery, rule filtering, rule sorting, classification step, and more importantly the new type of knowledge associated with the proposed algorithms. Most chapters in this thesis have been disseminated or under review in journals and refereed conference proceedings
AVATAR - Machine Learning Pipeline Evaluation Using Surrogate Model
© 2020, The Author(s). The evaluation of machine learning (ML) pipelines is essential during automatic ML pipeline composition and optimisation. The previous methods such as Bayesian-based and genetic-based optimisation, which are implemented in Auto-Weka, Auto-sklearn and TPOT, evaluate pipelines by executing them. Therefore, the pipeline composition and optimisation of these methods requires a tremendous amount of time that prevents them from exploring complex pipelines to find better predictive models. To further explore this research challenge, we have conducted experiments showing that many of the generated pipelines are invalid, and it is unnecessary to execute them to find out whether they are good pipelines. To address this issue, we propose a novel method to evaluate the validity of ML pipelines using a surrogate model (AVATAR). The AVATAR enables to accelerate automatic ML pipeline composition and optimisation by quickly ignoring invalid pipelines. Our experiments show that the AVATAR is more efficient in evaluating complex pipelines in comparison with the traditional evaluation approaches requiring their execution
Recommended from our members
E-banking operational risk assessment. A soft computing approach in the context of the Nigerian banking industry.
This study investigates E-banking Operational Risk Assessment (ORA) to enable the development of a new ORA framework and methodology. The general view is that E-banking systems have modified some of the traditional banking risks, particularly Operational Risk (OR) as suggested by the Basel Committee on Banking Supervision in 2003. In addition, recent E-banking financial losses together with risk management principles and standards raise the need for an effective ORA methodology and framework in the context of E-banking. Moreover, evaluation tools and / or methods for ORA are highly subjective, are still in their infant stages, and have not yet reached a consensus. Therefore, it is essential to develop valid and reliable methods for effective ORA and evaluations.
The main contribution of this thesis is to apply Fuzzy Inference System (FIS) and Tree Augmented Naïve Bayes (TAN) classifier as standard tools for identifying OR, and measuring OR exposure level. In addition, a new ORA methodology is proposed which consists of four major steps: a risk model, assessment approach, analysis approach and a risk assessment process. Further, a new ORA framework and measurement metrics are proposed with six factors: frequency of triggering event, effectiveness of avoidance barriers, frequency of undesirable operational state, effectiveness of recovery barriers before the risk outcome, approximate cost for Undesirable Operational State (UOS) occurrence, and severity of the risk outcome.
The study results were reported based on surveys conducted with Nigerian senior banking officers and banking customers. The study revealed that the framework and assessment tools gave good predictions for risk learning and inference in such systems. Thus, results obtained can be considered promising and useful for both E-banking system adopters and future researchers in this area
CHORUS Deliverable 2.1: State of the Art on Multimedia Search Engines
Based on the information provided by European projects and national initiatives related to multimedia search as well as domains experts that participated in the CHORUS Think-thanks and workshops, this document reports on the state of the art related to multimedia content search from, a technical, and socio-economic perspective.
The technical perspective includes an up to date view on content based indexing and retrieval technologies, multimedia search in the context of mobile devices and peer-to-peer networks, and an overview of current evaluation and benchmark inititiatives to measure the performance of multimedia search engines.
From a socio-economic perspective we inventorize the impact and legal consequences of these technical advances and point out future directions of research
Personality Identification from Social Media Using Deep Learning: A Review
Social media helps in sharing of ideas and information among people scattered around the world and thus helps in creating communities, groups, and virtual networks. Identification of personality is significant in many types of applications such as in detecting the mental state or character of a person, predicting job satisfaction, professional and personal relationship success, in recommendation systems. Personality is also an important factor to determine individual variation in thoughts, feelings, and conduct systems. According to the survey of Global social media research in 2018, approximately 3.196 billion social media users are in worldwide. The numbers are estimated to grow rapidly further with the use of mobile smart devices and advancement in technology. Support vector machine (SVM), Naive Bayes (NB), Multilayer perceptron neural network, and convolutional neural network (CNN) are some of the machine learning techniques used for personality identification in the literature review. This paper presents various studies conducted in identifying the personality of social media users with the help of machine learning approaches and the recent studies that targeted to predict the personality of online social media (OSM) users are reviewed