2 research outputs found
Study on ensemble classification methods towards spam filtering
Recently, many scholars have used filter fusion to enhance the performance of spam filtering, and over the past several years considerable effort has been devoted to different ensemble methods to achieve better performance. In practice, how to select an appropriate ensemble method for spam filtering remains an unsolved problem. In this paper, we investigate this problem by designing a framework to compare the performance of various ensemble methods. This can help researchers fight spam email more effectively in applied systems. The experimental results indicate that online methods perform well on accuracy, while off-line batch methods are evidently influenced by the size of the data set. When a large data set is involved, the performance of off-line batch methods is not on par with that of online methods, and within the framework of online methods, the parallel ensemble performs better when only complex filters are used.
An Ensemble Self-Structuring Neural Network Approach to Solving Classification Problems with Virtual Concept Drift and its Application to Phishing Websites
Classification in data mining is one of the well-known tasks that aim to construct a
classification model from a labelled input data set. Most classification models are
devoted to a static environment where the complete training data set is presented to the
classification algorithm. This data set is assumed to cover all information needed to
learn the pertinent concepts (rules and patterns) related to how to classify unseen
examples to predefined classes. However, in dynamic (non-stationary) domains, the set
of features (input data attributes) may change over time. For instance, some features
that are considered significant at time Ti might become useless or irrelevant at time Ti+j.
This situation results in a phenomenon called Virtual Concept Drift. Yet, the set of
features that are dropped at time Ti+j might return to become significant again in the
future. Such a situation results in the so-called Cyclical Concept Drift, which is a direct
result of what is frequently called the catastrophic forgetting dilemma. Catastrophic forgetting
happens when the learning of new knowledge completely removes the previously
learned knowledge.
Phishing is a dynamic classification problem where a virtual concept drift might occur.
Yet, the virtual concept drift that occurs in phishing might be guided by some
malevolent intelligent agent rather than occurring naturally. One reason why phishers
keep changing the feature combination when creating phishing websites might be that
they have the ability to interpret the anti-phishing tool and thus they pick a new set of
features that can circumvent it. However, besides the generalisation capability, fault
tolerance, and strong ability to learn, a Neural Network (NN) classification model is
considered a black box. Hence, even if someone has the skills to hack into the NN-based
classification model, they might find it difficult to interpret and understand how the NN
processes the input data in order to produce the final decision (assign class value).
In this thesis, we investigate the problem of virtual concept drift by proposing a
framework that can keep pace with the continuous changes in the input features. The
proposed framework has been applied to the phishing website classification problem and
it shows competitive results with respect to various evaluation measures (Harmonic
Mean (F1-score), precision, accuracy, etc.) when compared to several other data mining
techniques. The framework creates an ensemble of classifiers (group of classifiers) and it
offers a balance between stability (maintaining previously learned knowledge) and
plasticity (learning knowledge from the newly offered training data set). Hence, the
framework can also handle the cyclical concept drift. The classifiers that constitute the
ensemble are created using an improved Self-Structuring Neural Networks algorithm
(SSNN). Traditionally, NN modelling techniques rely on trial and error, which is a
tedious and time-consuming process. The SSNN simplifies structuring NN classifiers
with minimum intervention from the user. The framework evaluates the ensemble
whenever a new data set chunk is collected. If the overall accuracy of the combined
results from the ensemble drops significantly, a new classifier is created using the SSNN
and added to the ensemble. Overall, the experimental results show that the proposed
framework affords a balance between stability and plasticity and can effectively handle
the virtual concept drift when applied to the phishing website classification problem. Most
of the chapters of this thesis have been subject to publicatio
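
The chunk-driven update described in the abstract (evaluate the ensemble on each new data chunk, and add a newly trained classifier when the combined accuracy drops significantly) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the thesis implementation: the abstract does not say how member outputs are combined or how large a drop counts as significant, so the majority vote, the `max_drop` threshold, and the `train_stump` learner (a trivial stand-in for the SSNN algorithm) are all hypothetical.

```python
from collections import Counter
from typing import Callable, List, Sequence, Tuple

Sample = Tuple[Sequence[float], int]        # (feature vector, class label)
Classifier = Callable[[Sequence[float]], int]


def majority_vote(ensemble: List[Classifier], x: Sequence[float]) -> int:
    """Combine member predictions by simple majority (assumed combination rule)."""
    votes = Counter(clf(x) for clf in ensemble)
    return votes.most_common(1)[0][0]


def ensemble_accuracy(ensemble: List[Classifier], chunk: List[Sample]) -> float:
    """Fraction of the chunk that the combined ensemble classifies correctly."""
    correct = sum(majority_vote(ensemble, x) == y for x, y in chunk)
    return correct / len(chunk)


def train_stump(chunk: List[Sample]) -> Classifier:
    """Hypothetical stand-in for SSNN: use the single feature that best matches the label."""
    n_features = len(chunk[0][0])
    best = max(
        range(n_features),
        key=lambda j: sum(int(x[j] > 0.5) == y for x, y in chunk),
    )
    return lambda x: int(x[best] > 0.5)


def update_ensemble(
    ensemble: List[Classifier],
    chunk: List[Sample],
    train: Callable[[List[Sample]], Classifier],
    baseline_accuracy: float,
    max_drop: float = 0.05,                 # assumed meaning of "drops significantly"
) -> float:
    """Evaluate the ensemble on a new chunk and grow it if accuracy has dropped.

    Existing members are never removed (stability); a new member trained on
    the latest chunk is appended when needed (plasticity).
    Returns the accuracy measured before any member was added.
    """
    accuracy = ensemble_accuracy(ensemble, chunk)
    if baseline_accuracy - accuracy > max_drop:
        ensemble.append(train(chunk))
    return accuracy


# Toy data: the label follows feature 0 in the first chunk and feature 1 in the second.
chunk1 = [((1, 0), 1), ((0, 1), 0), ((1, 1), 1), ((0, 0), 0)]
chunk2 = [((1, 0), 0), ((0, 1), 1), ((1, 1), 1), ((0, 0), 0)]

ensemble = [train_stump(chunk1)]
acc1 = update_ensemble(ensemble, chunk1, train_stump, baseline_accuracy=1.0)
acc2 = update_ensemble(ensemble, chunk2, train_stump, baseline_accuracy=acc1)
```

In this toy run the first chunk leaves the ensemble unchanged, while the second chunk, whose labels depend on a different feature, triggers the creation of a second member. Keeping the older member rather than replacing it is what would let such a scheme respond to a feature set that later becomes relevant again.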