    A review of spam email detection: analysis of spammer strategies and the dataset shift problem

    Spam emails have traditionally been seen as merely annoying, unsolicited emails containing advertisements, but they increasingly include scams, malware or phishing. To ensure the security and integrity of users, organisations and researchers aim to develop robust filters for spam email detection. Recently, most spam filters based on machine learning algorithms published in academic journals report very high performance, yet users still report a rising number of frauds and attacks via spam emails. Two main challenges can be found in this field: (a) it is a very dynamic environment prone to the dataset shift problem and (b) it suffers from the presence of an adversarial figure, i.e. the spammer. Unlike classical spam email reviews, this one focuses on the problems that this constantly changing environment poses. Moreover, we analyse the different spammer strategies used for contaminating the emails, and we review the state-of-the-art techniques to develop filters based on machine learning. Finally, we empirically evaluate and present the consequences of ignoring dataset shift in this practical field. Experimental results show that this shift may lead to severe degradation in the estimated generalisation performance, with error rates reaching values up to 48.81%. Open access publication funded by the Consorcio de Bibliotecas Universitarias de Castilla y León (BUCLE), charged to Operational Programme 2014ES16RFOP009 FEDER 2014-2020 DE CASTILLA Y LEÓN, Action: 20007-CL - Apoyo Consorcio BUCL

    Categorical Change: Exploring the Effects of Concept Drift in Human Perceptual Category Learning

    Categorization is an essential survival skill that we engage in daily. A multitude of behavioral and neuropsychological evidence supports the existence of multiple learning systems involved in category learning. The COmpetition between Verbal and Implicit Systems (COVIS) theory provides a neuropsychological basis for the existence of explicit and implicit learning systems involved in the learning of category rules. COVIS provides a convincing account of asymptotic performance in human category learning. However, COVIS, like virtually all current theories of category learning, focuses solely on categories and decision environments that remain stationary over time. Our environment is dynamic, and we often need to adapt our decision making to account for environmental or categorical changes. Machine learning addresses this significant challenge through what is termed concept drift. Concept drift occurs any time a data distribution changes over time. This dissertation draws on two key characteristics of concept drift in machine learning known to impact the performance of learning models, and in so doing provides the first systematic exploration of concept drift (i.e., categorical change) in human perceptual category learning. Four experiments, each including one key change parameter (category base-rates, payoffs, or category structure [RB/II]), investigated the effect of rate of change (abrupt, gradual) and awareness of change (foretold or not) on decision criterion adaptation. Critically, Experiments 3 and 4 evaluated differences in categorical adaptation within explicit and implicit category learning tasks to determine whether rate and awareness of change moderated any learning system differences. The results of these experiments inform current category learning theory and provide information for machine learning models of decision support in non-stationary environments.
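The distinction between abrupt and gradual change in a category base-rate can be illustrated with a small schedule generator. This is only a sketch under stated assumptions, not the dissertation's procedure: the function name `base_rate_schedule`, its parameters, and the linear ramp are all hypothetical choices made for illustration.

```python
def base_rate_schedule(n_trials, start, end, change_at, mode="abrupt", ramp=0):
    """Return a hypothetical category base rate for every trial.

    mode="abrupt":  the rate jumps from `start` to `end` at trial `change_at`.
    mode="gradual": the rate moves linearly from `start` to `end` over
                    `ramp` trials beginning at `change_at`.
    """
    if mode not in ("abrupt", "gradual"):
        raise ValueError("mode must be 'abrupt' or 'gradual'")
    if mode == "gradual" and ramp < 1:
        raise ValueError("gradual change needs ramp >= 1")

    rates = []
    for t in range(n_trials):
        if t < change_at:
            # Before the change point the environment is stationary.
            rates.append(start)
        elif mode == "abrupt" or t >= change_at + ramp:
            # After an abrupt change, or once the ramp has finished.
            rates.append(end)
        else:
            # Inside the ramp: linear interpolation between start and end.
            fraction = (t - change_at + 1) / (ramp + 1)
            rates.append(start + (end - start) * fraction)
    return rates


abrupt = base_rate_schedule(6, 0.5, 0.9, change_at=3)
gradual = base_rate_schedule(8, 0.5, 0.9, change_at=2, mode="gradual", ramp=3)
```

Here `abrupt` switches from 0.5 to 0.9 in a single trial, while `gradual` passes through intermediate rates before settling at 0.9.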

    Evaluating Classifiers During Dataset Shift

    Deployment of a classifier into a machine learning application likely begins with training different types of algorithms on a subset of the available historical data and then evaluating them on datasets that are drawn from identical distributions. The goal of this evaluation process is to select the classifier that is believed to be most robust in maintaining good future performance, and then deploy that classifier to end-users who use it to make predictions on new data. Often, predictive models are deployed in conditions that differ from those used in training, meaning that dataset shift has occurred. In these situations, there are no guarantees that predictions made by the model in deployment will be as reliable and accurate as they were during training. This study demonstrated a technique that others can use when selecting a classifier for deployment, and presented the first comparative study that evaluates machine learning classifier performance on synthetic datasets with different levels of prior-probability, covariate, and concept dataset shift. The results showed the impact of dataset shift on the performance of different classifiers for two real-world datasets, one related to teacher retention in Wisconsin and one to detecting fraud in testing, and demonstrated a framework that others can use when selecting a classifier for deployment. Had the methods from this study been used as a proactive approach to evaluate classifiers on synthetic dataset shift, different classifiers would have been considered for deployment of both predictive models, compared to using only evaluation datasets drawn from identical distributions. The results from both real-world datasets also showed that no classifier dealt well with prior-probability shift and that classifiers were affected less by covariate and concept shift than was expected.
Two supplemental demonstrations of the methodology showed that it can be extended for additional purposes of evaluating classifiers on dataset shift. Results from analyzing the effects of hyperparameter choices on classifier performance under dataset shift, as well as the effects of actual dataset shift on classifier performance, showed that different hyperparameter configurations affect the performance of a classifier in general, but can also affect how robust that classifier is to dataset shift.
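One way to picture evaluation under synthetic prior-probability shift is to resample a labelled evaluation set to a range of class priors and score a fixed classifier on each. The following is a minimal sketch under that assumption, not the study's actual procedure: the helper names (`resample_to_prior`, `evaluate_under_prior_shift`), the binary 0/1 labels, and resampling with replacement are all illustrative choices.

```python
import random


def resample_to_prior(samples, positive_prior, size, rng):
    """Draw `size` (x, label) pairs with replacement so that a fraction
    `positive_prior` (rounded to a whole count) has label 1."""
    positives = [s for s in samples if s[1] == 1]
    negatives = [s for s in samples if s[1] == 0]
    n_pos = round(size * positive_prior)
    drawn = [rng.choice(positives) for _ in range(n_pos)]
    drawn += [rng.choice(negatives) for _ in range(size - n_pos)]
    rng.shuffle(drawn)
    return drawn


def accuracy(classifier, samples):
    """Fraction of samples for which classifier(x) equals the label."""
    return sum(classifier(x) == y for x, y in samples) / len(samples)


def evaluate_under_prior_shift(classifier, samples, priors, size=1000, seed=0):
    """Score one fixed classifier on evaluation sets resampled to each prior."""
    rng = random.Random(seed)
    return {
        p: accuracy(classifier, resample_to_prior(samples, p, size, rng))
        for p in priors
    }


# Toy usage: a hypothetical threshold classifier on one numeric feature.
data = [(0.2, 0), (0.4, 0), (0.6, 1), (0.9, 1), (0.45, 1)]
scores = evaluate_under_prior_shift(lambda x: int(x > 0.5), data, [0.1, 0.5, 0.9])
```

Because the classifier is held fixed, any change in the scores across priors comes only from the changed class balance, which is what makes the comparison between candidate classifiers meaningful.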

    Assessing the Impact of Changing Environments on Classifier Performance

    The purpose of this paper is to test the hypothesis that simple classifiers are more robust to changing environments than complex ones. We propose a strategy for generating artificial, but realistic domains, which allows us to control the changing environment and test a variety of situations. Our results suggest that evaluating classifiers on such tasks is not straightforward since the changed environment can yield a simpler or more complex domain. We propose a metric capable of taking this issue into consideration and evaluate our classifiers using it. We conclude that in mild cases of population drifts simple classifiers deteriorate more than complex ones and that in more severe cases as well as in class definition changes, all classifiers deteriorate to about the same extent. This means that in all cases, complex classifiers remain more accurate than simpler ones, thus challenging the hypothesis that simple classifiers are more robust to changing environments than complex ones.