
    A Comparison of Ensemble and Case-Base Maintenance Techniques for Handling Concept Drift in Spam Filtering

    The problem of concept drift has recently received considerable attention in machine learning research. One important practical problem where concept drift needs to be addressed is spam filtering. The literature on concept drift shows that ensembles are among the most promising approaches, and a variety of techniques for ensemble construction has been proposed. In this paper we compare the ensemble approach to an alternative lazy learning approach to concept drift, whereby a single case-based classifier for spam filtering keeps itself up to date through a case-base maintenance protocol. We present an evaluation showing that the case-base maintenance approach is more effective than a selection of ensemble techniques. The evaluation is complicated by the overriding importance of False Positives (FPs) in spam filtering. The ensemble approaches can have very good performance on FPs because it is possible to bias an ensemble more strongly away from FPs than it is to bias the single classifier. However, this comes at considerable cost to the overall accuracy.
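    The single-classifier alternative can be made concrete. Below is a minimal Python sketch of a case-based spam filter that keeps itself current by retaining only a window of recently labelled messages; the class name, the Jaccard token similarity and the fixed-window eviction policy are illustrative assumptions, not the paper's actual maintenance protocol.

```python
# Illustrative only: a sliding-window case-based spam filter, standing in
# for the paper's case-base maintenance protocol.
from collections import Counter, deque

class SlidingWindowKNN:
    def __init__(self, k=3, window=500):
        self.k = k
        self.cases = deque(maxlen=window)  # (token set, label) pairs

    def _similarity(self, a, b):
        # Jaccard overlap between token sets; a stand-in for the richer
        # feature-based similarity a real CBR filter would use.
        return len(a & b) / len(a | b) if a | b else 0.0

    def classify(self, tokens):
        tokens = frozenset(tokens)
        if not self.cases:
            return "ham"  # conservative default before any cases exist
        nearest = sorted(self.cases,
                         key=lambda c: self._similarity(tokens, c[0]),
                         reverse=True)[:self.k]
        votes = Counter(label for _, label in nearest)
        return votes.most_common(1)[0][0]

    def update(self, tokens, label):
        # Case-base maintenance at its simplest: append the newly labelled
        # message and let the deque evict the oldest case automatically.
        self.cases.append((frozenset(tokens), label))

filt = SlidingWindowKNN(k=3, window=4)
for text, label in [("cheap pills now", "spam"), ("meeting at noon", "ham"),
                    ("pills discount offer", "spam"), ("lunch at noon?", "ham")]:
    filt.update(text.split(), label)
print(filt.classify("discount pills".split()))  # -> spam
```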

    A concept drift-tolerant case-base editing technique

    The evolving nature and accumulating volume of real-world data inevitably give rise to the so-called "concept drift" issue, causing many deployed Case-Based Reasoning (CBR) systems to require additional maintenance procedures. In Case-base Maintenance (CBM), case-base editing strategies to revise the case-base have proven to be effective instance selection approaches for handling concept drift. Motivated by current issues related to CBR techniques in handling concept drift, we present a two-stage case-base editing technique. In Stage 1, we propose a Noise-Enhanced Fast Context Switch (NEFCS) algorithm, which targets the removal of noise in a dynamic environment, and in Stage 2, we develop an innovative Stepwise Redundancy Removal (SRR) algorithm, which reduces the size of the case-base by eliminating redundancies while preserving the case-base coverage. Experimental evaluations on several public real-world datasets show that our case-base editing technique significantly improves accuracy compared to other case-base editing approaches on concept drift tasks, while preserving its effectiveness on static tasks.
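    As a rough illustration of the two-stage structure (and only that: NEFCS and SRR themselves rely on competence models and drift-aware heuristics not reproduced here), the following Python sketch first drops noisy cases whose labels disagree with their neighbourhood, then drops redundant cases that the remaining case-base can already classify.

```python
# Hypothetical simplification of two-stage case-base editing: an ENN-style
# noise filter stands in for NEFCS, and a coverage-preserving redundancy
# pass stands in for SRR.
import math

def _majority_label(case, pool, k):
    nbrs = sorted((c for c in pool if c is not case),
                  key=lambda c: math.dist(case[0], c[0]))[:k]
    labels = [lbl for _, lbl in nbrs]
    return max(set(labels), key=labels.count)

def edit_case_base(cases, k=3):
    # Stage 1 (noise removal): drop cases whose label disagrees with the
    # majority of their k nearest neighbours.
    kept = [c for c in cases if _majority_label(c, cases, k) == c[1]]
    # Stage 2 (redundancy removal): drop a case if the remaining case-base
    # still classifies it correctly, so coverage is preserved.
    edited = list(kept)
    for case in kept:
        rest = [c for c in edited if c is not case]
        if len(rest) >= k and _majority_label(case, rest, k) == case[1]:
            edited.remove(case)
    return edited

cases = [((0.0, 0.0), "ham"), ((0.1, 0.0), "ham"), ((0.0, 0.1), "ham"),
         ((1.0, 1.0), "spam"), ((1.1, 1.0), "spam"), ((0.9, 1.1), "spam"),
         ((0.05, 0.05), "spam")]  # the last case is label noise
print(len(edit_case_base(cases)))  # noise gone, redundancy trimmed -> 4
```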

    Applying lazy learning algorithms to tackle concept drift in spam filtering

    Many machine learning techniques have been applied to problems where data is collected over an extended period of time. However, a difficulty with many real-world applications is that the distribution underlying the data is likely to change over time. In these situations, a problem that many global eager learners face is their inability to adapt to local concept drift. Concept drift in spam is particularly difficult, as spammers actively change the nature of their messages to elude spam filters. Algorithms that track concept drift must be able to identify a change in the target concept (spam or legitimate e-mails) without direct knowledge of the underlying shift in distribution. In this paper we show how a previously successful instance-based reasoning e-mail filtering model can be improved in order to better track concept drift in the spam domain. Our proposal is based on the definition of two complementary techniques able to select both terms and e-mails representative of the current situation. The enhanced system is evaluated against other well-known, successful lazy learning approaches in two scenarios, all within a cost-sensitive framework. The results obtained from the experiments carried out are very promising and back up the idea that instance-based reasoning systems can offer a number of advantages in tackling concept drift in dynamic problems, as in the case of the anti-spam filtering domain.
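    A hedged sketch of the general idea, not the paper's exact term- and e-mail-selection techniques: re-derive the representative vocabulary from a sliding window of recent messages, so that the features themselves follow the drift. The class and parameter names below are invented for illustration.

```python
# Illustrative drift-aware term selection: representative terms are
# recomputed from recent traffic only, so obsolete terms drop out.
from collections import Counter, deque

class DriftAwareVocabulary:
    def __init__(self, window=1000, vocab_size=50):
        self.recent = deque(maxlen=window)   # token sets of recent e-mails
        self.vocab_size = vocab_size

    def observe(self, tokens):
        self.recent.append(set(tokens))

    def representative_terms(self):
        # Document frequency over the recent window only: terms that were
        # common last year but vanished from current traffic are excluded.
        df = Counter(t for msg in self.recent for t in msg)
        return {t for t, _ in df.most_common(self.vocab_size)}

vocab = DriftAwareVocabulary(window=3, vocab_size=5)
for msg in ["win cash now", "cash prize win", "crypto wallet airdrop"]:
    vocab.observe(msg.split())
print(vocab.representative_terms())
```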

    How to Cope with Change? - Preserving Validity of Predictive Services over Time

    Companies increasingly rely on predictive services that constantly monitor and analyze the available data streams for better service offerings. However, sudden or incremental changes in those streams challenge the validity and proper functioning of a predictive service over time. We develop a framework for characterizing and differentiating predictive services with regard to their ongoing validity. Furthermore, this work proposes a research agenda of worthwhile topics to improve the long-term validity of predictive services. In our work, we especially focus on different scenarios of true label availability for predictive services, as well as on the integration of expert knowledge. With these insights at hand, we lay an important foundation for future research in the field of valid predictive services.

    A review of spam email detection: analysis of spammer strategies and the dataset shift problem

    Spam emails have traditionally been seen as just annoying and unsolicited emails containing advertisements, but they increasingly include scams, malware or phishing. In order to ensure the security and integrity of users, organisations and researchers aim to develop robust filters for spam email detection. Recently, most spam filters based on machine learning algorithms published in academic journals report very high performance, but users are still reporting a rising number of frauds and attacks via spam emails. Two main challenges can be found in this field: (a) it is a very dynamic environment prone to the dataset shift problem and (b) it suffers from the presence of an adversarial figure, i.e. the spammer. Unlike classical spam email reviews, this one is particularly focused on the problems that this constantly changing environment poses. Moreover, we analyse the different spammer strategies used for contaminating the emails, and we review the state-of-the-art techniques to develop filters based on machine learning. Finally, we empirically evaluate and present the consequences of ignoring the matter of dataset shift in this practical field. Experimental results show that this shift may lead to severe degradation in the estimated generalisation performance, with error rates reaching values up to 48.81%. Open-access publication funded by the Consorcio de Bibliotecas Universitarias de Castilla y León (BUCLE), under Operational Programme 2014ES16RFOP009 FEDER 2014-2020 of Castilla y León, Action 20007-CL - Apoyo Consorcio BUCLE.
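    The reported degradation is easy to reproduce in miniature. The sketch below (synthetic data; assumes scikit-learn is installed) contrasts a random train/test split, which mixes past and future messages and yields an optimistic estimate, with a chronological split that exposes the shift.

```python
# Sketch of the evaluation pitfall the review highlights: a random split
# hides dataset shift, while a chronological split exposes it.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=(n, 2))
# The decisive feature flips halfway through the stream: concept drift.
y = (X[:, 0] > 0).astype(int)
y[n // 2:] = (X[n // 2:, 1] > 0).astype(int)

# Optimistic estimate: a random split mixes past and future instances.
Xa, Xb, ya, yb = train_test_split(X, y, test_size=0.5, random_state=0)
print("random split:  ", LogisticRegression().fit(Xa, ya).score(Xb, yb))

# Realistic estimate: train on the past, test on the future.
half = n // 2
clf = LogisticRegression().fit(X[:half], y[:half])
print("temporal split:", clf.score(X[half:], y[half:]))  # near chance
```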

    SpamHunting: An instance-based reasoning system for spam labelling and filtering

    In this paper we show an instance-based reasoning e-mail filtering model that outperforms classical machine learning techniques and other successful lazy learning approaches in the domain of anti-spam filtering. The architecture of the learning-based anti-spam filter is based on a tuneable enhanced instance retrieval network able to accurately generalize e-mail representations. The reuse of similar messages is carried out by a simple unanimous voting mechanism to determine whether the target case is spam or not. Prior to the final response of the system, the revision stage is performed only when the assigned class is spam, whereby the system employs general knowledge in the form of meta-rules.
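    The reuse step lends itself to a compact illustration. The function below is a minimal sketch of unanimous voting as described, not SpamHunting's retrieval network or revision meta-rules: a message is labelled spam only when every retrieved neighbour is spam, which biases the filter away from false positives.

```python
# Minimal sketch of the unanimous-voting reuse step (retrieval and the
# spam-only revision stage of SpamHunting are not reproduced here).
def unanimous_vote(neighbour_labels):
    # A single dissenting "ham" neighbour blocks the spam verdict, so
    # legitimate mail is rarely misclassified, at the cost of some misses.
    return "spam" if neighbour_labels and all(
        label == "spam" for label in neighbour_labels) else "ham"

print(unanimous_vote(["spam", "spam", "spam"]))  # -> spam
print(unanimous_vote(["spam", "ham", "spam"]))   # -> ham
```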

    Classifying distinct data types: textual streams, protein sequences and genomic variants

    Artificial Intelligence (AI) is an interdisciplinary field combining different research areas with the end goal of automating processes in everyday life and industry. The fundamental components of AI models are an "intelligent" model and a functional component defined by the end application. That is, an intelligent model can be a statistical model that recognizes patterns in data instances in order to distinguish between these instances. For example, if AI is applied in car manufacturing, then based on an image of a part of a car the model can categorize whether the car part belongs to the front, middle or rear compartment of the car, as a human brain would do. For the same example application, the statistical model informs a mechanical arm, the functional component, of the current car compartment, and the arm in turn assembles this compartment of the car based on predefined instructions, much as a human hand would follow the brain's neural signals.

    A crucial step of AI applications is the classification of input instances by the intelligent model. The classification step in the intelligent model pipeline allows the subsequent steps to act in a similar fashion for instances belonging to the same category. We define classification as the module of the intelligent model which categorizes the input instances based on predefined human-expert or data-driven patterns in the instances. Irrespective of the method used to find patterns in data, classification is composed of four distinct steps: (i) input representation, (ii) model building, (iii) model prediction and (iv) model assessment. Based on these classification steps, we argue that applying classification to distinct data types holds different challenges. In this thesis, I focus on challenges for three distinct classification scenarios: (i) Textual Streams: how can the model building step, commonly used for a static data distribution, be advanced to classify textual posts with a transient data distribution? (ii) Protein Prediction: which biologically meaningful information can be used in the input representation step to overcome the challenge of limited training data? (iii) Human Variant Pathogenicity Prediction: how can a classification system for the functional impact of human variants be developed that provides standardized and well-accepted evidence for the classification outcome, thus enabling the model assessment step?

    To answer these research questions, I present my contributions to classifying these different types of data. temporalMNB: I adapt the sequential prediction with expert advice paradigm to optimally aggregate complementary distributions, enhancing a Naive Bayes model so that it adapts to the drifting distribution of the characteristics of textual posts. dom2vec: our proposal to learn embedding vectors for protein domains using self-supervision. Based on the high performance achieved by the dom2vec embeddings in quantitative intrinsic assessment of the captured biological information, I provide example evidence for an analogy between local linguistic features in natural languages and the domain structure and function information in domain architectures. Last, I describe the GenOtoScope bioinformatics software tool, which automates standardized evidence-based criteria for the pathogenicity impact of variants associated with hearing loss.
    Finally, to increase the practical use of our last contribution, I develop easy-to-use software interfaces to be used, in research settings, by clinical diagnostics personnel.
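    Of the three contributions, temporalMNB is the most algorithmically self-contained, and its expert-advice backbone can be sketched in a few lines. The Python below is an assumption-laden miniature: two fixed "experts" (say, a long-memory and a short-memory estimate of the spam probability) are mixed with multiplicative-weights updates, so the mixture shifts toward whichever expert tracks the drifting stream; the thesis's actual feature distributions and update rule are richer.

```python
# Toy version of prediction with expert advice over two probability
# estimates; weights decay exponentially with each expert's loss.
import math

def aggregate(p_long, p_short, w, eta=0.5, outcome=None):
    """Blend two expert probabilities for class 'spam'; update weights
    by exponential loss once the true outcome (0/1) is revealed."""
    p_mix = (w[0] * p_long + w[1] * p_short) / (w[0] + w[1])
    if outcome is not None:
        for i, p in enumerate((p_long, p_short)):
            loss = abs(p - outcome)          # absolute loss of each expert
            w[i] *= math.exp(-eta * loss)    # downweight the worse expert
    return p_mix, w

w = [1.0, 1.0]
# After drift, the short-memory expert is right and gains weight.
for _ in range(20):
    _, w = aggregate(p_long=0.2, p_short=0.9, w=w, outcome=1)
print(w)  # the short-memory weight now dominates
```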