15 research outputs found

    A modified Learn++.NSE algorithm for dealing with concept drift

    © 2014 by World Scientific Publishing Co. Pte. Ltd. All rights reserved. Concept drift is pervasive in real-world applications. Because drift can take many forms, it is difficult for a learning algorithm to track it closely. Learn++.NSE is an incremental ensemble learner that makes no assumption about the type of drift. Although it handles concept drift well, it is computationally expensive and needs considerable time to recover from a drop in accuracy. This paper proposes a modified Learn++.NSE algorithm. While learning from instances in a data stream, the algorithm first identifies where and when drift has happened, then uses the instances accumulated by the drift detection method to create a new base classifier, and finally organizes all existing classifiers using the Learn++.NSE weighting mechanism to update the ensemble learner. The modified algorithm reduces the computation cost without any drop in performance and speeds up accuracy recovery when drift happens
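The reweighting step described in this abstract can be illustrated with a minimal sketch. It assumes that each base classifier is a plain Python callable and that a classifier's voting weight is the log-odds of its error on the most recent batch. The function names (`error_rate`, `voting_weight`, `reweight`, `weighted_vote`) and the toy data are hypothetical, and the published Learn++.NSE weighting is more elaborate than this; only the idea of error-based voting weights is shown.

```python
import math


def error_rate(classifier, batch):
    """Fraction of (x, y) pairs in `batch` that `classifier` gets wrong."""
    return sum(classifier(x) != y for x, y in batch) / len(batch)


def voting_weight(eps, floor=1e-6):
    """Log-odds weight; an error of 0.5 or worse is clamped so the weight is 0."""
    eps = min(max(eps, floor), 0.5)
    return math.log((1.0 - eps) / eps)


def reweight(classifiers, batch):
    """Recompute every classifier's weight from its error on the latest batch."""
    return [voting_weight(error_rate(clf, batch)) for clf in classifiers]


def weighted_vote(classifiers, weights, x):
    """Return the label with the largest total voting weight."""
    totals = {}
    for clf, w in zip(classifiers, weights):
        label = clf(x)
        totals[label] = totals.get(label, 0.0) + w
    return max(totals, key=totals.get)


# Toy stream: the labelling rule has flipped, so the old classifier is now wrong.
old_concept = lambda x: int(x > 0)
new_concept = lambda x: int(x <= 0)
recent_batch = [(-2, 1), (-1, 1), (1, 0), (2, 0)]

ensemble = [old_concept, new_concept]
weights = reweight(ensemble, recent_batch)
prediction = weighted_vote(ensemble, weights, 3)
```

In this toy run the outdated classifier ends up with zero weight, so the ensemble follows the classifier trained on the new concept without discarding the old one.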

    Concept drift detection based on anomaly analysis

    © Springer International Publishing Switzerland 2014. In online machine learning, the ability to adapt quickly to a new concept is highly desirable. In this paper, we propose a novel concept drift detection method, called Anomaly Analysis Drift Detection (AADD), to improve the performance of machine learning algorithms in non-stationary environments. The proposed AADD method is based on an anomaly analysis of the learner’s accuracy associated with the similarity between the learner’s training domain and the test data. The method first identifies whether there are conflicts between the current concept and newly arriving data. If there are none, the learner incrementally learns the non-conflicting data, which does not decrease its accuracy on previously learned data, thereby extending the concept. Otherwise, a new learner is created from the new data. Experiments illustrate that AADD can detect new concepts quickly and learn extensional drift incrementally

    Email classification via intention-based segmentation

    Email is the most popular means of personal and official communication among people and organizations. Because the virtual environment is untrusted, email systems face frequent attacks such as malware, spamming, and social engineering. Spamming is the most common malicious activity: unsolicited emails are sent in bulk, and these spam emails can carry malware and waste resources, and hence degrade productivity. In spam filter development, the most important challenge is to find the correlation between the nature of spam and the interests of users, because those interests are dynamic. This paper proposes a novel dynamic spam filter model that accounts for changes in users' interests over time while handling spam. It uses intention-based segmentation to compare different segments of text documents instead of comparing them as a whole. The proposed spam filter is a multi-tier approach in which the email content is first divided into segments with the help of part-of-speech (POS) tagging based on voices and tenses. Next, the segments are clustered using hierarchical clustering and compared using the vector space model. In the third stage, concept drift is detected in the clusters to identify changes in the user's interests. In the last stage, ham emails are classified into various categories. The Enron dataset is used for the experiments, and the obtained results are promising

    MORPH: Towards Automated Concept Drift Adaptation for Malware Detection

    Concept drift is a significant challenge for malware detection, as the performance of trained machine learning models degrades over time, rendering them impractical. While prior research in malware concept drift adaptation has primarily focused on active learning, which involves selecting representative samples to update the model, self-training has emerged as a promising approach to mitigate concept drift. Self-training involves retraining the model using pseudo labels to adapt to shifting data distributions. In this research, we propose MORPH -- an effective pseudo-label-based concept drift adaptation method specifically designed for neural networks. Through extensive experimental analysis of Android and Windows malware datasets, we demonstrate the efficacy of our approach in mitigating the impact of concept drift. Our method offers the advantage of reducing annotation efforts when combined with active learning. Furthermore, our method significantly improves over existing works in automated concept drift adaptation for malware detection
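Self-training with pseudo labels, as described in this abstract, can be sketched on a toy problem. The sketch below is not MORPH and does not use a neural network: it assumes a one-dimensional nearest-centroid classifier with at least two classes, and it uses the distance margin between the two closest centroids as a stand-in confidence score. All names (`fit_centroids`, `predict_with_margin`, `self_train_step`, `min_margin`) and the data are hypothetical.

```python
def fit_centroids(points, labels):
    """Mean of the points assigned to each label."""
    sums, counts = {}, {}
    for x, y in zip(points, labels):
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}


def predict_with_margin(centroids, x):
    """Nearest-centroid label plus a margin that serves as a confidence score."""
    ranked = sorted(centroids, key=lambda y: abs(x - centroids[y]))
    best, runner_up = ranked[0], ranked[1]
    margin = abs(x - centroids[runner_up]) - abs(x - centroids[best])
    return best, margin


def self_train_step(centroids, unlabeled, min_margin):
    """Pseudo-label confident points and refit; keep old centroids otherwise."""
    points, pseudo_labels = [], []
    for x in unlabeled:
        label, margin = predict_with_margin(centroids, x)
        if margin >= min_margin:  # discard low-confidence pseudo labels
            points.append(x)
            pseudo_labels.append(label)
    updated = dict(centroids)
    updated.update(fit_centroids(points, pseudo_labels))
    return updated


# Model fitted before drift; the unlabeled data has shifted upwards by about 2.
model = {0: 0.0, 1: 10.0}
drifted = [1.0, 2.0, 3.0, 5.5, 11.0, 12.0, 13.0]
adapted = self_train_step(model, drifted, min_margin=2.0)
```

The ambiguous point at 5.5 is left out of the update because its margin is below the threshold, which is the usual safeguard against reinforcing wrong pseudo labels.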

    Fuzzy competence model drift detection for data-driven decision support systems

    © 2017 Elsevier B.V. This paper focuses on concept drift in business intelligence and data-driven decision support systems (DSSs). The assumption of a fixed distribution in the data renders conventional static DSSs inaccurate and unable to make correct decisions when concept drift occurs. It is therefore important to know when, how, and where concept drift occurs so that a DSS can adjust its decision processing knowledge to adapt to an ever-changing environment at the appropriate time. This paper presents a data distribution-based concept drift detection method called fuzzy competence model drift detection (FCM-DD). By introducing fuzzy set theory and replacing crisp boundaries with fuzzy ones, we have improved the competence model to provide a better, more refined empirical distribution of the data stream. FCM-DD requires no prior knowledge of the underlying distribution and provides a statistical guarantee of the reliability of the detected drift, based on the theory of bootstrapping. A series of experiments shows that the proposed FCM-DD method can detect drift more accurately, has good sensitivity, and is robust

    Diagnostic Tool for Out-of-Sample Model Evaluation

    Assessment of model fitness is a key part of machine learning. The standard paradigm is to learn models by minimizing a chosen loss function averaged over training data, with the aim of achieving small losses on future data. In this paper, we consider the use of a finite calibration data set to characterize the future, out-of-sample losses of a model. We propose a simple model diagnostic tool that provides finite-sample guarantees under weak assumptions. The tool is simple to compute and to interpret. Several numerical experiments are presented to show how the proposed method quantifies the impact of distribution shifts, aids the analysis of regression, and enables model selection as well as hyper-parameter tuning.
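One standard way to obtain a finite-sample statement about a future loss from a calibration set is an order-statistic bound. The sketch below illustrates that generic construction and is not necessarily the tool proposed in the paper: it assumes the calibration losses and the future loss are exchangeable, in which case the k-th smallest of n calibration losses, with k = ceil((n + 1)(1 - alpha)), is at least as large as the future loss with probability at least 1 - alpha. The function name `loss_upper_bound` and the numbers are hypothetical.

```python
import math


def loss_upper_bound(calibration_losses, alpha):
    """Order-statistic bound on one future loss at level 1 - alpha.

    Assumes exchangeability of calibration and future losses. Returns
    infinity when the calibration set is too small for the requested level.
    """
    n = len(calibration_losses)
    k = math.ceil((n + 1) * (1.0 - alpha))
    if k > n:
        return math.inf
    return sorted(calibration_losses)[k - 1]


# Nine held-out losses, deliberately unsorted.
losses = [3.0, 9.0, 1.0, 7.0, 5.0, 2.0, 8.0, 4.0, 6.0]
bound_80 = loss_upper_bound(losses, alpha=0.2)
bound_95 = loss_upper_bound(losses, alpha=0.05)
```

With nine calibration points the 80% bound is the eighth smallest loss, while a 95% bound is not attainable and the function reports infinity instead of an overconfident number.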

    Discovering and forecasting interactions in big data research: A learning-enhanced bibliometric study

    © 2018 As one of the most impactful emerging technologies, big data analytics and its related applications are powering the development of information technologies and are significantly shaping thinking and behavior in today's interconnected world. Exploring the technological evolution of big data research is an effective way to enhance technology management and create value for research and development strategies for both government and industry. This paper uses a learning-enhanced bibliometric study to discover interactions in big data research by detecting and visualizing its evolutionary pathways. Concentrating on a set of 5840 articles derived from Web of Science covering the period between 2000 and 2015, text mining and bibliometric techniques are combined to profile the hotspots in big data research and its core constituents. A learning process is used to enhance the ability to identify the interactive relationships between topics in sequential time slices, revealing technological evolution and death. The outputs include a landscape of interactions within big data research from 2000 to 2015 with a detailed map of the evolutionary pathways of specific technologies. Empirical insights for related studies in science policy, innovation management, and entrepreneurship are also provided

    Reputation-based maintenance in case-based reasoning

    Case Base Maintenance algorithms update the contents of a case base in order to improve case-based reasoner performance. In this paper, we introduce a new case base maintenance method called Reputation-Based Maintenance (RBM) with the aim of increasing the classification accuracy of a Case-Based Reasoning system while reducing the size of its case base. The proposed RBM algorithm calculates a case property called Reputation for each member of the case base, the value of which reflects the competence of the related case. Based on this case property, several removal policies and maintenance methods have been designed, each focusing on different aspects of the case base maintenance. The performance of the RBM method was compared with well-known state-of-the-art algorithms. The tests were performed on 30 datasets selected from the UCI repository. The results show that the RBM method in all its variations achieves greater accuracy than a baseline CBR, while some variations significantly outperform the state-of-the-art methods. We particularly highlight the RBM_ACBR algorithm, which achieves the highest accuracy among the methods in the comparison to a statistically significant degree, and the RBMcr algorithm, which increases the baseline accuracy while removing, on average, over half of the case base. This work has been partially supported by the Spanish Ministry of Science and Innovation with project MISMIS-LANGUAGE (grant number PGC2018-096212-B-C33), by the Catalan Agency of University and Research Grants Management (AGAUR) (grants number 2017 SGR 341 and 2017 SGR 574), by Spanish Network ‘‘Learning Machines for Singular Problems and Applications (MAPAS)’’ (TIN2017-90567-REDT, MINECO/FEDER EU) and by the European Union’s Horizon 2020 research and innovation programme under the Marie Sklodowska-Curie grant agreement No. 860843. Peer Reviewed. Postprint (author's final draft)