85 research outputs found

    Learning from Data Streams with Randomized Forests

    Non-stationary streaming data poses a familiar challenge in machine learning: the need to obtain fast and accurate predictions. A data stream is a continuously generated sequence of data, typically arriving rapidly. Data streams are often characterised by a non-stationary generative process, with concept drift occurring as the process changes. Such processes are common in the real world, for example in advertising, shopping trends, environmental conditions, electricity monitoring and traffic monitoring. Typical stationary algorithms are ill-suited to concept-drifting data, necessitating more targeted methods. Tree-based methods are a popular approach to this problem, traditionally focussing on the use of the Hoeffding bound to guarantee performance relative to a stationary scenario. However, few single learners are available for regression scenarios, and those that exist often struggle to choose between similarly discriminative splits, leading to longer training times and worse performance. This limited pool of single learners in turn hampers the performance of ensemble approaches in which they act as base learners. In this thesis we seek to fill this gap in the literature, developing methods which focus on increasing randomization to both improve predictive performance and reduce the training times of tree-based ensemble methods. In particular, we investigate randomization because it is known to improve generalization error in ensembles and is also expected to lead to fast training times, making it a natural way of handling the problems typically experienced by single learners. We begin in a regression scenario, introducing the Adaptive Trees for Streaming with Extreme Randomization (ATSER) algorithm, a partially randomized approach based on the concept of Extremely Randomized (extra) trees. The ATSER algorithm incrementally trains trees, using the Hoeffding bound to select the best of a random selection of splits. Simultaneously, the trees also detect and adapt to changes in the data stream. Unlike many traditional streaming algorithms, ATSER trees can easily be extended to include nominal features. We find that, compared to other contemporary methods, ensembles of ATSER trees lead to improved predictive performance whilst also reducing run times. We then demonstrate the Adaptive Categorisation Trees for Streaming with Extreme Randomization (ACTSER) algorithm, an adaptation of the ATSER algorithm to the more traditional categorization scenario, again showing improved predictive performance and reduced run times. The inclusion of nominal features is particularly novel in this setting, since typical categorization approaches struggle to handle them. Finally, we examine a completely randomized scenario, where an ensemble of trees is generated prior to having access to the data stream, while also considering multivariate splits in addition to the traditional axis-aligned approach. We find that, by combining a forgetting mechanism in linear models with dynamic weighting of ensemble members, we are able to avoid explicitly testing for concept drift. This leads to fast ensembles with strong predictive performance, whilst also requiring fewer parameters than other contemporary methods. For each of the methods proposed in this thesis, we demonstrate empirically that they are effective over a variety of non-stationary data streams, including multiple types of concept drift. Furthermore, in comparison to other contemporary data streaming algorithms, we find that the biggest improvements in performance are on noisy data streams.
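
For readers unfamiliar with the Hoeffding bound mentioned in this abstract, the following is a minimal sketch of how such a bound is commonly used to decide between candidate splits in streaming trees. It is not the ATSER implementation: the function names `hoeffding_bound` and `should_split`, the merit values, and the parameter choices are all assumptions made for illustration.

```python
import math


def hoeffding_bound(value_range: float, delta: float, n: int) -> float:
    """Return epsilon such that, with probability 1 - delta, the true mean of a
    variable with the given range lies within epsilon of the mean of n samples."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def should_split(best_merit: float, second_merit: float,
                 value_range: float, delta: float, n: int) -> bool:
    """Hypothetical split test: accept the best candidate split only when its
    observed advantage over the runner-up exceeds the Hoeffding bound."""
    return (best_merit - second_merit) > hoeffding_bound(value_range, delta, n)


# With few samples the bound is wide, so the same gap is not yet trusted.
print(should_split(0.5, 0.3, value_range=1.0, delta=1e-7, n=10))
print(should_split(0.5, 0.3, value_range=1.0, delta=1e-7, n=1000))
```

The bound shrinks as more samples arrive, which is why two similarly discriminative splits can delay a decision for a long time, the difficulty the abstract describes.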

    Transfer Topic Labeling with Domain-Specific Knowledge Base: An Analysis of UK House of Commons Speeches 1935-2014

    Topic models are widely used in natural language processing, allowing researchers to estimate the underlying themes in a collection of documents. Most topic models use unsupervised methods and hence require the additional step of attaching meaningful labels to estimated topics. This process of manual labeling is not scalable and suffers from human bias. We present a semi-automatic transfer topic labeling method that seeks to remedy these problems. Domain-specific codebooks form the knowledge base for automated topic labeling. We demonstrate our approach with a dynamic topic model analysis of the complete corpus of UK House of Commons speeches 1935-2014, using the coding instructions of the Comparative Agendas Project to label topics. We show that our method works well for a majority of the topics we estimate, but we also find that institution-specific topics, in particular on subnational governance, require manual input. We validate our results using human expert coding.

    Identifying and Alleviating Concept Drift in Streaming Tensor Decomposition

    Tensor decompositions are used in various data mining applications, from social networks to medical applications, and are extremely useful in discovering latent structures or concepts in the data. Many real-world applications are dynamic in nature and so are their data. To deal with this dynamic nature of data, there exist a variety of online tensor decomposition algorithms. A central assumption in all those algorithms is that the number of latent concepts remains fixed throughout the entire stream. However, this need not be the case. Every incoming batch in the stream may have a different number of latent concepts, and the difference in latent concepts from one tensor batch to another can provide insights into how our findings in a particular application behave and deviate over time. In this paper, we define "concept" and "concept drift" in the context of streaming tensor decomposition, as the manifestation of the variability of latent concepts throughout the stream. Furthermore, we introduce SeekAndDestroy, an algorithm that detects concept drift in streaming tensor decomposition and is able to produce results robust to that drift. To the best of our knowledge, this is the first work that investigates concept drift in streaming tensor decomposition. We extensively evaluate SeekAndDestroy on synthetic datasets, which exhibit a wide variety of realistic drift. Our experiments demonstrate the effectiveness of SeekAndDestroy, both in the detection of concept drift and in the alleviation of its effects, producing results with similar quality to decomposing the entire tensor in one shot. Additionally, in real datasets, SeekAndDestroy outperforms other streaming baselines, while discovering novel useful components. Comment: 16 Pages, Accepted at ECML-PKDD 201

    Skewed Evolving Data Streams Classification with Actionable Knowledge Extraction using Data Approximation and Adaptive Classification Framework

    Skewed evolving data stream (SEDS) classification is a challenging research problem for online streaming data applications. The fundamental challenges in streaming data classification are class imbalance and concept drift. Although these two topics have recently received considerable attention, either independently or together, data redundancy during stream data mining and classification remains unexplored. Moreover, existing solutions for the classification of SEDSs have focused on solving the concept drift and/or class imbalance problems using the sliding window mechanism, which leads to higher computational complexity and data redundancy. To this end, we propose a novel Adaptive Data Stream Classification (ADSC) framework for solving the concept drift, class imbalance, and data redundancy problems with higher computational and classification efficiency. Data approximation, adaptive clustering, classification, and actionable knowledge extraction are the major phases of ADSC. To approximate the unique items in the data stream, with data pre-processing, during the data approximation phase, we develop the Flajolet Martin (FM) algorithm. The periodically approximated tuples are grouped into distinct classes using an adaptive clustering algorithm to address the problems of concept drift and class imbalance. In the classification phase, supervised classifiers are employed to classify the unknown incoming data streams into one of the classes discovered by the adaptive clustering algorithm. We then extract actionable knowledge from the classified skewed evolving data stream information for the end user's decision-making process. The ADSC framework is empirically assessed on two streaming datasets in terms of classification and computing efficiency. The experimental results show that the proposed ADSC framework is more efficient than existing classification methods.
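
As background for the Flajolet Martin (FM) algorithm named in this abstract, here is a minimal sketch of FM-style distinct-count estimation. It is not the ADSC data approximation phase: the function names, the use of SHA-256 as a stand-in hash family, the 32-bit hash width, and the choice of 64 hash functions are all assumptions for illustration. The correction constant 0.77351 is the one from the original FM analysis.

```python
import hashlib

PHI = 0.77351  # bias correction constant from the Flajolet-Martin analysis


def trailing_zeros(x: int, width: int = 32) -> int:
    """Number of trailing zero bits in x (width if x is zero)."""
    if x == 0:
        return width
    return (x & -x).bit_length() - 1


def lowest_zero_bit(bitmap: int) -> int:
    """Index of the lowest bit that is not set in bitmap."""
    r = 0
    while bitmap & (1 << r):
        r += 1
    return r


def fm_estimate(items, num_hashes: int = 64) -> float:
    """Estimate the number of distinct items using one bitmap per hash function."""
    bitmaps = [0] * num_hashes
    for item in items:
        for i in range(num_hashes):
            # Salting SHA-256 with i stands in for a family of hash functions.
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            h = int.from_bytes(digest[:4], "big")
            bitmaps[i] |= 1 << trailing_zeros(h)
    mean_r = sum(lowest_zero_bit(b) for b in bitmaps) / num_hashes
    return 2 ** mean_r / PHI


stream = list(range(1000)) * 2  # 1000 distinct values, each seen twice
print(round(fm_estimate(stream)))
```

Because each bitmap only records which bit positions have been seen, repeated items leave the estimate unchanged, which is what makes this family of sketches attractive for reducing redundancy in a stream.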

    Towards Supercomputing Categorizing the Maliciousness upon Cybersecurity Blacklists with Concept Drift

    In this article, we carry out a case study to optimize the classification of the maliciousness of cybersecurity events by IP address using machine learning techniques. The optimization focuses on time complexity. First, we use the extreme gradient boosting model; second, we parallelize the machine learning algorithm to study the effect of using different numbers of cores. We classify the maliciousness of cybersecurity events in a biclass and a multiclass scenario. All the experiments are carried out with a well-known optimal set of features: the geolocation information of the IP address. However, the geolocation features of an IP address can change over time. Also, the relation between an IP address and its maliciousness label can change if the address is tested several times. The models' performance could then degrade because the information acquired from training on past samples may not generalize well to new samples. This situation is known as concept drift. For this reason, it is necessary to study whether the proposed optimization works in a concept drift scenario. The results show that concept drift does not degrade the models. Also, boosting algorithms achieve competitive or better performance compared to similar research works in the biclass scenario, and an effective categorization in the multiclass case. In terms of high-performance computing resources, the most efficient setting is reached using five nodes. Partial support was received from the Spanish National Cybersecurity Institute (INCIBE) under the contract art (83, 203 key: X54

    Robust Machine Learning for Malware Detection over Time

    The presence and persistence of Android malware is an ongoing threat that plagues this information era, and machine learning technologies are now extensively used to deploy more effective detectors that can block the majority of these malicious programs. However, these algorithms have not been developed to follow the natural evolution of malware, and their performance degrades significantly over time because of such concept drift. Currently, state-of-the-art techniques only focus on detecting the presence of such drift, or they address it by relying on frequent model updates. Hence, there is a lack of knowledge regarding the cause of the concept drift, and ad hoc solutions that can counter the passing of time are still underinvestigated. In this work, we begin to address these issues by proposing (i) a drift-analysis framework to identify which characteristics of the data are causing the drift, and (ii) SVM-CB, a time-aware classifier that leverages the drift-analysis information to slow down the performance drop. We highlight the efficacy of our contribution by comparing its degradation over time with that of a state-of-the-art classifier, and we show that SVM-CB better withstands the distribution changes that naturally characterize the malware domain. We conclude by discussing the limitations of our approach and how our contribution can be taken as a first step towards more time-resistant classifiers that not only tackle, but also understand, the concept drift that affects data.