85 research outputs found
Learning from Data Streams with Randomized Forests
Non-stationary streaming data poses a familiar challenge in machine learning: the need to
obtain fast and accurate predictions. A data stream is a continuously generated sequence of
data, with data typically arriving rapidly. They are often characterised by a non-stationary
generative process, with concept drift occurring as the process changes. Such processes are
commonly seen in the real world, such as in advertising, shopping trends, environmental
conditions, electricity monitoring and traffic monitoring.
Typical stationary algorithms are ill-suited for use with concept drifting data, thus necessitating
more targeted methods. Tree-based methods are a popular approach to this problem,
traditionally focussing on the use of the Hoeffding bound in order to guarantee performance
relative to a stationary scenario. However, there are limited single learners available for
regression scenarios, and those that do exist often struggle to choose between similarly
discriminative splits, leading to longer training times and worse performance. This limited
pool of single learners in turn hampers the performance of ensemble approaches in which
they act as base learners.
In this thesis we seek to remedy this gap in the literature, developing methods which
focus on increasing randomization to both improve predictive performance and reduce the
training times of tree-based ensemble methods. In particular, we have chosen to investigate
the use of randomization as it is known to be able to improve generalization error in
ensembles, and is also expected to lead to fast training times, thus being a natural method
of handling the problems typically experienced by single learners.
We begin in a regression scenario, introducing the Adaptive Trees for Streaming with
Extreme Randomization (ATSER) algorithm; a partially randomized approach based on
the concept of Extremely Randomized (extra) trees. The ATSER algorithm incrementally
trains trees, using the Hoeffding bound to select the best of a random selection of splits.
Simultaneously, the trees also detect and adapt to changes in the data stream. Unlike many
traditional streaming algorithms ATSER trees can easily be extended to include nominal
features. We find that compared to other contemporary methods ensembles of ATSER
trees lead to improved predictive performance whilst also reducing run times.
We then demonstrate the Adaptive Categorisation Trees for Streaming with Extreme
Randomization (ACTSER) algorithm, an adaption of the ATSER algorithm to the more
traditional categorization scenario, again showing improved predictive performance and
reduced runtimes. The inclusion of nominal features is particularly novel in this setting
since typical categorization approaches struggle to handle them.
Finally we examine a completely randomized scenario, where an ensemble of trees is generated
prior to having access to the data stream, while also considering multivariate splits
in addition to the traditional axis-aligned approach. We find that through the combination
of a forgetting mechanism in linear models and dynamic weighting for ensemble members,
we are able to avoid explicitly testing for concept drift. This leads to fast ensembles
with strong predictive performance, whilst also requiring fewer parameters than other
contemporary methods.
For each of the proposed methods in this thesis, we demonstrate empirically that they are
effective over a variety of different non-stationary data streams, including on multiple
types of concept drift. Furthermore, in comparison to other contemporary data streaming
algorithms, we find the biggest improvements in performance are on noisy data streams.Engineers Gat
Transfer Topic Labeling with Domain-Specific Knowledge Base: An Analysis of UK House of Commons Speeches 1935-2014
Topic models are widely used in natural language processing, allowing
researchers to estimate the underlying themes in a collection of documents.
Most topic models use unsupervised methods and hence require the additional
step of attaching meaningful labels to estimated topics. This process of manual
labeling is not scalable and suffers from human bias. We present a
semi-automatic transfer topic labeling method that seeks to remedy these
problems. Domain-specific codebooks form the knowledge-base for automated topic
labeling. We demonstrate our approach with a dynamic topic model analysis of
the complete corpus of UK House of Commons speeches 1935-2014, using the coding
instructions of the Comparative Agendas Project to label topics. We show that
our method works well for a majority of the topics we estimate; but we also
find that institution-specific topics, in particular on subnational governance,
require manual input. We validate our results using human expert coding
Identifying and Alleviating Concept Drift in Streaming Tensor Decomposition
Tensor decompositions are used in various data mining applications from
social network to medical applications and are extremely useful in discovering
latent structures or concepts in the data. Many real-world applications are
dynamic in nature and so are their data. To deal with this dynamic nature of
data, there exist a variety of online tensor decomposition algorithms. A
central assumption in all those algorithms is that the number of latent
concepts remains fixed throughout the entire stream. However, this need not be
the case. Every incoming batch in the stream may have a different number of
latent concepts, and the difference in latent concepts from one tensor batch to
another can provide insights into how our findings in a particular application
behave and deviate over time. In this paper, we define "concept" and "concept
drift" in the context of streaming tensor decomposition, as the manifestation
of the variability of latent concepts throughout the stream. Furthermore, we
introduce SeekAndDestroy, an algorithm that detects concept drift in streaming
tensor decomposition and is able to produce results robust to that drift. To
the best of our knowledge, this is the first work that investigates concept
drift in streaming tensor decomposition. We extensively evaluate SeekAndDestroy
on synthetic datasets, which exhibit a wide variety of realistic drift. Our
experiments demonstrate the effectiveness of SeekAndDestroy, both in the
detection of concept drift and in the alleviation of its effects, producing
results with similar quality to decomposing the entire tensor in one shot.
Additionally, in real datasets, SeekAndDestroy outperforms other streaming
baselines, while discovering novel useful components.Comment: 16 Pages, Accepted at ECML-PKDD 201
Skewed Evolving Data Streams Classification with Actionable Knowledge Extraction using Data Approximation and Adaptive Classification Framework
Skewed evolving data stream (SEDS) classification is a challenging research problem for online streaming data applications. The fundamental challenges in streaming data classification are class imbalance and concept drift. However, recently, either independently or together, the two topics have received enough attention; the data redundancy while performing stream data mining and classification remains unexplored. Moreover, the existing solutions for the classification of SEDSs have focused on solving concept drift and/or class imbalance problems using the sliding window mechanism, which leads to higher computational complexity and data redundancy problems. To end this, we propose a novel Adaptive Data Stream Classification (ADSC) framework for solving the concept drift, class imbalance, and data redundancy problems with higher computational and classification efficiency. Data approximation, adaptive clustering, classification, and actionable knowledge extraction are the major phases of ADSC. For the purpose of approximating unique items in the data stream with data pre-processing during the data approximation phase, we develop the Flajolet Martin (FM) algorithm. The periodically approximated tuples are grouped into distinct classes using an adaptive clustering algorithm to address the problem of concept drift and class imbalance. In the classification phase, the supervised classifiers are employed to classify the unknown incoming data streams into either of the classes discovered by the adaptive clustering algorithm. We then extract the actionable knowledge using classified skewed evolved data stream information for the end user decision-making process. The ADSC framework is empirically assessed utilizing two streaming datasets regarding classification and computing efficiency factors. The experimental results shows the better efficiency of the proposed ADSC framework as compared with existing classification methods
Towards Supercomputing Categorizing the Maliciousness upon Cybersecurity Blacklists with Concept Drift
[EN] In this article, we have carried out a case study to optimize the classification of the maliciousness of cybersecurity events by IP addresses using machine learning techniques. The optimization is studied focusing on time complexity. Firstly, we have used the extreme gradient boosting model, and secondly, we have parallelized the machine learning algorithm to study the effect of using a different number of cores for the problem. We have classified the cybersecurity events' maliciousness in a biclass and a multiclass scenario. All the experiments have been carried out with a well-known optimal set of features: the geolocation information of the IP address. However, the geolocation features of an IP address can change over time. Also, the relation between the IP address and its label of maliciousness can be modified if we test the address several times. Then, the models' performance could degrade because the information acquired from training on past samples may not generalize well to new samples. This situation is known as concept drift. For this reason, it is necessary to study if the optimization proposed works in a concept drift scenario. The results show that the concept drift does not degrade the models. Also, boosting algorithms achieving competitive or better performance compared to similar research works for the biclass scenario and an effective categorization for the multiclass case. The best efficient setting is reached using five nodes regarding high-performance computation resources.SIInstituto Nacional de SeguridadPartial support was received from the Spanish National Cybersecurity Institute (INCIBE) under the contract art (83, 203 key: X54
Robust Machine Learning for Malware Detection over Time
The presence and persistence of Android malware is an on-going threat that plagues this information era, and machine learning technologies are now extensively used to deploy more effective detectors that can block the majority of these malicious programs. However, these algorithms have not been developed to pursue the natural evolution of malware, and their performances significantly degrade over time because of such concept-drift.
Currently, state-of-the-art techniques only focus on detecting the presence of such drift, or they address it by relying on frequent updates of models. Hence, there is a lack of knowledge regarding the cause of the concept drift, and ad-hoc solutions that can counter the passing of time are still underinvestigated.
In this work, we commence to address these issues as we propose (i) a drift-analysis framework to identify which characteristics of data are causing the drift, and (ii) SVM-CB, a time-aware classifier that leverages the drift-analysis information to slow down the performance drop. We highlight the efficacy of our contribution by comparing its degradation over time with a state-of-the-art classifier, and we show that SVM-CB better withstand the distribution changes that naturally characterizes the malware domain.
We conclude by discussing the limitations of our approach and how our contribution can be taken as a first step towards more time-resistant classifiers that not only tackle, but also understand the concept drift that affect data
- …