2 research outputs found
Recommended from our members
A heterogeneous online learning ensemble for non-stationary environments
Learning in non-stationary environments is a challenging task which requires the updating of predictive models to deal with changes in the underlying probability distribution of the problem, i.e., dealing with concept drift. Most work in this area is concerned with updating the learning system so that it can quickly recover from concept drift, while little work has been dedicated to investigating what type of predictive model is most suitable at any given time. This paper aims to investigate the benefits of online model selection for predictive modelling in non-stationary environments. A novel heterogeneous ensemble approach is proposed to intelligently switch between different types of base models in an ensemble to increase the predictive performance of online learning in non-stationary environments. This approach is Heterogeneous Dynamic Weighted Majority (HDWM). It makes use of âseedâ learners of different types to maintain ensemble diversity, overcoming problems of existing dynamic ensembles that may undergo loss of diversity due to the exclusion of base learners. The algorithm has been evaluated on artificial and real-world data streams against existing well-known approaches such as a heterogeneous Weighted Majority Algorithm (WMA) and a homogeneous Dynamic Weighted Majority (DWM). The results show that HDWM performed significantly better than WMA in non-stationary environments. Also, when recurring concept drifts were present, the predictive performance of HDWM showed an improvement over DWM
Recommended from our members
Online semi-supervised learning in non-stationary environments
Existing Data Stream Mining (DSM) algorithms assume the availability of labelled and
balanced data, immediately or after some delay, to extract worthwhile knowledge from the
continuous and rapid data streams. However, in many real-world applications such as
Robotics, Weather Monitoring, Fraud Detection Systems, Cyber Security, and Computer
Network Traffic Flow, an enormous amount of high-speed data is generated by Internet of
Things sensors and real-time data on the Internet. Manual labelling of these data streams
is not practical due to time consumption and the need for domain expertise. Another
challenge is learning under Non-Stationary Environments (NSEs), which occurs due to
changes in the data distributions in a set of input variables and/or class labels. The problem
of Extreme Verification Latency (EVL) under NSEs is referred to as Initially Labelled Non-Stationary Environment (ILNSE). This is a challenging task because the learning algorithms
have no access to the true class labels directly when the concept evolves. Several approaches
exist that deal with NSE and EVL in isolation. However, few algorithms address both issues
simultaneously. This research directly responds to ILNSEâs challenge in proposing two
novel algorithms âPredictor for Streaming Data with Scarce Labelsâ (PSDSL) and
Heterogeneous Dynamic Weighted Majority (HDWM) classifier. PSDSL is an Online Semi-Supervised Learning (OSSL) method for real-time DSM and is closely related to label
scarcity issues in online machine learning.
The key capabilities of PSDSL include learning from a small amount of labelled data in an
incremental or online manner and being available to predict at any time. To achieve this,
PSDSL utilises both labelled and unlabelled data to train the prediction models, meaning it
continuously learns from incoming data and updates the model as new labelled or
unlabelled data becomes available over time. Furthermore, it can predict under NSE
conditions under the scarcity of class labels. PSDSL is built on top of the HDWM classifier,
which preserves the diversity of the classifiers. PSDSL and HDWM can intelligently switch
and adapt to the conditions. The PSDSL adapts to learning states between self-learning,
micro-clustering and CGC, whichever approach is beneficial, based on the characteristics of
the data stream. HDWM makes use of âseedâ learners of different types in an ensemble to
maintain its diversity. The ensembles are simply the combination of predictive models
grouped to improve the predictive performance of a single classifier.
PSDSL is empirically evaluated against COMPOSE, LEVELIW, SCARGC and MClassification
on benchmarks, NSE datasets as well as Massive Online Analysis (MOA) data streams and real-world datasets. The results showed that PSDSL performed significantly better than
existing approaches on most real-time data streams including randomised data instances.
PSDSL performed significantly better than âStaticâ i.e. the classifier is not updated after it is
trained with the first examples in the data streams. When applied to MOA-generated data
streams, PSDSL ranked highest (1.5) and thus performed significantly better than SCARGC,
while SCARGC performed the same as the Static. PSDSL achieved better average prediction
accuracies in a short time than SCARGC.
The HDWM algorithm is evaluated on artificial and real-world data streams against existing
well-known approaches such as the heterogeneous WMA and the homogeneous Dynamic
DWM algorithm. The results showed that HDWM performed significantly better than WMA
and DWM. Also, when recurring concept drifts were present, the predictive performance of
HDWM showed an improvement over DWM. In both drift and real-world streams,
significance tests and post hoc comparisons found significant differences between
algorithms, HDWM performed significantly better than DWM and WMA when applied to
MOA data streams and 4 real-world datasets Electric, Spam, Sensor and Forest cover. The
seeding mechanism and dynamic inclusion of new base learners in the HDWM algorithms
benefit from the use of both forgetting and retaining the models. The algorithm also
provides the independence of selecting the optimal base classifier in its ensemble depending
on the problem.
A new approach, Envelope-Clustering is introduced to resolve the cluster overlap conflicts
during the cluster labelling process. In this process, PSDSL transforms the centroidsâ
information of micro-clusters into micro-instances and generates new clusters called
Envelopes. The nearest envelope clusters assist the conflicted micro-clusters and
successfully guide the cluster labelling process after the concept drifts in the absence of true
class labels. PSDSL has been evaluated on real-world problem âkeystroke dynamicsâ, and
the results show that PSDSL achieved higher prediction accuracy (85.3%) and SCARGC
(81.6%), while the Static (49.0%) significantly degrades the performance due to changes in
the users typing pattern. Furthermore, the predictive accuracies of SCARGC are found
highly fluctuated between (41.1% to 81.6%) based on different values of parameter âkâ
(number of clusters), while PSDSL automatically determine the best values for this
parameter