Process-Oriented Stream Classification Pipeline: A Literature Review
Featured Application: Nowadays, many applications and disciplines work on the basis of stream data. Common examples are the IoT sector (e.g., sensor data analysis) and video, image, and text analysis applications (e.g., in social media analytics or astronomy). With our work, we gather different approaches and terminology and give a broad overview of the topic. Our main target groups are practitioners and newcomers to the field of data stream classification. Due to the rise of continuous data-generating applications, analyzing data streams has gained increasing attention over the past decades. A core research area in stream data is stream classification, which categorizes or detects data points within an evolving stream of observations. Areas of stream classification are diverse, ranging from monitoring sensor data to analyzing a wide range of (social) media applications. Research in stream classification is concerned with developing methods that adapt to the changing and potentially volatile data stream. It focuses on individual aspects of the stream classification pipeline, e.g., designing suitable algorithm architectures, efficient training and testing procedures, or detecting so-called concept drifts. As a result of the many different research questions and strands, the field is challenging to grasp, especially for beginners. This survey explores, summarizes, and categorizes work within the domain of stream classification and identifies core research threads of the past few years. It is structured along the stream classification process to facilitate orientation within this complex topic, including common application scenarios and benchmarking data sets. Thus, both newcomers to the field and experts who want to widen their scope can gain (additional) insight into this research area and find starting points and pointers to more in-depth literature on specific issues and research directions in the field.
Learning in Dynamic Data-Streams with a Scarcity of Labels
Analysing data in real-time is a natural and necessary progression from traditional data mining. However, real-time analysis presents additional challenges to batch analysis; along with strict time and memory constraints, change is a major consideration. In a dynamic stream there is an assumption that the underlying process generating the stream is non-stationary and that concepts within the stream will drift and change over time. Adopting the false assumption that a stream is stationary will result in non-adaptive models degrading and eventually becoming obsolete. The challenge of recognising and reacting to change in a stream is compounded by the scarcity-of-labels problem. This refers to the very realistic situation in which the true class label of an incoming point is not immediately available (or will never be available), or in which manually labelling incoming points is prohibitively expensive. The goal of this thesis is to evaluate unsupervised learning as the basis for online classification in dynamic data-streams with a scarcity of labels. To realise this goal, a novel stream clustering algorithm based on the collective behaviour of ants, Ant Colony Stream Clustering (ACSC), is proposed. This algorithm is shown to be faster and more accurate than comparable peer stream-clustering algorithms while requiring fewer sensitive parameters. The principles of ACSC are extended in a second stream-clustering algorithm named Multi-Density Stream Clustering (MDSC). This algorithm has adaptive parameters and, crucially, can track clusters and monitor their dynamic behaviour over time. A novel technique called a Dynamic Feature Mask (DFM) is proposed to "sit on top" of these stream-clustering algorithms and can be used to observe and track change at the feature level in a data stream. This feature mask acts as an unsupervised feature selection method, allowing high-dimensional streams to be clustered.
Finally, data-stream clustering is evaluated as an approach to one-class classification, and a novel framework (named COCEL: Clustering and One-class Classification Ensemble Learning) for classification in dynamic streams with a scarcity of labels is described. The proposed framework can identify and react to change in a stream and greatly reduces the number of required labels (typically less than 0.05% of the entire stream).
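The clustering-then-one-class idea behind COCEL can be illustrated with a minimal sketch: cluster the (largely unlabelled) data, then accept a new point only if it falls near an existing cluster centre. Everything below (names, the plain k-means routine, the fixed radius) is invented for the illustration and is not the COCEL implementation:

```python
# Illustrative sketch: one-class classification via clustering.
# Points near any cluster centre of the "normal" data are accepted;
# everything else is rejected as out-of-class.

def dist2(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(group):
    """Component-wise mean of a non-empty list of tuples."""
    n = len(group)
    return tuple(sum(p[d] for p in group) / n for d in range(len(group[0])))

def cluster_centroids(points, k, iters=20):
    """Plain Lloyd's-algorithm k-means, seeded with the first k points."""
    centroids = list(points[:k])
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda c: dist2(p, centroids[c]))
            groups[i].append(p)
        # Keep the old centroid if a cluster received no points.
        centroids = [mean(g) if g else centroids[i]
                     for i, g in enumerate(groups)]
    return centroids

def one_class_predict(point, centroids, radius):
    """Accept the point if it lies within `radius` of any centroid."""
    return min(dist2(point, c) for c in centroids) <= radius ** 2

normal = [(0.0, 0.0), (0.1, 0.2), (-0.1, 0.1), (5.0, 5.0), (5.2, 4.9)]
cents = cluster_centroids(normal, k=2)
print(one_class_predict((0.05, 0.05), cents, radius=1.0))  # True: in-class
print(one_class_predict((10.0, 10.0), cents, radius=1.0))  # False: outlier
```

The radius threshold plays the role a learned decision boundary would play in a full system; in a streaming setting the centroids themselves would also be updated incrementally.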
Online semi-supervised learning in non-stationary environments
Existing Data Stream Mining (DSM) algorithms assume the availability of labelled and
balanced data, immediately or after some delay, to extract worthwhile knowledge from the
continuous and rapid data streams. However, in many real-world applications such as
Robotics, Weather Monitoring, Fraud Detection Systems, Cyber Security, and Computer
Network Traffic Flow, an enormous amount of high-speed data is generated by Internet of
Things sensors and real-time data on the Internet. Manual labelling of these data streams
is not practical due to time consumption and the need for domain expertise. Another
challenge is learning under Non-Stationary Environments (NSEs), which occurs due to
changes in the data distributions in a set of input variables and/or class labels. The problem
of Extreme Verification Latency (EVL) under NSEs is referred to as Initially Labelled Non-Stationary Environment (ILNSE). This is a challenging task because the learning algorithms
have no access to the true class labels directly when the concept evolves. Several approaches
exist that deal with NSE and EVL in isolation. However, few algorithms address both issues
simultaneously. This research directly responds to ILNSE's challenge by proposing two
novel algorithms: the 'Predictor for Streaming Data with Scarce Labels' (PSDSL) and the
Heterogeneous Dynamic Weighted Majority (HDWM) classifier. PSDSL is an Online Semi-Supervised Learning (OSSL) method for real-time DSM and is closely related to label
scarcity issues in online machine learning.
The key capabilities of PSDSL include learning from a small amount of labelled data in an
incremental or online manner and being able to predict at any time. To achieve this,
PSDSL utilises both labelled and unlabelled data to train the prediction models, meaning it
continuously learns from incoming data and updates the model as new labelled or
unlabelled data becomes available over time. Furthermore, it can predict under NSE
conditions under the scarcity of class labels. PSDSL is built on top of the HDWM classifier,
which preserves the diversity of the classifiers. PSDSL and HDWM can intelligently switch
and adapt to the conditions. PSDSL switches between self-learning,
micro-clustering, and CGC learning states, whichever approach is beneficial, based on the characteristics of
the data stream. HDWM makes use of âseedâ learners of different types in an ensemble to
maintain its diversity. The ensembles are simply the combination of predictive models
grouped to improve the predictive performance of a single classifier.
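The weighted-majority principle underlying ensembles such as HDWM and DWM can be sketched in a few lines. This is a generic scheme in the spirit of the classic Weighted Majority Algorithm; the penalty factor and labels are invented for the example, and this is not the HDWM implementation itself:

```python
# Generic weighted-majority voting sketch: each base learner votes with a
# weight, and weights are multiplicatively penalised on mistakes.

BETA = 0.5  # penalty factor applied to a learner's weight on a mistake

def weighted_vote(predictions, weights):
    """Combine base-learner predictions by summing weights per class."""
    scores = {}
    for pred, w in zip(predictions, weights):
        scores[pred] = scores.get(pred, 0.0) + w
    return max(scores, key=scores.get)

def update_weights(predictions, weights, true_label):
    """Halve the weight of every learner that predicted the wrong class."""
    return [w * (BETA if pred != true_label else 1.0)
            for pred, w in zip(predictions, weights)]

weights = [1.0, 1.0, 1.0]           # three heterogeneous base learners
preds = ["spam", "ham", "spam"]     # their predictions for one instance
print(weighted_vote(preds, weights))            # "spam" (score 2.0 vs 1.0)
weights = update_weights(preds, weights, "ham") # true label was "ham"
print(weights)                                  # [0.5, 1.0, 0.5]
```

Persistently wrong learners are quickly outvoted, while accurate learners of any type come to dominate the decision, which is what allows a heterogeneous ensemble to retain its diversity without sacrificing accuracy.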
PSDSL is empirically evaluated against COMPOSE, LEVELIW, SCARGC and MClassification
on benchmark NSE datasets, as well as on Massive Online Analysis (MOA) data streams and real-world datasets. The results showed that PSDSL performed significantly better than
existing approaches on most real-time data streams including randomised data instances.
PSDSL performed significantly better than 'Static', i.e., a classifier that is not updated after being
trained on the first examples in the data stream. When applied to MOA-generated data
streams, PSDSL achieved the highest average rank (1.5) and thus performed significantly better than SCARGC,
while SCARGC performed the same as Static. PSDSL also achieved better average prediction
accuracies in a shorter time than SCARGC.
The HDWM algorithm is evaluated on artificial and real-world data streams against existing
well-known approaches such as the heterogeneous Weighted Majority Algorithm (WMA) and the
homogeneous Dynamic Weighted Majority (DWM) algorithm. The results showed that HDWM performed significantly better than WMA
and DWM. Also, when recurring concept drifts were present, the predictive performance of
HDWM showed an improvement over DWM. In both drift and real-world streams,
significance tests and post hoc comparisons found significant differences between the
algorithms: HDWM performed significantly better than DWM and WMA when applied to
MOA data streams and four real-world datasets (Electric, Spam, Sensor, and Forest Cover). The
seeding mechanism and dynamic inclusion of new base learners in HDWM
benefit from both forgetting and retaining models. The algorithm can
also independently select the optimal base classifier for its ensemble depending
on the problem.
A new approach, Envelope-Clustering, is introduced to resolve cluster overlap conflicts
during the cluster labelling process. In this process, PSDSL transforms the centroids'
information of micro-clusters into micro-instances and generates new clusters called
Envelopes. The nearest envelope clusters assist the conflicted micro-clusters and
successfully guide the cluster labelling process after the concept drifts in the absence of true
class labels. PSDSL has been evaluated on the real-world problem of keystroke dynamics, and
the results show that PSDSL achieved higher prediction accuracy (85.3%) than SCARGC
(81.6%), while Static (49.0%) degraded significantly due to changes in
the users' typing patterns. Furthermore, the predictive accuracies of SCARGC were found to
fluctuate widely (between 41.1% and 81.6%) depending on the value of the parameter 'k'
(number of clusters), while PSDSL automatically determines the best value for this
parameter.
Adaptive Algorithms For Classification On High-Frequency Data Streams: Application To Finance
International Doctorate Mention. In recent years, the problem of concept drift has gained importance in the financial
domain. The succession of manias, panics, and crashes has stressed the non-stationary
nature and the likelihood of drastic structural changes in financial markets.
The most recent literature suggests the use of conventional machine learning and statistical
approaches for this task. However, these techniques are unable or slow to adapt
to non-stationarities and may require re-training over time, which is computationally
expensive and brings financial risks.
This thesis proposes a set of adaptive algorithms to deal with high-frequency data
streams and applies these to the financial domain. We present approaches to handle
different types of concept drifts and perform predictions using up-to-date models.
These mechanisms are designed to provide fast reaction times and are thus applicable
to high-frequency data. The core experiments of this thesis are based on the prediction
of the price movement direction at different intraday resolutions in the SPDR S&P 500
exchange-traded fund. The proposed algorithms are benchmarked against other popular
methods from the data stream mining literature and achieve competitive results.
We believe that this thesis opens good research prospects for financial forecasting
during market instability and structural breaks. Results have shown that our proposed
methods can improve prediction accuracy in many of these scenarios. Indeed, the
results obtained are compatible with arguments against the efficient market hypothesis.
However, we cannot claim to consistently beat buy-and-hold; therefore, we
cannot reject it. Doctoral Programme in Computer Science and Technology (Programa de Doctorado en Ciencia y Tecnología Informática), Universidad Carlos III de Madrid. Committee: Gustavo Recio Isasi (President), Pedro Isasi Viñuela (Secretary), Sandra García Rodríguez (Examiner).
Adaptive Automated Machine Learning
The ever-growing demand for machine learning has led to the development of automated machine learning (AutoML) systems that can be used off the shelf by non-experts. Furthermore, the demand for ML applications with high predictive performance exceeds the number of machine learning experts, making the development of AutoML systems necessary. Automated machine learning tackles the problem of finding machine learning models with high predictive performance. Existing approaches incorporating deep learning techniques assume that all data is available at the beginning of the training process (offline learning). They configure and optimise a pipeline of preprocessing, feature engineering, and model selection by choosing suitable hyperparameters in each step of the model pipeline. Furthermore, they assume that the user is fully aware of the choice and, thus, the consequences of the underlying metric (such as precision, recall, or F1-measure). By varying this metric, the search for suitable configurations, and thus the adaptation of algorithms, can be tailored to the user's needs. With a vast amount of data created every day from all kinds of sources, our capability to process and understand these data sets in a single batch is no longer viable. By training machine learning models incrementally (i.e., online learning), the flood of data can be processed sequentially within data streams. However, if one assumes an online learning scenario, where an AutoML instance executes on evolving data streams, the question of the best model and its configuration remains open.
In this work, we address the adaptation of AutoML in an offline learning scenario toward a certain utility an end-user might pursue as well as the adaptation of AutoML towards evolving data streams in an online learning scenario with three main contributions:
1. We propose a system that allows the adaptation of AutoML and the search for neural architectures towards a particular utility an end-user might pursue.
2. We introduce an online deep learning framework that fosters the research of deep learning models under the online learning assumption and enables the automated search for neural architectures.
3. We introduce an online AutoML framework that allows the incremental adaptation of ML models.
We evaluate the contributions individually, in accordance with predefined requirements and against state-of-the-art evaluation setups. The outcomes lead us to conclude that (i) AutoML systems, as well as systems for neural architecture search, can be steered towards individual utilities by learning a designated ranking model from pairwise preferences and using the latter as the target function in the offline learning scenario; (ii) architecturally small neural networks are in general suitable in an online learning scenario; and (iii) the configuration of machine learning pipelines can automatically be adapted to ever-evolving data streams, leading to better performance.
Continual learning from stationary and non-stationary data
Continual learning aims at developing models that are capable of working on constantly evolving problems over a long time horizon. In such environments, we can distinguish three essential aspects of training and maintaining machine learning models: incorporating new knowledge, retaining it, and reacting to changes. Each of them poses its own challenges, constituting a compound problem with multiple goals.
Remembering previously incorporated concepts is the main property of a model that is required when dealing with stationary distributions. In non-stationary environments, models should be capable of selectively forgetting outdated decision boundaries and adapting to new concepts. Finally, a significant difficulty can be found in combining these two abilities within a single learning algorithm, since, in such scenarios, we have to balance remembering and forgetting instead of focusing only on one aspect.
The presented dissertation addressed these problems in an exploratory way. Its main goal was to grasp the continual learning paradigm as a whole, analyze its different branches, and tackle identified issues covering various aspects of learning from sequentially incoming data. By doing so, this work not only filled several gaps in current continual learning research but also emphasized the complexity and diversity of the challenges existing in this domain. Comprehensive experiments conducted for all of the presented contributions have demonstrated their effectiveness and substantiated the validity of the stated claims.
Adaptive classifier ensembles for face recognition in video-surveillance
When implementing security systems such as intelligent video surveillance, using face images offers many advantages over other biometric traits. In particular, it allows potential individuals of interest to be detected in a discreet and non-intrusive way, which can be especially advantageous in situations such as watch-list screening, searching archived footage, or face re-identification.
Despite this, face recognition still faces many difficulties specific to video surveillance. Among others, the lack of control over the observed environment implies many variations in lighting conditions, image resolution, motion blur, and face orientation and expression. To recognise individuals, face models are usually generated from a limited number of reference images or videos collected during enrolment sessions. However, since these acquisitions do not necessarily take place under the same observation conditions, the reference data do not always represent the complexity of the real problem. Moreover, although face models can be adapted when new reference data become available, incremental learning based on significantly different data exposes the system to a risk of knowledge corruption. Finally, only part of this knowledge is actually relevant for classifying a given image.
In this thesis, a new system is proposed for the automatic detection of individuals of interest in video surveillance. More specifically, it focuses on a user-centred scenario in which a face recognition system is integrated into a decision-support tool to alert an operator when an individual of interest is detected in video feeds. Such a system must be able to add or remove individuals of interest during operation, as well as update their face models over time with new reference data. To this end, the proposed system relies on concept change detection to guide a learning strategy involving classifier ensembles. Each individual enrolled in the system is represented by an ensemble of two-class classifiers, each specialised in different observation conditions detected in the reference data. In addition, a new rule for the dynamic fusion of classifier ensembles is proposed, using concept models to estimate the relevance of the classifiers with respect to each image to be classified. Finally, faces are tracked from one frame to the next in order to group them into trajectories and accumulate decisions over time.
In Chapter 2, concept change detection is first used to limit the growth in complexity of a template-matching system that adopts a self-updating strategy for its galleries. A new context-sensitive approach is proposed, in which only high-confidence images captured under different observation conditions are used to update the face models. Experiments were conducted on three public face databases, using a standard template-matching system combined with a module for detecting changes in illumination conditions. The results show that the proposed approach reduces the complexity of such systems while maintaining performance over time.
In Chapter 3, a new adaptive system based on classifier ensembles is proposed for face recognition in video surveillance. It is composed of an ensemble of incremental classifiers for each enrolled individual, and relies on concept change detection to refine the face models when new data become available. A hybrid strategy is proposed, in which classifiers are added to the ensembles only when an abrupt change is detected in the reference data. Under gradual change, the associated classifiers are updated instead, refining the knowledge specific to the corresponding concept. A particular implementation of this system is proposed, using ensembles of probabilistic Fuzzy-ARTMAP classifiers generated and updated with a strategy based on dynamic particle swarm optimisation, and using the Hellinger distance between histograms to detect changes. Simulations on the Faces in Action (FIA) video-surveillance database show that the proposed system maintains a high level of performance over time while limiting knowledge corruption. It achieves classification performance superior to a similar passive system (without change detection), as well as to probabilistic kNN and TCM-kNN reference systems.
In Chapter 4, an evolution of the system presented in Chapter 3 is proposed, integrating mechanisms to dynamically adapt the system's behaviour to changing observation conditions during operation. A new fusion rule based on dynamic weighting is proposed, assigning each classifier a weight proportional to its estimated competence with respect to each image to be classified. Moreover, these competences are estimated using the concept models employed during learning for change detection, which reduces the resources required during operation. An evolution of the implementation proposed in Chapter 3 is presented, in which concepts are modelled with the Fuzzy C-Means clustering algorithm and classifier fusion is performed with a weighted average. Experimental simulations on the FIA and Chokepoint video-surveillance databases show that the proposed fusion method achieves better results than the DSOLA dynamic selection method while using considerably fewer computational resources. Moreover, the proposed method shows classification performance superior to probabilistic kNN, TCM-kNN, and Adaptive Sparse Coding reference systems.
Aggregation of Heterogeneous Anomaly Detectors for Cyber-Physical Systems
Distributed, life-critical systems that bridge the gap between software and hardware
are becoming an integral part of our everyday lives. From autonomous cars to smart
electrical grids, such cyber-physical systems will soon be omnipresent. With this comes a
corresponding increase in our vulnerability to cyber-attacks. Monitoring such systems to
detect malicious actions is of critical importance.
One method of monitoring cyber-physical systems is anomaly detection: the process of
detecting when the target system is deviating from expected normal behavior. Anomaly
detection is a vibrant research area with many different viable approaches. The literature
offers many different anomaly detection methods suited to the diversity and volume of data
from cyber-physical systems. We focus on aggregating the results of multiple anomaly
detection methods into a final anomalous or non-anomalous verdict.
In this thesis, we present Palisade, a distributed data collection, anomaly detection,
and aggregation framework for cyber-physical systems. We discuss various methods of
anomaly detection and aggregation and include a case study of anomaly aggregation on a
cyber-physical treadmill driving demonstrator. We conclude with a discussion of lessons
learned from the construction of Palisade, and recommendations for future research.
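The aggregation of heterogeneous detector verdicts described above can be sketched as a k-of-n vote. The detector functions below are illustrative stand-ins for this example, not Palisade's actual detectors:

```python
# Illustrative sketch: combine heterogeneous anomaly detectors' boolean
# verdicts into a single decision via k-of-n voting.

def threshold_detector(limit):
    """Flag readings above a fixed limit."""
    return lambda x: x > limit

def zscore_detector(mean, std, cutoff=3.0):
    """Flag readings more than `cutoff` standard deviations from the mean."""
    return lambda x: abs(x - mean) / std > cutoff

def aggregate(detectors, sample, k):
    """Declare an anomaly when at least k detectors agree."""
    votes = sum(1 for d in detectors if d(sample))
    return votes >= k

detectors = [threshold_detector(100.0),
             zscore_detector(mean=50.0, std=10.0),
             threshold_detector(120.0)]
print(aggregate(detectors, 55.0, k=2))   # False: no detector fires
print(aggregate(detectors, 130.0, k=2))  # True: all three fire
```

Choosing k trades sensitivity against false positives: k=1 flags anything any detector dislikes, while k=n requires unanimity; a framework like Palisade can also weight detectors rather than counting them equally.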
Solving the challenges of concept drift in data stream classification.
The rise of network-connected devices and applications leads to a significant increase in the volume of data that are continuously generated over time, called data streams. In real-world applications, storing the entirety of a data stream for later analysis is often not practical due to its potentially infinite volume. Data stream mining techniques and frameworks have therefore been created to analyze streaming data as they arrive. However, compared to traditional data mining techniques, challenges unique to data stream mining also emerge due to the high arrival rate of data streams and their dynamic nature. In this dissertation, an array of techniques and frameworks is presented to improve solutions to some of these challenges. First, this dissertation acknowledges that a "no free lunch" theorem exists for data stream mining: no silver-bullet solution can solve all its problems. The dissertation focuses on the detection of changes in data distribution in data stream mining. These changes are called concept drift. Concept drift can be categorized into many types, and a detection algorithm often works only on some types of drift, not all of them. Because of this, the dissertation finds specific techniques to solve specific challenges instead of looking for a general solution. Then, this dissertation considers improving solutions for the challenges posed by the high arrival rate of data streams. Data stream mining frameworks often need to process vast amounts of data samples in limited time. Some data mining activities, notably data sample labeling for classification, are too costly or too slow at such a large scale. This dissertation presents two techniques that reduce the amount of labeling needed for data stream classification. The first is a grid-based label selection process that applies to highly imbalanced data streams, in which one class of data samples vastly outnumbers another.
Due to the imbalance, many majority-class samples need to be labeled before a minority-class sample can be found. The presented technique divides the data samples into groups, called grids, and actively searches for minority-class samples that are close by within a grid. Experiment results show the technique can reduce the total number of data samples that need to be labeled. The second technique is a smart preprocessing technique that reduces the number of times a new learning model needs to be trained due to concept drift. Less model training means fewer data labels are required, and thus lower cost. Experiment results show that in some cases the reduced performance of learning models is the result of improper preprocessing of the data, not of concept drift. By adapting preprocessing to the changes in data streams, models can retain high performance without retraining. Acknowledging the high cost of labeling, the dissertation then considers the scenario where labels are unavailable when needed. The Sliding Reservoir Approach for Delayed Labeling (SRADL) framework is presented to explore solutions to this problem: delayed labeling where concept drift occurs and no labels are immediately available. SRADL uses semi-supervised learning, employing a sliding-window approach to store historical data, which is combined with newly arriving unlabeled data to train new models. Experiments show that SRADL performs well in some cases of delayed labeling. Next, the dissertation considers improving solutions for the challenge of dynamism within data streams, most notably concept drift. The complex nature of concept drift means that most existing detection algorithms can only detect limited types of concept drift. To detect more types of concept drift, an ensemble approach that employs various algorithms, called the Heuristic Ensemble Framework for Concept Drift Detection (HEFDD), is presented.
The occurrence of each type of concept drift is voted on using the detection results of each algorithm in the ensemble; types of concept drift whose votes pass a majority are then declared detected. Experiment results show that HEFDD is able to improve detection accuracy significantly while reducing false positives. With the ability to detect various types of concept drift provided by HEFDD, the dissertation then improves the delayed labeling framework SRADL. A new combined framework, SRADL-HEFDD, is presented, which produces synthetic labels to handle the unavailability of labels from human experts. SRADL-HEFDD employs different synthetic labeling techniques based on the different types of drift detected by HEFDD. Experimental results show that, compared to the default SRADL, the combined framework improves prediction performance when only a small number of labeled samples is available. Finally, as machine learning applications are increasingly used in critical domains such as medical diagnostics, the accountability, explainability, and interpretability of machine learning algorithms need to be considered. Explainable machine learning aims to use a white-box approach for data analytics, enabling learning models to be explained and interpreted by human users. However, few studies have been done on explaining what has changed in a dynamic data stream environment. This dissertation thus presents the Data Stream Explainability (DSE) framework. DSE visualizes changes in data distribution and model classification boundaries between chunks of streaming data. The visualizations can then be used by a data mining researcher to generate explanations of what has changed within the data stream. To show that DSE can help average users better understand data stream mining, a survey was conducted with an expert group and a non-expert group of users. Results show DSE can reduce the gap in understanding what changed in data stream mining between the two groups.
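A toy example of the kind of member detector such an ensemble would combine is a window-based mean-shift test: keep a reference window of early data and a sliding recent window, and flag drift when their means diverge. This is an illustrative sketch only, with invented names and thresholds; HEFDD's member detectors and voting logic are more sophisticated:

```python
# Illustrative sketch of a simple concept drift detector: compare the mean
# of a fixed reference window with a sliding recent window and flag drift
# when the difference exceeds a threshold.

from collections import deque

class WindowDriftDetector:
    def __init__(self, window=30, threshold=0.5):
        self.reference = deque(maxlen=window)  # early "baseline" data
        self.recent = deque(maxlen=window)     # sliding current window
        self.threshold = threshold

    def add(self, value):
        """Feed one stream value; return True if drift is detected."""
        if len(self.reference) < self.reference.maxlen:
            self.reference.append(value)
        else:
            self.recent.append(value)
        return self.drift_detected()

    def drift_detected(self):
        if len(self.recent) < self.recent.maxlen:
            return False  # not enough recent data yet
        ref_mean = sum(self.reference) / len(self.reference)
        cur_mean = sum(self.recent) / len(self.recent)
        return abs(cur_mean - ref_mean) > self.threshold

detector = WindowDriftDetector(window=20, threshold=0.5)
stream = [0.0] * 40 + [1.0] * 40   # abrupt drift halfway through
alarms = [i for i, v in enumerate(stream) if detector.add(v)]
print(alarms[0])  # first alarm, shortly after the drift at index 40
```

A detector like this catches abrupt mean shifts but misses, e.g., gradual or recurring drift; that blind spot is precisely why an ensemble such as HEFDD votes across detectors with different sensitivities.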