67 research outputs found

    Process-Oriented Stream Classification Pipeline: A Literature Review

    Featured Application: Nowadays, many applications and disciplines work on the basis of stream data. Common examples are the IoT sector (e.g., sensor data analysis), or video, image, and text analysis applications (e.g., in social media analytics or astronomy). With our work, we gather different approaches and terminology, and give a broad overview of the topic. Our main target groups are practitioners and newcomers to the field of data stream classification. Due to the rise of continuous data-generating applications, analyzing data streams has gained increasing attention over the past decades. A core research area in stream data is stream classification, which categorizes or detects data points within an evolving stream of observations. Areas of stream classification are diverse, ranging, e.g., from monitoring sensor data to analyzing a wide range of (social) media applications. Research in stream classification is concerned with developing methods that adapt to changing and potentially volatile data streams. It focuses on individual aspects of the stream classification pipeline, e.g., designing suitable algorithm architectures, efficient train-and-test procedures, or detecting so-called concept drifts. As a result of the many different research questions and strands, the field is challenging to grasp, especially for beginners. This survey explores, summarizes, and categorizes work within the domain of stream classification and identifies core research threads over the past few years. It is structured along the stream classification process to facilitate orientation within this complex topic, including common application scenarios and benchmarking data sets. Thus, both newcomers to the field and experts who want to widen their scope can gain (additional) insight into this research area and find starting points and pointers to more in-depth literature on specific issues and research directions in the field.

    Learning in Dynamic Data-Streams with a Scarcity of Labels

    Analysing data in real-time is a natural and necessary progression from traditional data mining. However, real-time analysis presents additional challenges to batch analysis; along with strict time and memory constraints, change is a major consideration. In a dynamic stream, the underlying process generating the stream is assumed to be non-stationary, so that concepts within the stream will drift and change over time. Falsely assuming that a stream is stationary will cause non-adaptive models to degrade and eventually become obsolete. The challenge of recognising and reacting to change in a stream is compounded by the scarcity-of-labels problem: the very realistic situation in which the true class label of an incoming point is not immediately available (or will never be available), or in which manually labelling incoming points is prohibitively expensive. The goal of this thesis is to evaluate unsupervised learning as the basis for online classification in dynamic data streams with a scarcity of labels. To realise this goal, a novel stream clustering algorithm based on the collective behaviour of ants, Ant Colony Stream Clustering (ACSC), is proposed. This algorithm is shown to be faster and more accurate than comparable peer stream-clustering algorithms while requiring fewer sensitive parameters. The principles of ACSC are extended in a second stream-clustering algorithm named Multi-Density Stream Clustering (MDSC). This algorithm has adaptive parameters and, crucially, can track clusters and monitor their dynamic behaviour over time. A novel technique called a Dynamic Feature Mask (DFM) is proposed to "sit on top" of these stream-clustering algorithms and observe and track change at the feature level in a data stream. This feature mask acts as an unsupervised feature selection method, allowing high-dimensional streams to be clustered.
Finally, data-stream clustering is evaluated as an approach to one-class classification, and a novel framework (named COCEL: Clustering and One-class Classification Ensemble Learning) for classification in dynamic streams with a scarcity of labels is described. The proposed framework can identify and react to change in a stream and hugely reduces the number of required labels (typically less than 0.05% of the entire stream).
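The one-class use of stream clustering described above can be illustrated with a minimal sketch. This is not the ACSC, MDSC, or COCEL algorithm itself; the class and parameter names are hypothetical. Unlabeled points are absorbed into centroids one at a time, and a query point far from every centroid is treated as not belonging to the learned class.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

class StreamOneClass:
    """Toy one-class classifier: points close to any centroid count as 'normal'."""

    def __init__(self, radius):
        self.radius = radius      # maximum distance for a point to join a centroid
        self.centroids = []

    def observe(self, point):
        # Absorb a stream point: update the nearest centroid or open a new one.
        if not self.centroids:
            self.centroids.append(list(point))
            return
        nearest = min(self.centroids, key=lambda c: euclidean(c, point))
        if euclidean(nearest, point) <= self.radius:
            for i in range(len(nearest)):      # crude running average update
                nearest[i] = (nearest[i] + point[i]) / 2
        else:
            self.centroids.append(list(point))

    def is_normal(self, point):
        return any(euclidean(c, point) <= self.radius for c in self.centroids)
```

A real stream-clustering algorithm would additionally age out stale centroids and merge dense regions, which is what allows cluster behaviour to be tracked over time.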

    Adaptive Algorithms For Classification On High-Frequency Data Streams: Application To Finance

    International Mention in the doctoral degree. In recent years, the problem of concept drift has gained importance in the financial domain. The succession of manias, panics, and crashes has stressed the non-stationary nature and the likelihood of drastic structural changes in financial markets. The most recent literature suggests the use of conventional machine learning and statistical approaches for this. However, these techniques are unable or slow to adapt to non-stationarities and may require re-training over time, which is computationally expensive and brings financial risks. This thesis proposes a set of adaptive algorithms to deal with high-frequency data streams and applies them to the financial domain. We present approaches to handle different types of concept drift and perform predictions using up-to-date models. These mechanisms are designed to provide fast reaction times and are thus applicable to high-frequency data. The core experiments of this thesis are based on the prediction of the price movement direction at different intraday resolutions in the SPDR S&P 500 exchange-traded fund. The proposed algorithms are benchmarked against other popular methods from the data stream mining literature and achieve competitive results. We believe that this thesis opens good research prospects for financial forecasting during market instability and structural breaks. Results have shown that our proposed methods can improve prediction accuracy in many of these scenarios. Indeed, the results obtained are compatible with ideas against the efficient market hypothesis. However, we cannot claim that we can consistently beat buy-and-hold; therefore, we cannot reject it. Doctoral Program in Computer Science and Technology, Universidad Carlos III de Madrid. President: Gustavo Recio Isasi.- Secretary: Pedro Isasi Viñuela.- Committee member: Sandra García Rodrígue
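Drift detection of the kind discussed above can be sketched generically. This is not one of the thesis's algorithms; the window size and threshold are illustrative. The idea is to compare a model's recent error rate against an older reference window and flag drift when the gap grows too large.

```python
from collections import deque

class WindowDriftDetector:
    """Flags drift when the recent error rate exceeds the reference rate
    by more than a fixed margin (a simplified two-window scheme)."""

    def __init__(self, window=30, threshold=0.2):
        self.window = window
        self.threshold = threshold
        self.reference = deque(maxlen=window)  # older prediction errors
        self.recent = deque(maxlen=window)     # newest prediction errors

    def add(self, error):
        # error: 1 if the model misclassified the sample, else 0.
        if len(self.recent) == self.recent.maxlen:
            self.reference.append(self.recent.popleft())
        self.recent.append(error)

    def drift_detected(self):
        if len(self.reference) < self.window or len(self.recent) < self.window:
            return False       # not enough history yet
        ref_rate = sum(self.reference) / len(self.reference)
        rec_rate = sum(self.recent) / len(self.recent)
        return rec_rate - ref_rate > self.threshold
```

On detection, a fast-reacting system would typically replace or retrain the model on the recent window only, which is what keeps reaction times compatible with high-frequency data.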

    Adaptive Automated Machine Learning

    The ever-growing demand for machine learning has led to the development of automated machine learning (AutoML) systems that can be used off the shelf by non-experts. Furthermore, the demand for ML applications with high predictive performance exceeds the number of machine learning experts, which makes the development of AutoML systems necessary. Automated machine learning tackles the problem of finding machine learning models with high predictive performance. Existing approaches incorporating deep learning techniques assume that all data is available at the beginning of the training process (offline learning). They configure and optimise a pipeline of preprocessing, feature engineering, and model selection by choosing suitable hyperparameters in each step of the model pipeline. Furthermore, they assume that the user is fully aware of the choice, and thus the consequences, of the underlying metric (such as precision, recall, or F1-measure). By varying this metric, the search for suitable configurations, and thus the adaptation of algorithms, can be tailored to the user's needs. With the creation of a vast amount of data from all kinds of sources every day, our capability to process and understand these data sets in a single batch is no longer viable. By training machine learning models incrementally (i.e., online learning), the flood of data can be processed sequentially within data streams. However, if one assumes an online learning scenario, where an AutoML instance executes on evolving data streams, the question of the best model and its configuration remains open. In this work, we address the adaptation of AutoML in an offline learning scenario toward a certain utility an end-user might pursue, as well as the adaptation of AutoML towards evolving data streams in an online learning scenario, with three main contributions:
1. We propose a system that allows the adaptation of AutoML and the search for neural architectures towards a particular utility an end-user might pursue. 2. We introduce an online deep learning framework that fosters research on deep learning models under the online learning assumption and enables the automated search for neural architectures. 3. We introduce an online AutoML framework that allows the incremental adaptation of ML models. We evaluate the contributions individually, in accordance with predefined requirements and against state-of-the-art evaluation setups. The outcomes lead us to conclude that (i) AutoML, as well as systems for neural architecture search, can be steered towards individual utilities by learning a designated ranking model from pairwise preferences and using the latter as the target function for the offline learning scenario; (ii) architecturally small neural networks are in general suitable under an online learning scenario; (iii) the configuration of machine learning pipelines can automatically be adapted to ever-evolving data streams, leading to better performance.
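The prequential "test-then-train" protocol that underlies online model selection can be sketched as follows. This is a generic illustration, not the framework proposed in the thesis; all class names are hypothetical. Each candidate model is scored on every sample before learning from it, with exponential fading so that stale performance is gradually forgotten.

```python
class OnlineModelSelector:
    """Prequential 'test-then-train' selection among incremental models."""

    def __init__(self, models, fade=0.99):
        self.models = models              # each offers predict(x) and learn(x, y)
        self.scores = [0.0] * len(models)
        self.fade = fade                  # exponential forgetting of old accuracy

    def process(self, x, y):
        # Test first (before the label influences the model), then train.
        for i, m in enumerate(self.models):
            correct = 1.0 if m.predict(x) == y else 0.0
            self.scores[i] = self.fade * self.scores[i] + (1 - self.fade) * correct
            m.learn(x, y)

    def best(self):
        return max(range(len(self.models)), key=lambda i: self.scores[i])

class ConstantModel:
    """Toy incremental model that always predicts one class."""

    def __init__(self, label):
        self.label = label

    def predict(self, x):
        return self.label

    def learn(self, x, y):
        pass  # a real incremental model would update its parameters here
```

The fading factor is what makes the selection adaptive: after a drift, the scores of models tuned to the old concept decay, so `best()` shifts to whichever configuration suits the current stream.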

    Continual learning from stationary and non-stationary data

    Continual learning aims at developing models that are capable of working on constantly evolving problems over a long time horizon. In such environments, we can distinguish three essential aspects of training and maintaining machine learning models: incorporating new knowledge, retaining it, and reacting to changes. Each of them poses its own challenges, constituting a compound problem with multiple goals. Remembering previously incorporated concepts is the main property required of a model when dealing with stationary distributions. In non-stationary environments, models should be capable of selectively forgetting outdated decision boundaries and adapting to new concepts. Finally, a significant difficulty lies in combining these two abilities within a single learning algorithm, since in such scenarios we have to balance remembering and forgetting instead of focusing on only one aspect. The presented dissertation addresses these problems in an exploratory way. Its main goal was to grasp the continual learning paradigm as a whole, analyze its different branches, and tackle identified issues covering various aspects of learning from sequentially incoming data. By doing so, this work not only filled several gaps in current continual learning research but also emphasized the complexity and diversity of the challenges existing in this domain. Comprehensive experiments conducted for all of the presented contributions have demonstrated their effectiveness and substantiated the validity of the stated claims.

    Adaptive classifier ensembles for face recognition in video-surveillance

    When implementing security systems such as intelligent video surveillance, the use of face images offers many advantages over other biometric traits. In particular, it allows potential individuals of interest to be detected discreetly and non-intrusively, which can be especially advantageous in situations such as watchlist screening, searching archived footage, or face re-identification. Despite this, face recognition still faces many difficulties specific to video surveillance. Among others, the lack of control over the observed environment implies many variations in lighting conditions, image resolution, motion blur, and face orientation and expression. To recognize individuals, face models are usually generated from a limited number of reference images or videos collected during enrollment sessions. However, since these acquisitions do not necessarily take place under the same observation conditions, the reference data do not always represent the complexity of the real problem. Moreover, although face models can be adapted when new reference data become available, incremental learning based on significantly different data exposes the system to a risk of knowledge corruption. Finally, only part of this knowledge is actually relevant for classifying a given image. In this thesis, a new system is proposed for the automatic detection of individuals of interest in video surveillance.
More specifically, it focuses on a user-centric scenario, where a face recognition system is integrated into a decision support tool to alert an operator when an individual of interest is detected in video feeds. Such a system must be able to add or remove individuals of interest during operation, as well as update their face models over time with new reference data. To this end, the proposed system relies on concept change detection to guide a learning strategy involving classifier ensembles. Each individual enrolled in the system is represented by an ensemble of two-class classifiers, each specialized in different observation conditions detected in the reference data. In addition, a new rule for the dynamic fusion of classifier ensembles is proposed, using concept models to estimate the relevance of each classifier with respect to each image to be classified. Finally, faces are tracked from one frame to the next in order to group them into trajectories and accumulate decisions over time. In Chapter 2, concept change detection is first used to limit the growth in complexity of a template-matching system that adopts a self-updating strategy for its galleries. A new context-sensitive approach is proposed, in which only high-confidence images captured under different observation conditions are used to update the face models. Experiments were conducted with three public face databases, using a standard template-matching system combined with a module for detecting changes in illumination conditions.
The results show that the proposed approach reduces the complexity of these systems while maintaining performance over time. In Chapter 3, a new adaptive system based on classifier ensembles is proposed for face recognition in video surveillance. It is composed of an ensemble of incremental classifiers for each enrolled individual, and relies on concept change detection to refine the face models when new data become available. A hybrid strategy is proposed, in which classifiers are added to the ensembles only when an abrupt change is detected in the reference data. In the case of a gradual change, the associated classifiers are updated instead, which refines the knowledge specific to the corresponding concept. A particular implementation of this system is proposed, using ensembles of probabilistic Fuzzy-ARTMAP classifiers, generated and updated with a strategy based on dynamic particle swarm optimization, and using the Hellinger distance between histograms to detect changes. Simulations carried out on the Faces in Action (FIA) video surveillance database show that the proposed system maintains a high level of performance over time while limiting knowledge corruption. It achieves classification performance superior to a similar passive system (without change detection), as well as to reference systems such as probabilistic kNN and TCM-kNN. In Chapter 4, an evolution of the system presented in Chapter 3 is proposed, integrating mechanisms to dynamically adapt the system's behavior to changing observation conditions during operation.
A new fusion rule based on dynamic weighting is proposed, assigning each classifier a weight proportional to its estimated competence with respect to each image to be classified. Moreover, these competences are estimated using the concept models already employed during learning for change detection, which reduces the resources required during operation. An evolution of the implementation proposed in Chapter 3 is presented, in which concepts are modeled using the Fuzzy C-Means clustering algorithm and classifier fusion is performed with a weighted average. Experimental simulations with the FIA and Chokepoint video surveillance databases show that the proposed fusion method achieves better results than the DSOLA dynamic selection method while using considerably fewer computational resources. In addition, the proposed method shows classification performance superior to reference systems such as probabilistic kNN, TCM-kNN, and Adaptive Sparse Coding.
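The dynamic weighting rule described above, where each classifier receives a weight proportional to its estimated competence for the current capture conditions, can be sketched as follows. This is a simplified illustration using a toy Gaussian competence estimate, not the thesis's Fuzzy C-Means-based implementation; all function names are hypothetical.

```python
import math

def gaussian_competence(x, concept_mean, concept_var):
    """Toy competence score: how well x matches the concept (capture
    conditions) a classifier was trained on."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, concept_mean))
    return math.exp(-sq_dist / (2 * concept_var))

def fuse(x, classifiers, concepts, var=1.0):
    """Weighted-average fusion: each classifier's score on x is weighted
    by its competence for the concept x appears to belong to."""
    weights = [gaussian_competence(x, c, var) for c in concepts]
    total = sum(weights) or 1.0          # avoid division by zero
    scores = [clf(x) for clf in classifiers]
    return sum(w * s for w, s in zip(weights, scores)) / total
```

The appeal of this style of fusion is that the concept models are reused from the change-detection stage, so estimating competence at classification time adds little extra cost.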

    Aggregation of Heterogeneous Anomaly Detectors for Cyber-Physical Systems

    Distributed, life-critical systems that bridge the gap between software and hardware are becoming an integral part of our everyday lives. From autonomous cars to smart electrical grids, such cyber-physical systems will soon be omnipresent. With this comes a corresponding increase in our vulnerability to cyber-attacks. Monitoring such systems to detect malicious actions is of critical importance. One method of monitoring cyber-physical systems is anomaly detection: the process of detecting when the target system deviates from expected normal behavior. Anomaly detection is a vibrant research area with many different viable approaches. The literature suggests many different anomaly detection methods for the diversity and volume of data from cyber-physical systems. We focus on aggregating the results of multiple anomaly detection methods into a final anomalous or non-anomalous verdict. In this thesis, we present Palisade, a distributed data collection, anomaly detection, and aggregation framework for cyber-physical systems. We discuss various methods of anomaly detection and aggregation and include a case study of anomaly aggregation on a cyber-physical treadmill driving demonstrator. We conclude with a discussion of lessons learned from the construction of Palisade and recommendations for future research.
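The aggregation step described above can be sketched with a simple voting rule over per-detector verdicts. This is a generic illustration, not Palisade's actual aggregation logic; the optional weights (e.g., per-detector trust) are an assumption of the sketch.

```python
def aggregate(verdicts, weights=None, threshold=0.5):
    """Combine heterogeneous detectors' boolean anomaly verdicts into one
    final verdict by (optionally weighted) voting.

    verdicts  -- list of True/False, one per detector
    weights   -- optional per-detector trust weights (default: equal)
    threshold -- fraction of total weight needed to declare an anomaly
    """
    if weights is None:
        weights = [1.0] * len(verdicts)
    total = sum(weights)
    anomalous_weight = sum(w for v, w in zip(verdicts, weights) if v)
    return anomalous_weight / total >= threshold
```

Weighting lets a highly reliable detector (say, a model-based residual check) outvote several noisy statistical detectors, which is one common motivation for aggregating heterogeneous detectors rather than trusting any single one.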

    Solving the challenges of concept drift in data stream classification.

    The rise of network-connected devices and applications leads to a significant increase in the volume of data that are continuously generated over time, called data streams. In real-world applications, storing the entirety of a data stream for later analysis is often not practical, due to the stream's potentially infinite volume. Data stream mining techniques and frameworks are therefore created to analyze streaming data as they arrive. However, compared to traditional data mining techniques, challenges unique to data stream mining also emerge, due to the high arrival rate of data streams and their dynamic nature. In this dissertation, an array of techniques and frameworks is presented to improve solutions to some of these challenges. First, this dissertation acknowledges that a "no free lunch" theorem exists for data stream mining: no silver-bullet solution can solve all problems of data stream mining. The dissertation focuses on the detection of changes in data distribution, called concept drift. Concept drift can be categorized into many types, and a detection algorithm often works only on some types of drift, not all of them. Because of this, the dissertation finds specific techniques to solve specific challenges, instead of looking for a general solution. Then, this dissertation considers improving solutions for the challenges posed by the high arrival rate of data streams. Data stream mining frameworks often need to process vast amounts of data samples in limited time. Some data mining activities, notably data sample labeling for classification, are too costly or too slow at such a scale. This dissertation presents two techniques that reduce the amount of labeling needed for data stream classification. The first technique presents a grid-based label selection process that applies to highly imbalanced data streams, in which one class of data samples vastly outnumbers another.
Due to the imbalance, many majority-class samples need to be labeled before a minority-class sample can be found. The presented technique divides the data samples into groups, called grids, and actively searches for minority-class samples that are close by within a grid. Experiment results show the technique can reduce the total number of data samples that need to be labeled. The second technique presents a smart preprocessing technique that reduces the number of times a new learning model needs to be trained due to concept drift. Less model training means fewer data labels are required, and thus lower cost. Experiment results show that in some cases the reduced performance of learning models is the result of improper preprocessing of the data, not of concept drift. By adapting preprocessing to the changes in data streams, models can retain high performance without retraining. Acknowledging the high cost of labeling, the dissertation then considers the scenario where labels are unavailable when needed. The framework Sliding Reservoir Approach for Delayed Labeling (SRADL) is presented to explore solutions to this problem. SRADL addresses the delayed labeling problem, where concept drift occurs and no labels are immediately available. SRADL uses semi-supervised learning, employing a sliding-window approach to store historical data, which is combined with new unlabeled data to train new models. Experiments show that SRADL performs well in some cases of delayed labeling. Next, the dissertation considers improving solutions for the challenge of dynamism within data streams, most notably concept drift. The complex nature of concept drift means that most existing detection algorithms can only detect limited types of concept drift. To detect more types of concept drift, an ensemble approach that employs various algorithms, called the Heuristic Ensemble Framework for Concept Drift Detection (HEFDD), is presented.
The occurrence of each type of concept drift is voted on by the detection results of each algorithm in the ensemble, and a type of concept drift is declared detected when its votes exceed a majority. Experiment results show that HEFDD is able to improve detection accuracy significantly while reducing false positives. With the ability to detect various types of concept drift provided by HEFDD, the dissertation then improves the delayed labeling framework SRADL. A new combined framework, SRADL-HEFDD, is presented, which produces synthetic labels to handle the unavailability of labels from human experts. SRADL-HEFDD employs different synthetic labeling techniques based on the different types of drift detected by HEFDD. Experimental results show that, compared to the default SRADL, the combined framework improves prediction performance when a small number of labeled samples is available. Finally, as machine learning applications are increasingly used in critical domains such as medical diagnostics, the accountability, explainability, and interpretability of machine learning algorithms need to be considered. Explainable machine learning aims to use a white-box approach for data analytics, which enables learning models to be explained and interpreted by human users. However, few studies have been done on explaining what has changed in a dynamic data stream environment. This dissertation thus presents the Data Stream Explainability (DSE) framework. DSE visualizes changes in data distribution and model classification boundaries between chunks of streaming data. The visualizations can then be used by a data mining researcher to generate explanations of what has changed within the data stream. To show that DSE can help average users understand data stream mining better, a survey was conducted with an expert group and a non-expert group of users. Results show DSE can reduce the gap in understanding of what changed in data stream mining between the two groups.
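The per-type majority voting described for HEFDD can be sketched as follows. This is a simplified illustration of the voting idea, not the actual framework; it assumes each detector reports only the drift types it can recognize, and a type is declared detected when strictly more than half of the reporting detectors vote yes.

```python
def vote_drift_types(detector_reports):
    """detector_reports: one dict per detector, mapping a drift type
    (e.g., 'abrupt', 'gradual') to that detector's boolean verdict.
    Returns a dict declaring each type detected or not by strict majority
    among the detectors able to report on it."""
    tallies = {}  # drift type -> (yes votes, total votes)
    for report in detector_reports:
        for drift_type, detected in report.items():
            yes, total = tallies.get(drift_type, (0, 0))
            tallies[drift_type] = (yes + int(detected), total + 1)
    return {t: yes * 2 > total for t, (yes, total) in tallies.items()}
```

Counting votes per drift type, rather than over a single global verdict, is what lets an ensemble of narrow detectors cover more drift types than any single algorithm while keeping false positives down.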
    • 
