6 research outputs found

    Detecting Concept Drift With Neural Network Model Uncertainty

    Deployed machine learning models are confronted with the problem of changing data over time, a phenomenon also called concept drift. While existing approaches to concept drift detection already show convincing results, they require true labels as a prerequisite for successful drift detection. In many real-world application scenarios, like the ones covered in this work, true labels are scarce and their acquisition is expensive. Therefore, we introduce a new algorithm for drift detection, Uncertainty Drift Detection (UDD), which is able to detect drifts without access to true labels. Our approach is based on the uncertainty estimates provided by a deep neural network in combination with Monte Carlo Dropout. Structural changes over time are detected by applying the ADWIN technique to the uncertainty estimates, and detected drifts trigger a retraining of the prediction model. In contrast to input data-based drift detection, our approach considers the effects of the current input data on the properties of the prediction model rather than detecting change on the input data only (which can lead to unnecessary retrainings). We show that UDD outperforms other state-of-the-art strategies on two synthetic as well as ten real-world data sets for both regression and classification tasks.
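The idea in this abstract (monitor the model's own uncertainty rather than its labelled errors) can be illustrated with a minimal sketch. It is not the authors' implementation. The toy "dropout model", the class `HalfWindowShiftDetector`, and all parameter values are hypothetical, and the detector is a simplified fixed-window stand-in for ADWIN that only borrows its Hoeffding-style threshold.

```python
import math
import random
from collections import deque
from statistics import pstdev


def mc_dropout_uncertainty(stochastic_predict, x, n_passes=30):
    """Std. dev. of repeated stochastic forward passes for a single input."""
    return pstdev([stochastic_predict(x) for _ in range(n_passes)])


def make_toy_dropout_model(rng, n_units=10, drop_prob=0.5):
    """Hypothetical one-layer model whose units are randomly dropped per call."""
    weight = 1.0 / n_units

    def stochastic_predict(x):
        kept = sum(1 for _ in range(n_units) if rng.random() >= drop_prob)
        # Inverted-dropout scaling keeps the expected prediction equal to x.
        return x * weight * kept / (1.0 - drop_prob)

    return stochastic_predict


class HalfWindowShiftDetector:
    """Simplified stand-in for ADWIN (assumption, not the real algorithm).

    Compares the mean of the older and the newer half of a fixed-size window
    and signals drift when they differ by more than a Hoeffding-style bound.
    """

    def __init__(self, window_size=200, delta=0.002, value_range=1.0):
        self.window = deque(maxlen=window_size)
        self.delta = delta
        self.value_range = value_range  # assumed rough bound on the monitored values

    def update(self, value):
        self.window.append(value)
        n = len(self.window)
        if n < self.window.maxlen:
            return False
        half = n // 2
        values = list(self.window)
        old_mean = sum(values[:half]) / half
        new_mean = sum(values[half:]) / (n - half)
        threshold = self.value_range * math.sqrt(math.log(4.0 / self.delta) / half)
        if abs(new_mean - old_mean) > threshold:
            self.window.clear()  # start fresh after a detected change
            return True
        return False


def run_demo(n_steps=600, drift_at=300, seed=0):
    """Stream unlabeled inputs; the input scale (and thus uncertainty) shifts at drift_at."""
    rng = random.Random(seed)
    predict = make_toy_dropout_model(rng)
    detector = HalfWindowShiftDetector()
    drifts = []
    for t in range(n_steps):
        centre = 0.5 if t < drift_at else 2.0
        x = rng.uniform(centre - 0.05, centre + 0.05)
        uncertainty = mc_dropout_uncertainty(predict, x)
        if detector.update(uncertainty):
            drifts.append(t)  # a real pipeline would trigger retraining here
    return drifts


if __name__ == "__main__":
    print("drift detected at steps:", run_demo())
```

No true labels appear anywhere in the loop: only the spread of the stochastic predictions is monitored, which is the property the abstract emphasises.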

    Rule-based preprocessing for data stream mining using complex event processing

    Data preprocessing is known to be essential to produce accurate data from which mining methods are able to extract valuable knowledge. When data constantly arrives from one or more sources, preprocessing techniques need to be adapted to efficiently handle these data streams. To help domain experts define and execute preprocessing tasks for data streams, this paper proposes the use of active rule-based systems and, more specifically, complex event processing (CEP) languages and engines. The main contribution of our approach is the formulation of preprocessing procedures as event detection rules, expressed in an SQL-like language, that give domain experts a simple way to manipulate temporal data. This idea is materialized into a publicly available solution that integrates a CEP engine with a library for online data mining. To evaluate our approach, we present three practical scenarios in which CEP rules preprocess data streams with the aim of adding temporal information, transforming features and handling missing values. Experiments show how CEP rules provide an effective language to express preprocessing tasks in a modular and high-level manner, without significant time and memory overheads. The resulting data streams not only help improve the predictive accuracy of classification algorithms, but in some cases also reduce the complexity of the decision models and the time needed for learning.
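The three preprocessing scenarios named in this abstract (adding temporal information, transforming features, handling missing values) can be mimicked with a rough sketch. The paper expresses such rules in an SQL-like CEP language; the plain Python generator below is only an analogy, and the field names (`ts`, `temp`, `temp_f`, `elapsed`) and rule functions are hypothetical.

```python
from typing import Callable, Dict, Iterable, Iterator, List, Optional

Event = Dict[str, Optional[float]]
Rule = Callable[[dict, Event], Event]


def add_elapsed_time(state: dict, event: Event) -> Event:
    """Temporal information: seconds since the previous event."""
    last_ts = state.get("last_ts")
    event["elapsed"] = 0.0 if last_ts is None else event["ts"] - last_ts
    state["last_ts"] = event["ts"]
    return event


def fill_missing_with_last(state: dict, event: Event) -> Event:
    """Missing values: carry the last observed temperature forward."""
    if event.get("temp") is None:
        event["temp"] = state.get("last_temp")
    else:
        state["last_temp"] = event["temp"]
    return event


def celsius_to_fahrenheit(state: dict, event: Event) -> Event:
    """Feature transformation: derive a Fahrenheit column when possible."""
    if event.get("temp") is not None:
        event["temp_f"] = event["temp"] * 9 / 5 + 32
    return event


def preprocess(stream: Iterable[Event], rules: List[Rule]) -> Iterator[Event]:
    """Apply each rule in order to every event, one event at a time."""
    state: dict = {}  # shared memory across events, e.g. last seen values
    for event in stream:
        event = dict(event)  # do not mutate the caller's data
        for rule in rules:
            event = rule(state, event)
        yield event


def run_demo() -> List[Event]:
    stream = [
        {"ts": 0.0, "temp": 20.0},
        {"ts": 5.0, "temp": None},  # sensor dropout
        {"ts": 7.0, "temp": 25.0},
    ]
    rules = [add_elapsed_time, fill_missing_with_last, celsius_to_fahrenheit]
    return list(preprocess(stream, rules))


if __name__ == "__main__":
    for processed in run_demo():
        print(processed)
```

Keeping each rule as a small independent function loosely mirrors the modularity the abstract attributes to CEP rules: rules can be added, removed, or reordered without touching the loop that consumes the stream.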

    Challenges in benchmarking stream learning algorithms with real-world data

    Streaming data are increasingly present in real-world applications such as sensor measurements, satellite data feeds, stock markets, and financial data. The main characteristics of these applications are the online arrival of data observations at high speed and the susceptibility to changes in the data distributions due to the dynamic nature of real environments. The data stream mining community still faces some primary challenges and difficulties related to the comparison and evaluation of new proposals, mainly due to the lack of publicly available, high-quality, non-stationary real-world datasets. The comparison of stream algorithms proposed in the literature is not an easy task, as authors do not always follow the same recommendations, experimental evaluation procedures, datasets, and assumptions. In this paper, we mitigate problems related to the choice of datasets in the experimental evaluation of stream classifiers and drift detectors. To that end, we propose a new public data repository for benchmarking stream algorithms with real-world data. This repository contains the most popular datasets from the literature and new datasets related to a highly relevant public health problem that involves the recognition of disease vector insects using optical sensors. The main advantage of these new datasets is the prior knowledge of their characteristics and patterns of changes, which makes it possible to adequately evaluate new adaptive algorithms. We also present an in-depth discussion about the characteristics, reasons, and issues that lead to different types of changes in data distribution, as well as a critical review of common problems concerning the current benchmark datasets available in the literature.

    Towards Reliable Machine Learning in Evolving Data Streams

    Data streams are ubiquitous in many areas of modern life. For example, applications in healthcare, education, finance, or advertising often deal with large-scale and evolving data streams. Compared to stationary applications, data streams pose considerable additional challenges for automated decision making and machine learning. Indeed, online machine learning methods must cope with limited memory capacities, real-time requirements, and drifts in the data generating process. At the same time, online learning methods should provide a high predictive quality, stability in the presence of input noise, and good interpretability in order to be reliably used in practice. In this thesis, we address some of the most important aspects of machine learning in evolving data streams. Specifically, we identify four open issues related to online feature selection, concept drift detection, online classification, local explainability, and the evaluation of online learning methods. In these contexts, we present new theoretical and empirical findings as well as novel frameworks and implementations. In particular, we propose new approaches for online feature selection and concept drift detection that can account for model uncertainties and thus achieve more stable results. Moreover, we introduce a new incremental decision tree that retains valuable interpretability properties and a new change detection framework that allows for more efficient explanations based on local feature attributions. In fact, this is one of the first works to address intrinsic model interpretability and local explainability in the presence of incremental updates and concept drift. Along with this thesis, we provide extensive open resources related to online machine learning. Notably, we introduce a new Python framework that enables simplified and standardized evaluations and can thus serve as a basis for more comparable online learning experiments in the future. 
In total, this thesis is based on six publications, five of which were peer-reviewed at the time of publication of this thesis. Our work touches all major areas of predictive modeling in data streams and proposes novel solutions for efficient, stable, interpretable and thus reliable online machine learning.