228 research outputs found

    Process-Oriented Stream Classification Pipeline:A Literature Review

    Get PDF
    Featured Application: Nowadays, many applications and disciplines work on the basis of stream data. Common examples are the IoT sector (e.g., sensor data analysis), or video, image, and text analysis applications (e.g., in social media analytics or astronomy). With our work, we gather different approaches and terminology, and give a broad overview over the topic. Our main target groups are practitioners and newcomers to the field of data stream classification. Due to the rise of continuous data-generating applications, analyzing data streams has gained increasing attention over the past decades. A core research area in stream data is stream classification, which categorizes or detects data points within an evolving stream of observations. Areas of stream classification are diverse—ranging, e.g., from monitoring sensor data to analyzing a wide range of (social) media applications. Research in stream classification is related to developing methods that adapt to the changing and potentially volatile data stream. It focuses on individual aspects of the stream classification pipeline, e.g., designing suitable algorithm architectures, an efficient train and test procedure, or detecting so-called concept drifts. As a result of the many different research questions and strands, the field is challenging to grasp, especially for beginners. This survey explores, summarizes, and categorizes work within the domain of stream classification and identifies core research threads over the past few years. It is structured based on the stream classification process to facilitate coordination within this complex topic, including common application scenarios and benchmarking data sets. Thus, both newcomers to the field and experts who want to widen their scope can gain (additional) insight into this research area and find starting points and pointers to more in-depth literature on specific issues and research directions in the field.</p

    Learning in Dynamic Data-Streams with a Scarcity of Labels

    Get PDF
    Analysing data in real-time is a natural and necessary progression from traditional data mining. However, real-time analysis presents additional challenges to batch-analysis; along with strict time and memory constraints, change is a major consideration. In a dynamic stream there is an assumption that the underlying process generating the stream is non-stationary and that concepts within the stream will drift and change over time. Adopting a false assumption that a stream is stationary will result in non-adaptive models degrading and eventually becoming obsolete. The challenge of recognising and reacting to change in a stream is compounded by the scarcity of labels problem. This refers to the very realistic situation in which the true class label of an incoming point is not immediately available (or will never be available) or in situations where manually labelling incoming points is prohibitively expensive. The goal of this thesis is to evaluate unsupervised learning as the basis for online classification in dynamic data-streams with a scarcity of labels. To realise this goal, a novel stream clustering algorithm based on the collective behaviour of ants (Ant Colony Stream Clustering (ACSC)) is proposed. This algorithm is shown to be faster and more accurate than comparative, peer stream-clustering algorithms while requiring fewer sensitive parameters. The principles of ACSC are extended in a second stream-clustering algorithm named Multi-Density Stream Clustering (MDSC). This algorithm has adaptive parameters and crucially, can track clusters and monitor their dynamic behaviour over time. A novel technique called a Dynamic Feature Mask (DFM) is proposed to ``sit on top’’ of these stream-clustering algorithms and can be used to observe and track change at the feature level in a data stream. This Feature Mask acts as an unsupervised feature selection method allowing high-dimensional streams to be clustered. Finally, data-stream clustering is evaluated as an approach to one-class classification and a novel framework (named COCEL: Clustering and One class Classification Ensemble Learning) for classification in dynamic streams with a scarcity of labels is described. The proposed framework can identify and react to change in a stream and hugely reduces the number of required labels (typically less than 0.05% of the entire stream)

    Ensemble based on randomised neural networks for online data stream regression in presence of concept drift

    Get PDF
    The big data paradigm has posed new challenges for the Machine Learning algorithms, such as analysing continuous flows of data, in the form of data streams, and dealing with the evolving nature of the data, which cause a phenomenon often referred to in the literature as concept drift. Concept drift is caused by inconsistencies between the optimal hypotheses in two subsequent chunks of data, whereby the concept underlying a given process evolves over time, which can happen due to several factors including change in consumer preference, economic dynamics, or environmental conditions. This thesis explores the problem of data stream regression with the presence of concept drift. This problem requires computationally efficient algorithms that are able to adapt to the various types of drift that may affect the data. The development of effective algorithms for data streams with concept drift requires several steps that are discussed in this research. The first one is related to the datasets required to assess the algorithms. In general, it is not possible to determine the occurrence of concept drift on real-world datasets; therefore, synthetic datasets where the various types of concept drift can be simulated are required. The second issue is related to the choice of the algorithm. The ensemble algorithms show many advantages to deal with concept drifting data streams, which include flexibility, computational efficiency and high accuracy. For the design of an effective ensemble, this research analyses the use of randomised Neural Networks as base models, along with their optimisation. The optimisation of the randomised Neural Networks involves design and tuning hyperparameters which may substantially affect its performance. The optimisation of the base models is an important aspect to build highly accurate and computationally efficient ensembles. To cope with the concept drift, the existing methods either require setting fixed updating points, which may result in unnecessary computations or slow reaction to concept drift, or rely on drifting detection mechanism, which may be ineffective due to the difficulty to detect drift in real applications. Therefore, the research contributions of this thesis include the development of a new approach for synthetic dataset generation, development of a new hyperparameter optimisation algorithm that reduces the search effort and the need of prior assumptions compared to existing methods, the analysis of the effects of randomised Neural Networks hyperparameters, and the development of a new ensemble algorithm based on bagging meta-model that reduces the computational effort over existing methods and uses an innovative updating mechanism to cope with concept drift. The algorithms have been tested on synthetic datasets and validated on four real-world datasets from various application domains

    FairCanary: Rapid Continuous Explainable Fairness

    Full text link
    Machine Learning (ML) models are being used in all facets of today's society to make high stake decisions like bail granting or credit lending, with very minimal regulations. Such systems are extremely vulnerable to both propagating and amplifying social biases, and have therefore been subject to growing research interest. One of the main issues with conventional fairness metrics is their narrow definitions which hide the complete extent of the bias by focusing primarily on positive and/or negative outcomes, whilst not paying attention to the overall distributional shape. Moreover, these metrics are often contradictory to each other, are severely restrained by the contextual and legal landscape of the problem, have technical constraints like poor support for continuous outputs, the requirement of class labels, and are not explainable. In this paper, we present Quantile Demographic Drift, which addresses the shortcomings mentioned above. This metric can also be used to measure intra-group privilege. It is easily interpretable via existing attribution techniques, and also extends naturally to individual fairness via the principle of like-for-like comparison. We make this new fairness score the basis of a new system that is designed to detect bias in production ML models without the need for labels. We call the system FairCanary because of its capability to detect bias in a live deployed model and narrow down the alert to the responsible set of features, like the proverbial canary in a coal mine

    COMPOSE: Compacted object sample extraction a framework for semi-supervised learning in nonstationary environments

    Get PDF
    An increasing number of real-world applications are associated with streaming data drawn from drifting and nonstationary distributions. These applications demand new algorithms that can learn and adapt to such changes, also known as concept drift. Proper characterization of such data with existing approaches typically requires substantial amount of labeled instances, which may be difficult, expensive, or even impractical to obtain. In this thesis, compacted object sample extraction (COMPOSE) is introduced - a computational geometry-based framework to learn from nonstationary streaming data - where labels are unavailable (or presented very sporadically) after initialization. The feasibility and performance of the algorithm are evaluated on several synthetic and real-world data sets, which present various different scenarios of initially labeled streaming environments. On carefully designed synthetic data sets, we also compare the performance of COMPOSE against the optimal Bayes classifier, as well as the arbitrary subpopulation tracker algorithm, which addresses a similar environment referred to as extreme verification latency. Furthermore, using the real-world National Oceanic and Atmospheric Administration weather data set, we demonstrate that COMPOSE is competitive even with a well-established and fully supervised nonstationary learning algorithm that receives labeled data in every batch

    Learning Patterns from Imbalanced Evolving Data Streams

    Get PDF
    Learning patterns from evolving data streams is challenging due to the characteristics of such streams: being continuous, unbounded and high speed data of non-stationary nature, which must be processed on the fly, using minimal computational resources. An additional challenge is imposed by the imbalanced data streams in many real-world applications, this difficulty becomes more prominent in multi-class learning tasks. This paper investigates the multi-class imbalance problem in non-stationary streams and develops a method to exploit realtime stream data and capture the dynamic of patterns from heterogeneous streams. In particular, we seek to extend concept drift adaptation techniques into imbalanced classes’ scenarios, and accordingly, we use an adaptive learner to classify multiple streams over a sequence of titled time windows. We include examples of the falsely classified instances in the training set, then we propose using a dynamic support threshold to discover the frequent patterns in these streams. We conduct an experiment on the car parking lots environment of a typical University with three simulated streams from sensors, smart pay stations and a mobile application. The result indicates the efficiency of applying adaptive learner approaches and modifying the training set to cope with the concept drift in multi-class imbalance scenarios, it also shows the merit of using a dynamic threshold to detect the rare patterns from evolving streams

    Towards Reliable Machine Learning in Evolving Data Streams

    Get PDF
    Data streams are ubiquitous in many areas of modern life. For example, applications in healthcare, education, finance, or advertising often deal with large-scale and evolving data streams. Compared to stationary applications, data streams pose considerable additional challenges for automated decision making and machine learning. Indeed, online machine learning methods must cope with limited memory capacities, real-time requirements, and drifts in the data generating process. At the same time, online learning methods should provide a high predictive quality, stability in the presence of input noise, and good interpretability in order to be reliably used in practice. In this thesis, we address some of the most important aspects of machine learning in evolving data streams. Specifically, we identify four open issues related to online feature selection, concept drift detection, online classification, local explainability, and the evaluation of online learning methods. In these contexts, we present new theoretical and empirical findings as well as novel frameworks and implementations. In particular, we propose new approaches for online feature selection and concept drift detection that can account for model uncertainties and thus achieve more stable results. Moreover, we introduce a new incremental decision tree that retains valuable interpretability properties and a new change detection framework that allows for more efficient explanations based on local feature attributions. In fact, this is one of the first works to address intrinsic model interpretability and local explainability in the presence of incremental updates and concept drift. Along with this thesis, we provide extensive open resources related to online machine learning. Notably, we introduce a new Python framework that enables simplified and standardized evaluations and can thus serve as a basis for more comparable online learning experiments in the future. In total, this thesis is based on six publications, five of which were peer-reviewed at the time of publication of this thesis. Our work touches all major areas of predictive modeling in data streams and proposes novel solutions for efficient, stable, interpretable and thus reliable online machine learning.Datenströme sind in vielen Bereichen des modernen Lebens allgegenwärtig. Beispielsweise haben Anwendungen im Gesundheitswesen, im Bildungswesen, im Finanzwesen oder in der Werbung häufig mit großen und sich verändernden Datenströmen zu tun. Im Vergleich zu stationären Anwendungen stellen Datenströme eine erhebliche zusätzliche Herausforderung für die automatisierte Entscheidungsfindung und das maschinelle Lernen dar. So müssen Online Machine Learning-Verfahren mit begrenzten Speicherkapazitäten, Echtzeitanforderungen und Veränderungen des Daten-generierenden Prozesses zurechtkommen. Gleichzeitig sollten Online Learning-Verfahren eine hohe Vorhersagequalität, Stabilität bei Eingangsrauschen und eine gute Interpretierbarkeit aufweisen, um in der Praxis zuverlässig eingesetzt werden zu können. In dieser Arbeit befassen wir uns mit einigen der wichtigsten Aspekte des maschinellen Lernens in sich entwickelnden Datenströmen. Insbesondere identifizieren wir vier offene Fragen im Zusammenhang mit Online Feature Selection, Concept Drift Detection, Online-Klassifikation, lokaler Erklärbarkeit und der Bewertung von Online Learning-Methoden. In diesem Kontext präsentieren wir neue theoretische und empirische Erkenntnisse sowie neue Frameworks und Implementierungen. Insbesondere schlagen wir neue Ansätze für Online Feature Selection und Concept Drift Detection vor, die Unsicherheiten im Modell berücksichtigen und dadurch stabilere Ergebnisse erzielen können. Darüber hinaus stellen wir einen neuen inkrementellen Entscheidungsbaum vor, der wertvolle Eigenschaften hinsichtlich der Interpretierbarkeit einhält, sowie ein neues Framework zur Erkennung von Veränderungen, das effizientere Erklärungen auf der Grundlage lokaler Feature Attributions ermöglicht. Tatsächlich ist dies eine der ersten Arbeiten, die sich mit intrinsischer Interpretierbarkeit von Modellen und lokaler Erklärbarkeit bei inkrementellen Aktualisierungen und Concept Drift befasst. Gemeinsam mit dieser Arbeit stellen wir umfangreiche Ressourcen für Online Machine Learning zur Verfügung. Insbesondere stellen wir ein neues Python-Framework vor, das vereinfachte und standardisierte Auswertungen ermöglicht und künftig somit als Grundlage für vergleichbare Online Learning-Experimente dienen kann. Insgesamt stützt sich diese Arbeit auf sechs Publikationen, von denen fünf zum Zeitpunkt der Veröffentlichung der Dissertation bereits im Peer-Review Format begutachtet wurden. Unsere Arbeit berührt alle wichtigen Bereiche der prädiktiven Modellierung in Datenströmen und schlägt neuartige Lösungen für effizientes, stabiles, interpretierbares und damit zuverlässiges Online Machine Learning vor
    • …
    corecore