15 research outputs found

    Sensitivity-Based Optimization of Unsupervised Drift Detection for Categorical Data Streams

    Real-world data streams are rarely characterized by stationary data distributions. Instead, the phenomenon commonly termed concept drift threatens the performance of estimators conducting inference on such data. Our contribution builds on the unsupervised concept drift detector CDCStream, which is specialized in processing categorical data directly. We propose a cooldown mechanism that reduces its excessive sensitivity in order to curb false-alarm detections. Using practical classification and regression problems, we evaluate the impact of the mechanism on estimation performance and highlight its transferability to other detection methods. Additionally, we provide an intuitive means of tuning the sensitivity of drift detectors. While our mechanism only marginally improves on the unaltered detector on publicly available benchmark data, it does so consistently in almost all configurations. In contrast, in another real-world scenario, almost none of the tested drift-detection-based approaches could outperform a baseline approach. However, potentially false-alarm detections are reduced drastically in all scenarios. Since this results in fewer signals for refitting estimators while maintaining performance better than or at least comparable to vanilla CDCStream, compute infrastructure utilization can be economized further.
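    The cooldown idea described above can be sketched generically: wrap any drift detector and suppress further alarms for a fixed number of instances after each signal. The class and parameter names below are illustrative assumptions, not taken from the paper.

```python
class CooldownDetector:
    """Wrap a drift detector, suppressing alarms for `cooldown` instances
    after each signalled drift in order to curb false alarms.
    Hypothetical sketch; the wrapped detector only needs an `update`
    method returning True when it signals drift."""

    def __init__(self, base_detector, cooldown=100):
        self.base = base_detector
        self.cooldown = cooldown
        self._remaining = 0  # instances left in the current cooldown window

    def update(self, value):
        """Feed one observation; return True only for non-suppressed drifts."""
        drift = self.base.update(value)
        if self._remaining > 0:       # still cooling down: swallow the alarm
            self._remaining -= 1
            return False
        if drift:
            self._remaining = self.cooldown
            return True
        return False
```

    Larger `cooldown` values trade detection latency for fewer refit signals, which matches the paper's intuition for tuning sensitivity.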

    Incremental Market Behavior Classification in Presence of Recurring Concepts

    In recent years, the problem of concept drift has gained importance in the financial domain. The succession of manias, panics and crashes has stressed the non-stationary nature and the likelihood of drastic structural or concept changes in the markets. Traditional systems are unable or slow to adapt to these changes. Ensemble-based systems are widely known for their good results predicting both cyclic and non-stationary data such as stock prices. In this work, we propose RCARF (Recurring Concepts Adaptive Random Forests), an ensemble tree-based online classifier that handles recurring concepts explicitly. The algorithm extends the capabilities of a version of Random Forest for evolving data streams, adding on top a mechanism to store and handle a shared collection of inactive trees, called the concept history, which holds memories of the way market operators reacted in similar circumstances. This works in conjunction with a decision strategy that reacts to drift by replacing active trees with the best available alternative: either a previously stored tree from the concept history or a newly trained background tree. Both mechanisms are designed to provide fast reaction times and are thus applicable to high-frequency data. The experimental validation of the algorithm is based on the prediction of price movement directions one second ahead in the SPDR (Standard & Poor's Depositary Receipts) S&P 500 Exchange-Traded Fund. RCARF is benchmarked against other popular methods from the incremental online machine learning literature and is able to achieve competitive results. This research was funded by the Spanish Ministry of Economy and Competitiveness under grant number ENE2014-56126-C2-2-R.
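    The drift-reaction strategy described above, choosing between a remembered tree and a fresh background tree, can be sketched as follows. The tree representation and the `score` callback are placeholder assumptions for illustration, not RCARF's actual implementation.

```python
def react_to_drift(active_tree, background_tree, concept_history,
                   recent_window, score):
    """On a drift signal, replace the active tree with the best available
    alternative: a stored tree from the concept history or the newly
    trained background tree. `score(tree, window)` evaluates a candidate
    on recent data (higher is better). Hypothetical sketch."""
    candidates = list(concept_history) + [background_tree]
    best = max(candidates, key=lambda t: score(t, recent_window))
    if best is not background_tree:
        concept_history.remove(best)       # reactivate a remembered concept
    concept_history.append(active_tree)    # retire the drifted tree for reuse
    return best
```

    Keeping retired trees in a shared history is what lets the ensemble react quickly when a previously seen market regime recurs.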

    Optimization and Prediction Techniques for Self-Healing and Self-Learning Applications in a Trustworthy Cloud Continuum

    The current IT market is more and more dominated by the “cloud continuum”. In the “traditional” cloud, computing resources are typically homogeneous in order to facilitate economies of scale. In contrast, in edge computing, computational resources are widely diverse, commonly with scarce capacities, and must be managed very efficiently due to battery constraints or other limitations. A combination of resources and services at the edge (edge computing), in the core (cloud computing), and along the data path (fog computing) is needed through a trusted cloud continuum. This requires novel solutions for the creation, optimization, management, and automatic operation of such infrastructure through new approaches such as infrastructure as code (IaC). In this paper, we analyze how artificial intelligence (AI)-based techniques and tools can enhance the operation of complex applications to support the broad and multi-stage heterogeneity of the infrastructural layer in the “computing continuum” through the enhancement of IaC optimization, IaC self-learning, and IaC self-healing. To this end, the presented work proposes a set of tools, methods, and techniques for applications’ operators to seamlessly select, combine, configure, and adapt computation resources all along the data path and support the complete service lifecycle covering: (1) optimized distributed application deployment over heterogeneous computing resources; (2) monitoring of execution platforms in real time, including continuous control and trust of the infrastructural services; (3) application deployment and adaptation while optimizing the execution; and (4) application self-recovery to avoid compromising situations that may lead to an unexpected failure. This research was funded by the European project PIACERE (Horizon 2020 research and innovation programme, under grant agreement no 101000162).

    On ensemble techniques for data stream regression

    An ensemble of learners tends to exceed the predictive performance of individual learners. This approach has been explored for both batch and online learning. Ensemble methods applied to data stream classification have been thoroughly investigated over the years, while their regression counterparts have received less attention in comparison. In this work, we discuss and analyze several techniques for generating, aggregating, and updating ensembles of regressors for evolving data streams. We investigate the impact of different strategies for inducing diversity into the ensemble by randomizing the input data (resampling, random subspaces, and random patches). On top of that, we devote particular attention to techniques that adapt the ensemble model in response to concept drifts, including adaptive window approaches, fixed periodic resets, and randomly determined windows. Extensive empirical experiments show that simple techniques can obtain similar predictive performance to sophisticated algorithms that rely on reactive adaptation (i.e., concept drift detection and recovery).
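    Two of the randomization strategies mentioned, online resampling and random feature subspaces, can be sketched as below (random patches combine both). Helper names are illustrative assumptions, not from the paper; the Poisson weighting follows the standard online-bagging idea.

```python
import math
import random

def poisson(lam=1.0, rng=random):
    """Knuth's Poisson sampler: online bagging weights each arriving
    instance by a Poisson(lam) draw instead of bootstrap resampling."""
    L, k, p = math.exp(-lam), 0, 1.0
    while p > L:
        k += 1
        p *= rng.random()
    return k - 1

def random_subspace(x, frac, rng):
    """Project an instance (dict of feature -> value) onto a random
    fraction of its features; each base regressor gets its own subspace."""
    keys = sorted(x)
    m = max(1, int(frac * len(keys)))
    return {k: x[k] for k in rng.sample(keys, m)}
```

    Each base regressor would then train on `random_subspace(x, frac, its_rng)` repeated `poisson()` times per instance, which injects diversity without ever buffering the stream.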

    Towards time-evolving analytics: Online learning for time-dependent evolving data streams

    Traditional historical data analytics is at risk in a world where volatility, uncertainty, complexity, and ambiguity are the new normal. While Streaming Machine Learning (SML) and Time-series Analytics (TSA) attack some aspects of the problem, we still need a comprehensive solution. SML trains models using fewer data and in a continuous/adaptive way, relaxing the assumption that data points are identically distributed. TSA considers temporal dependence among data points, but it assumes identical distribution. Every data scientist fights this battle with ad-hoc solutions. In this paper, we claim that, due to the temporal dependence in the data, the existing solutions do not represent robust solutions to efficiently and automatically keep models relevant even when changes occur and real-time processing is a must. We propose a novel and solid scientific foundation for Time-Evolving Analytics from this perspective. Such a framework aims to develop the logical, methodological, and algorithmic foundations for fast, scalable, and resilient analytics.

    A survey on machine learning for recurring concept drifting data streams

    The problem of concept drift has gained a lot of attention in recent years. This aspect is key in many domains exhibiting non-stationarity as well as cyclic patterns and structural breaks affecting their generative processes. In this survey, we review the relevant literature on dealing with regime changes in the behaviour of continuous data streams. The study starts with a general introduction to the field of data stream learning, describing recent works on passive or active mechanisms to adapt to or detect concept drifts, frequent challenges in this area, and related performance metrics. Then, different supervised and unsupervised approaches such as online ensembles, meta-learning, and model-based clustering that can be used to deal with seasonalities in a data stream are covered. The aim is to point out new research trends and give future research directions on the usage of machine learning techniques for data streams, which can help in the event of shifts and recurrences in continuous learning scenarios in near real time.

    A Statistical Drift Detection Method

    Machine learning models assume that data is drawn from a stationary distribution. However, in practice, challenges are imposed on models that need to make sense of fast-evolving data streams, where the content of data is changing and evolving dynamically over time. This change between the underlying distributions of the training and test datasets is called concept drift. The presence of concept drift may compromise the accuracy and reliability of prospective computational predictions. Therefore, handling concept drift is of great importance in the direction of diminishing its negative effects on a model's performance. In order to handle concept drift, one has to detect it first. Concept drift detectors have been used to accomplish this: reactive concept drift detectors try to detect drift as soon as it occurs by monitoring the performance of the underlying machine learning model. However, the importance of interpretability in machine learning indicates that it may prove useful not only to detect that drift is occurring in the data, but also to identify and analyze its causes. In this thesis, the importance of interpretability in drift detection is highlighted and the Statistical Drift Detection Method (SDDM) is presented, which detects drifts in fast-evolving data streams with fewer false positives and false negatives than the state of the art, and has the ability to interpret the cause of the concept drift. The effectiveness of the method is demonstrated by applying it to both synthetic and real-world datasets.
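    SDDM itself is not reproduced here, but the kind of interpretable, statistics-based detection the abstract describes can be illustrated with a generic two-window, per-feature divergence check: it flags drift and also points at the responsible feature. Function and parameter names are assumptions for this sketch.

```python
import math
from collections import Counter

def kl_per_feature(reference, current, eps=1e-9):
    """Compare two windows of categorical instances feature by feature.
    Returns {feature: KL(reference || current)}, so a drift alarm can be
    traced back to the feature whose distribution changed. Illustrative
    sketch in the spirit of interpretable drift detection, not SDDM."""
    features = reference[0].keys()
    scores = {}
    for f in features:
        ref_counts = Counter(x[f] for x in reference)
        cur_counts = Counter(x[f] for x in current)
        values = set(ref_counts) | set(cur_counts)
        n_ref, n_cur = len(reference), len(current)
        kl = 0.0
        for v in values:
            p = ref_counts[v] / n_ref + eps  # smoothed reference probability
            q = cur_counts[v] / n_cur + eps  # smoothed current probability
            kl += p * math.log(p / q)
        scores[f] = kl
    return scores
```

    Thresholding the maximum per-feature score yields a drift signal, while the argmax feature provides the interpretation of its cause.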

    Data stream mining: methods and challenges for handling concept drift.

    Mining and analysing streaming data is crucial for many applications, and this area of research has gained extensive attention over the past decade. However, there are several inherent problems that continue to challenge the hardware and the state-of-the-art algorithmic solutions. Examples of such problems include the unbounded size, varying speed, and unknown data characteristics of arriving instances from a data stream. The aim of this research is to portray key challenges faced by algorithmic solutions for stream mining, particularly focusing on the prevalent issue of concept drift. A comprehensive discussion of concept drift and its inherent data challenges in the context of stream mining is presented, as is a critical, in-depth review of relevant literature. Current issues with the evaluation procedure for concept drift detectors are also explored, highlighting problems such as a lack of established base datasets and the impact of temporal dependence on concept drift detection. By exposing gaps in the current literature, this study suggests recommendations for future research which should aid in the progression of stream mining and concept drift detection algorithms.

    Process-Oriented Stream Classification Pipeline: A Literature Review

    Featured Application: Nowadays, many applications and disciplines work on the basis of stream data. Common examples are the IoT sector (e.g., sensor data analysis), or video, image, and text analysis applications (e.g., in social media analytics or astronomy). With our work, we gather different approaches and terminology, and give a broad overview of the topic. Our main target groups are practitioners and newcomers to the field of data stream classification. Due to the rise of continuous data-generating applications, analyzing data streams has gained increasing attention over the past decades. A core research area in stream data is stream classification, which categorizes or detects data points within an evolving stream of observations. Areas of stream classification are diverse, ranging, e.g., from monitoring sensor data to analyzing a wide range of (social) media applications. Research in stream classification is related to developing methods that adapt to the changing and potentially volatile data stream. It focuses on individual aspects of the stream classification pipeline, e.g., designing suitable algorithm architectures, an efficient train-and-test procedure, or detecting so-called concept drifts. As a result of the many different research questions and strands, the field is challenging to grasp, especially for beginners. This survey explores, summarizes, and categorizes work within the domain of stream classification and identifies core research threads over the past few years. It is structured based on the stream classification process to facilitate coordination within this complex topic, including common application scenarios and benchmarking data sets. Thus, both newcomers to the field and experts who want to widen their scope can gain (additional) insight into this research area and find starting points and pointers to more in-depth literature on specific issues and research directions in the field.