33 research outputs found

    Learning from Data Streams with Randomized Forests

    Get PDF
    Non-stationary streaming data poses a familiar challenge in machine learning: the need to obtain fast and accurate predictions. A data stream is a continuously generated sequence of data, with data typically arriving rapidly. They are often characterised by a non-stationary generative process, with concept drift occurring as the process changes. Such processes are commonly seen in the real world, such as in advertising, shopping trends, environmental conditions, electricity monitoring and traffic monitoring. Typical stationary algorithms are ill-suited for use with concept drifting data, thus necessitating more targeted methods. Tree-based methods are a popular approach to this problem, traditionally focussing on the use of the Hoeffding bound in order to guarantee performance relative to a stationary scenario. However, there are limited single learners available for regression scenarios, and those that do exist often struggle to choose between similarly discriminative splits, leading to longer training times and worse performance. This limited pool of single learners in turn hampers the performance of ensemble approaches in which they act as base learners. In this thesis we seek to remedy this gap in the literature, developing methods which focus on increasing randomization to both improve predictive performance and reduce the training times of tree-based ensemble methods. In particular, we have chosen to investigate the use of randomization as it is known to be able to improve generalization error in ensembles, and is also expected to lead to fast training times, thus being a natural method of handling the problems typically experienced by single learners. We begin in a regression scenario, introducing the Adaptive Trees for Streaming with Extreme Randomization (ATSER) algorithm; a partially randomized approach based on the concept of Extremely Randomized (extra) trees. The ATSER algorithm incrementally trains trees, using the Hoeffding bound to select the best of a random selection of splits. Simultaneously, the trees also detect and adapt to changes in the data stream. Unlike many traditional streaming algorithms ATSER trees can easily be extended to include nominal features. We find that compared to other contemporary methods ensembles of ATSER trees lead to improved predictive performance whilst also reducing run times. We then demonstrate the Adaptive Categorisation Trees for Streaming with Extreme Randomization (ACTSER) algorithm, an adaption of the ATSER algorithm to the more traditional categorization scenario, again showing improved predictive performance and reduced runtimes. The inclusion of nominal features is particularly novel in this setting since typical categorization approaches struggle to handle them. Finally we examine a completely randomized scenario, where an ensemble of trees is generated prior to having access to the data stream, while also considering multivariate splits in addition to the traditional axis-aligned approach. We find that through the combination of a forgetting mechanism in linear models and dynamic weighting for ensemble members, we are able to avoid explicitly testing for concept drift. This leads to fast ensembles with strong predictive performance, whilst also requiring fewer parameters than other contemporary methods. For each of the proposed methods in this thesis, we demonstrate empirically that they are effective over a variety of different non-stationary data streams, including on multiple types of concept drift. Furthermore, in comparison to other contemporary data streaming algorithms, we find the biggest improvements in performance are on noisy data streams.Engineers Gat

    Improving decision tree and neural network learning for evolving data-streams

    Get PDF
    High-throughput real-time Big Data stream processing requires fast incremental algorithms that keep models consistent with most recent data. In this scenario, Hoeffding Trees are considered the state-of-the-art single classifier for processing data streams and they are widely used in ensemble combinations. This thesis is devoted to the improvement of the performance of algorithms for machine learning/artificial intelligence on evolving data streams. In particular, we focus on improving the Hoeffding Tree classifier and its ensemble combinations, in order to reduce its resource consumption and its response time latency, achieving better throughput when processing evolving data streams. First, this thesis presents a study on using Neural Networks (NN) as an alternative method for processing data streams. The use of random features for improving NNs training speed is proposed and important issues are highlighted about the use of NN on a data stream setup. These issues motivated this thesis to go in the direction of improving the current state-of-the-art methods: Hoeffding Trees and their ensemble combinations. Second, this thesis proposes the Echo State Hoeffding Tree (ESHT), as an extension of the Hoeffding Tree to model time-dependencies typically present in data streams. The capabilities of the new proposed architecture on both regression and classification problems are evaluated. Third, a new methodology to improve the Adaptive Random Forest (ARF) is developed. ARF has been introduced recently, and it is considered the state-of-the-art classifier in the MOA framework (a popular framework for processing evolving data streams). This thesis proposes the Elastic Swap Random Forest, an extension to ARF that reduces the number of base learners in the ensemble down to one third on average, while providing similar accuracy than the standard ARF with 100 trees. And finally, a last contribution on a multi-threaded high performance scalable ensemble design that is highly adaptable to a variety of hardware platforms, ranging from server-class to edge computing. The proposed design achieves throughput improvements of 85x (Intel i7), 143x (Intel Xeon parsing from memory), 10x (Jetson TX1, ARM) and 23x (X-Gene2, ARM) compared to single-threaded MOA on i7. In addition, the proposal achieves 75% parallel efficiency when using 24 cores on the Intel Xeon.Procesar grandes flujos de datos (Big Data Streams, BDS) en tiempo real requiere el uso de algoritmos incrementales rápidos que mantengan los modelos consistentes con los datos más recientes. En este escenario, los Hoeffding Trees (HT) se consideran el clasificador simple más avanzado para procesar BDS, razon por la cual son ampliamente usados como base a la hora de combinar clasificadores en Ensembles. Esta tesis está dedicada a la mejora del rendimiento de algoritmos para Machine Learning/Iteligencia Artificial en BDS que evolucionan con el tiempo (es decir, BDS cuya distribución estadística cambia con el tiempo). En particular, nuestro objetivo es mejorar el Hoeffding Tree y sus combinaciones en Ensembles, con el objetivo de reducir el consumo de recursos y la latencia en el tiempo de respuesta, logrando un mejor rendimiento al procesar BDS que evolucionan en el tiempo. Primero, se presenta un estudio sobre el uso de redes neuronales (NN) con parámetros aleatorios como un método alternativo para procesar BDS con el objetivo de mejorar la velocidad de entrenamiento de Nns. También se destacan problemas importantes derivados del uso de NN para BDS. Como consecuencia, esta tesis tomo la dirección de mejorar los métodos de vanguardia en BDS: Hoeffding Trees y sus combinaciones en Ensembles. Segundo, se propone el Echo State Hoeffding Tree (ESHT), como una extensión del HT para modelar las dependencias temporales típicamente presentes en BDS. La nueva arquitectura propuesta se evalúa tanto en problemas de regresión como de clasificación. Tercero, se propone una extensión para el Adaptive Random Forest (ARF), publicado recientemente y considerado como el clasificador mas potente implementado en MOA (un framework muy popular para procesar BDS). Proponemos el Elastic Swap Random Forest para reducir el número de clasificadores en el ensemble a un tercio en promedio, al tiempo se mantiene un accuracy similar a la de un ARF estándar con 100 árboles. Finalmente, la última contribución de esta tesis es una arquitectura de Ensembles multi hilo para procesar BDS. Nuestro diseño es altamente adaptable a una variedad de plataformas de hardware, que van desde servidores hasta pequeños dispositivos en el Edge Computing (pej, Internet de las Cosas). El diseño propuesto logra mejoras de rendimiento de 85x (Intel i7), 143x (análisis de Intel Xeon desde la memoria), 10x (Jetson TX1, ARM) y 23x (X-Gene2, ARM) en comparación con MOA (un solo proceso) en un Intel i7. Además, la propuesta logra una eficiencia paralela del 75 \% cuando se usan 24 núcleos en el Intel Xeon.Postprint (published version

    A fuzzy kernel c-means clustering model for handling concept drift in regression

    Full text link
    © 2017 IEEE. Concept drift, given the huge volume of high-speed data streams, requires traditional machine learning models to be self-adaptive. Techniques to handle drift are especially needed in regression cases for a wide range of applications in the real world. There is, however, a shortage of research on drift adaptation for regression cases in the literature. One of the main obstacles to further research is the resulting model complexity when regression methods and drift handling techniques are combined. This paper proposes a self-adaptive algorithm, based on a fuzzy kernel c-means clustering approach and a lazy learning algorithm, called FKLL, to handle drift in regression learning. Using FKLL, drift adaptation first updates the learning set using lazy learning, then fuzzy kernel c-means clustering is used to determine the most relevant learning set. Experiments show that the FKLL algorithm is better able to respond to drift as soon as the learning sets are updated, and is also suitable for dealing with reoccurring drift, when compared to the original lazy learning algorithm and other state-of-the-art regression methods

    On ensemble techniques for data stream regression

    Get PDF
    An ensemble of learners tends to exceed the predictive performance of individual learners. This approach has been explored for both batch and online learning. Ensembles methods applied to data stream classification were thoroughly investigated over the years, while their regression counterparts received less attention in comparison. In this work, we discuss and analyze several techniques for generating, aggregating, and updating ensembles of regressors for evolving data streams. We investigate the impact of different strategies for inducing diversity into the ensemble by randomizing the input data (resampling, random subspaces and random patches). On top of that, we devote particular attention to techniques that adapt the ensemble model in response to concept drifts, including adaptive window approaches, fixed periodical resets and randomly determined windows. Extensive empirical experiments show that simple techniques can obtain similar predictive performance to sophisticated algorithms that rely on reactive adaptation (i.e., concept drift detection and recovery)

    Effective Use Methods for Continuous Sensor Data Streams in Manufacturing Quality Control

    Get PDF
    This work outlines an approach for managing sensor data streams of continuous numerical data in product manufacturing settings, emphasizing statistical process control, low computational and memory overhead, and saving information necessary to reduce the impact of nonconformance to quality specifications. While there is extensive literature, knowledge, and documentation about standard data sources and databases, the high volume and velocity of sensor data streams often makes traditional analysis unfeasible. To that end, an overview of data stream fundamentals is essential. An analysis of commonly used stream preprocessing and load shedding methods follows, succeeded by a discussion of aggregation procedures. Stream storage and querying systems are the next topics. Further, existing machine learning techniques for data streams are presented, with a focus on regression. Finally, the work describes a novel methodology for managing sensor data streams in which data stream management systems save and record aggregate data from small time intervals, and individual measurements from the stream that are nonconforming. The aggregates shall be continually entered into control charts and regressed on. To conserve memory, old data shall be periodically reaggregated at higher levels to reduce memory consumption

    Towards Reliable Machine Learning in Evolving Data Streams

    Get PDF
    Data streams are ubiquitous in many areas of modern life. For example, applications in healthcare, education, finance, or advertising often deal with large-scale and evolving data streams. Compared to stationary applications, data streams pose considerable additional challenges for automated decision making and machine learning. Indeed, online machine learning methods must cope with limited memory capacities, real-time requirements, and drifts in the data generating process. At the same time, online learning methods should provide a high predictive quality, stability in the presence of input noise, and good interpretability in order to be reliably used in practice. In this thesis, we address some of the most important aspects of machine learning in evolving data streams. Specifically, we identify four open issues related to online feature selection, concept drift detection, online classification, local explainability, and the evaluation of online learning methods. In these contexts, we present new theoretical and empirical findings as well as novel frameworks and implementations. In particular, we propose new approaches for online feature selection and concept drift detection that can account for model uncertainties and thus achieve more stable results. Moreover, we introduce a new incremental decision tree that retains valuable interpretability properties and a new change detection framework that allows for more efficient explanations based on local feature attributions. In fact, this is one of the first works to address intrinsic model interpretability and local explainability in the presence of incremental updates and concept drift. Along with this thesis, we provide extensive open resources related to online machine learning. Notably, we introduce a new Python framework that enables simplified and standardized evaluations and can thus serve as a basis for more comparable online learning experiments in the future. In total, this thesis is based on six publications, five of which were peer-reviewed at the time of publication of this thesis. Our work touches all major areas of predictive modeling in data streams and proposes novel solutions for efficient, stable, interpretable and thus reliable online machine learning.Datenströme sind in vielen Bereichen des modernen Lebens allgegenwärtig. Beispielsweise haben Anwendungen im Gesundheitswesen, im Bildungswesen, im Finanzwesen oder in der Werbung häufig mit großen und sich verändernden Datenströmen zu tun. Im Vergleich zu stationären Anwendungen stellen Datenströme eine erhebliche zusätzliche Herausforderung für die automatisierte Entscheidungsfindung und das maschinelle Lernen dar. So müssen Online Machine Learning-Verfahren mit begrenzten Speicherkapazitäten, Echtzeitanforderungen und Veränderungen des Daten-generierenden Prozesses zurechtkommen. Gleichzeitig sollten Online Learning-Verfahren eine hohe Vorhersagequalität, Stabilität bei Eingangsrauschen und eine gute Interpretierbarkeit aufweisen, um in der Praxis zuverlässig eingesetzt werden zu können. In dieser Arbeit befassen wir uns mit einigen der wichtigsten Aspekte des maschinellen Lernens in sich entwickelnden Datenströmen. Insbesondere identifizieren wir vier offene Fragen im Zusammenhang mit Online Feature Selection, Concept Drift Detection, Online-Klassifikation, lokaler Erklärbarkeit und der Bewertung von Online Learning-Methoden. In diesem Kontext präsentieren wir neue theoretische und empirische Erkenntnisse sowie neue Frameworks und Implementierungen. Insbesondere schlagen wir neue Ansätze für Online Feature Selection und Concept Drift Detection vor, die Unsicherheiten im Modell berücksichtigen und dadurch stabilere Ergebnisse erzielen können. Darüber hinaus stellen wir einen neuen inkrementellen Entscheidungsbaum vor, der wertvolle Eigenschaften hinsichtlich der Interpretierbarkeit einhält, sowie ein neues Framework zur Erkennung von Veränderungen, das effizientere Erklärungen auf der Grundlage lokaler Feature Attributions ermöglicht. Tatsächlich ist dies eine der ersten Arbeiten, die sich mit intrinsischer Interpretierbarkeit von Modellen und lokaler Erklärbarkeit bei inkrementellen Aktualisierungen und Concept Drift befasst. Gemeinsam mit dieser Arbeit stellen wir umfangreiche Ressourcen für Online Machine Learning zur Verfügung. Insbesondere stellen wir ein neues Python-Framework vor, das vereinfachte und standardisierte Auswertungen ermöglicht und künftig somit als Grundlage für vergleichbare Online Learning-Experimente dienen kann. Insgesamt stützt sich diese Arbeit auf sechs Publikationen, von denen fünf zum Zeitpunkt der Veröffentlichung der Dissertation bereits im Peer-Review Format begutachtet wurden. Unsere Arbeit berührt alle wichtigen Bereiche der prädiktiven Modellierung in Datenströmen und schlägt neuartige Lösungen für effizientes, stabiles, interpretierbares und damit zuverlässiges Online Machine Learning vor

    Ensembles of Adaptive Model Rules from High-Speed Data Streams

    Get PDF
    The volume and velocity of data is increasing at astonishing rates. In order to extract knowledge from this huge amount of information there is a need for efficient on-line learning algorithms. Rule-based algorithms produce models that are easy to understand and can be used almost offhand. Ensemble methods combine several predicting models to improve the quality of prediction. In this paper, a new on-line ensemble method that combines a set of rule-based models is proposed to solve regression problems from data streams. Experimental results using synthetic and real time-evolving data streams show the proposed method significantly improves the performance of the single rule-based learner, and outperforms two state-of-the-art regression algorithms for data streams

    Anomaly Detections for Manufacturing Systems Based on Sensor Data—Insights into Two Challenging Real-World Production Settings

    Get PDF
    To build, run, and maintain reliable manufacturing machines, the condition of their components has to be continuously monitored. When following a fine-grained monitoring of these machines, challenges emerge pertaining to the (1) feeding procedure of large amounts of sensor data to downstream processing components and the (2) meaningful analysis of the produced data. Regarding the latter aspect, manifold purposes are addressed by practitioners and researchers. Two analyses of real-world datasets that were generated in production settings are discussed in this paper. More specifically, the analyses had the goals (1) to detect sensor data anomalies for further analyses of a pharma packaging scenario and (2) to predict unfavorable temperature values of a 3D printing machine environment. Based on the results of the analyses, it will be shown that a proper management of machines and their components in industrial manufacturing environments can be efficiently supported by the detection of anomalies. The latter shall help to support the technical evangelists of the production companies more properly

    A Survey on Semi-Supervised Learning for Delayed Partially Labelled Data Streams

    Full text link
    Unlabelled data appear in many domains and are particularly relevant to streaming applications, where even though data is abundant, labelled data is rare. To address the learning problems associated with such data, one can ignore the unlabelled data and focus only on the labelled data (supervised learning); use the labelled data and attempt to leverage the unlabelled data (semi-supervised learning); or assume some labels will be available on request (active learning). The first approach is the simplest, yet the amount of labelled data available will limit the predictive performance. The second relies on finding and exploiting the underlying characteristics of the data distribution. The third depends on an external agent to provide the required labels in a timely fashion. This survey pays special attention to methods that leverage unlabelled data in a semi-supervised setting. We also discuss the delayed labelling issue, which impacts both fully supervised and semi-supervised methods. We propose a unified problem setting, discuss the learning guarantees and existing methods, explain the differences between related problem settings. Finally, we review the current benchmarking practices and propose adaptations to enhance them
    corecore