2,861 research outputs found

    A survey on learning from imbalanced data streams: taxonomy, challenges, empirical study, and reproducible experimental framework

    Class imbalance poses new challenges when it comes to classifying data streams. Many algorithms recently proposed in the literature tackle this problem using a variety of data-level, algorithm-level, and ensemble approaches. However, there is a lack of standardized and agreed-upon procedures for evaluating these algorithms. This work presents a taxonomy of algorithms for imbalanced data streams and proposes a standardized, exhaustive, and informative experimental testbed to evaluate algorithms in a collection of diverse and challenging imbalanced data stream scenarios. The experimental study evaluates 24 state-of-the-art data stream algorithms on 515 imbalanced data streams that combine static and dynamic class imbalance ratios, instance-level difficulties, concept drift, and real-world and semi-synthetic datasets in binary and multi-class scenarios. This leads to the largest experimental study conducted so far in the data stream mining domain. We discuss the advantages and disadvantages of state-of-the-art classifiers in each of these scenarios and provide general recommendations to end-users for selecting the best algorithms for imbalanced data streams. Additionally, we formulate open challenges and future directions for this domain. Our experimental testbed is fully reproducible and easy to extend with new methods. In this way, we propose the first standardized approach to conducting experiments in imbalanced data streams that can be used by other researchers to create trustworthy and fair evaluations of newly proposed methods. Our experimental framework can be downloaded from https://github.com/canoalberto/imbalanced-streams
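    As a rough illustration of why evaluation on imbalanced streams needs care, the sketch below runs a test-then-train (prequential) loop and reports both plain accuracy and the geometric mean of per-class recalls. The choice of metric, the `MajorityClass` baseline, and the toy stream are assumptions made for this example; they are not taken from the survey's framework.

```python
from collections import Counter
import math


def prequential_eval(stream, model):
    """Test-then-train over (x, y) pairs; return (accuracy, G-mean of recalls)."""
    seen = Counter()     # instances observed per true class
    correct = Counter()  # correctly predicted instances per true class
    for x, y in stream:
        y_hat = model.predict(x)  # test on the instance first ...
        seen[y] += 1
        if y_hat == y:
            correct[y] += 1
        model.learn(x, y)         # ... then use it for training
    total = sum(seen.values())
    if total == 0:
        return 0.0, 0.0
    accuracy = sum(correct.values()) / total
    recalls = [correct[c] / seen[c] for c in seen]
    gmean = math.prod(recalls) ** (1.0 / len(recalls))
    return accuracy, gmean


class MajorityClass:
    """Hypothetical baseline: always predicts the most frequent label so far."""

    def __init__(self):
        self.counts = Counter()

    def predict(self, x):
        return self.counts.most_common(1)[0][0] if self.counts else None

    def learn(self, x, y):
        self.counts[y] += 1


# Toy stream with a 10% minority class (label 1).
toy_stream = [(i, 1 if i % 10 == 0 else 0) for i in range(1000)]
accuracy, gmean = prequential_eval(toy_stream, MajorityClass())
```

    On this toy stream the baseline never predicts the minority class correctly, so accuracy stays high while the G-mean drops to zero, which is the kind of gap a skew-aware metric is meant to expose.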

    Improving decision tree and neural network learning for evolving data-streams

    High-throughput real-time Big Data stream processing requires fast incremental algorithms that keep models consistent with the most recent data. In this scenario, Hoeffding Trees are considered the state-of-the-art single classifier for processing data streams and they are widely used in ensemble combinations. This thesis is devoted to improving the performance of machine learning/artificial intelligence algorithms on evolving data streams (that is, streams whose statistical distribution changes over time). In particular, we focus on improving the Hoeffding Tree classifier and its ensemble combinations, in order to reduce their resource consumption and response latency, achieving better throughput when processing evolving data streams. First, this thesis presents a study on using Neural Networks (NN) as an alternative method for processing data streams. The use of random features for improving NN training speed is proposed, and important issues with the use of NNs in a data stream setup are highlighted. These issues motivated this thesis to go in the direction of improving the current state-of-the-art methods: Hoeffding Trees and their ensemble combinations. Second, this thesis proposes the Echo State Hoeffding Tree (ESHT) as an extension of the Hoeffding Tree to model the time dependencies typically present in data streams. The capabilities of the newly proposed architecture on both regression and classification problems are evaluated. Third, a new methodology to improve the Adaptive Random Forest (ARF) is developed. ARF has been introduced recently, and it is considered the state-of-the-art classifier in the MOA framework (a popular framework for processing evolving data streams). This thesis proposes the Elastic Swap Random Forest, an extension to ARF that reduces the number of base learners in the ensemble to one third on average, while providing accuracy similar to that of the standard ARF with 100 trees.
Finally, the last contribution is a multi-threaded, high-performance, scalable ensemble design that is highly adaptable to a variety of hardware platforms, ranging from server-class machines to small edge computing devices (for example, the Internet of Things). The proposed design achieves throughput improvements of 85x (Intel i7), 143x (Intel Xeon parsing from memory), 10x (Jetson TX1, ARM) and 23x (X-Gene2, ARM) compared to single-threaded MOA on i7. In addition, the proposal achieves 75% parallel efficiency when using 24 cores on the Intel Xeon.
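    For readers unfamiliar with Hoeffding Trees, the following minimal sketch shows the standard Hoeffding bound that such trees use to decide when enough instances have been seen to split a leaf. It is a textbook illustration, not code from the thesis; the function names and the default `delta` are assumptions chosen for this example.

```python
import math


def hoeffding_bound(value_range, delta, n):
    """Epsilon such that, with probability 1 - delta, the true mean of a
    variable with the given range is within epsilon of the mean of n samples."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def should_split(best_gain, second_gain, n, num_classes=2, delta=1e-7):
    """Split when the best attribute beats the runner-up by more than epsilon."""
    # Information gain is bounded above by log2(num_classes).
    epsilon = hoeffding_bound(math.log2(num_classes), delta, n)
    return (best_gain - second_gain) > epsilon
```

    With a gain difference of 0.05, a leaf that has seen only a few hundred instances would keep waiting, while one that has seen tens of thousands would split, because epsilon shrinks as n grows.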

    The GC3 framework: grid density based clustering for classification of streaming data with concept drift

    Data mining is the process of discovering patterns in large sets of data. In recent years there has been a paradigm shift in how data is viewed. Instead of being considered static and available in databases, data is now regarded as a stream that continuously flows into the system. One of the challenges posed by the stream is its dynamic nature, which leads to a phenomenon known as Concept Drift. This creates a need for stream mining algorithms that are adaptive incremental learners capable of evolving and adjusting to changes in the stream. Several models have been developed to deal with Concept Drift. These systems are discussed in this thesis, and a new system, the GC3 framework, is proposed. The GC3 framework leverages the advantages of Grid Density based Clustering and Ensemble based classifiers for streaming data in order to detect the cause of the drift and deal with it accordingly. To demonstrate the functionality and performance of the framework, a synthetic data stream called the TJSS stream is developed, which embodies a variety of drift scenarios, and the model's behavior is analyzed over time. Experimental evaluation with the synthetic stream and two real-world datasets demonstrated high prediction capability of the proposed system with a small ensemble size and labeling ratio. Comparison of the methodology with a traditional static model with no drift detection capability, and with existing ensemble techniques for stream classification, showed promising results. In addition, analysis of the data structures maintained by the framework provided interpretability into the dynamics of the drift over time. The experimental analysis of the GC3 framework shows it to be promising for use in dynamic drifting environments where concepts can be incrementally learned in the presence of only partially labeled data.
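    The general idea behind grid density based clustering can be sketched as follows: points are mapped to fixed-size grid cells, and cells holding enough points are treated as dense. This is a generic illustration only and not the GC3 algorithm; `cell_size`, `min_points`, and both function names are hypothetical.

```python
from collections import Counter


def grid_cell(point, cell_size):
    """Map a numeric point to the integer coordinates of its grid cell."""
    return tuple(int(v // cell_size) for v in point)


def dense_cells(points, cell_size, min_points):
    """Return the set of grid cells that contain at least min_points points."""
    counts = Counter(grid_cell(p, cell_size) for p in points)
    return {cell for cell, n in counts.items() if n >= min_points}


sample = [(0.1, 0.2), (0.3, 0.4), (0.2, 0.1), (5.5, 5.5)]
dense = dense_cells(sample, cell_size=1.0, min_points=2)
```

    In a streaming setting, an instance that lands outside every dense cell could be flagged as a candidate for a new or drifting concept, though how GC3 itself makes that decision is not described in the abstract.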

    A New Large Scale SVM for Classification of Imbalanced Evolving Streams

    Classification from imbalanced evolving streams poses a combined challenge of class imbalance and concept drift (CI-CD). Moreover, the state of imbalance is itself dynamic, which is a kind of virtual concept drift. Imbalanced distributions and concept drift hinder the online learner's performance, whether they occur together or individually. A weighted hybrid online oversampling approach, the "weighted online oversampling large scale support vector machine" (WOOLASVM), is proposed in this work to address this combined problem. WOOLASVM is an SVM active learning approach with new boundary weighting strategies: (i) dynamic oversampling of the current boundary and (ii) dynamic weighting of the cost parameter of the SVM objective function. Thus, at any time step, WOOLASVM maintains balanced class distributions so that the CI-CD problem does not hinder the online learner's performance. In extensive experiments on synthetic and real-world streams with static and dynamic states of imbalance, WOOLASVM exhibits better online classification performance than other state-of-the-art methods.
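    One common way to weight a cost parameter dynamically is to scale it by the inverse of each class's recent frequency. The sketch below tracks exponentially decayed class counts and derives a per-class cost multiplier from them. The weighting rule, the `decay` parameter, and the class name are assumptions for illustration; the abstract does not specify the rule WOOLASVM actually uses.

```python
from collections import Counter


class RunningClassWeights:
    """Tracks decayed class counts and derives per-class cost multipliers."""

    def __init__(self, decay=0.99):
        self.decay = decay       # values below 1.0 forget old instances
        self.counts = Counter()

    def update(self, label):
        # Fade every existing count, then credit the newly observed label.
        for c in self.counts:
            self.counts[c] *= self.decay
        self.counts[label] += 1.0

    def cost(self, label, base_c=1.0):
        """Scale base_c so that rarer classes receive a larger cost."""
        if self.counts[label] == 0:
            return base_c        # unseen class: fall back to the base cost
        total = sum(self.counts.values())
        return base_c * total / (len(self.counts) * self.counts[label])


weights = RunningClassWeights(decay=1.0)  # no forgetting, for a simple check
for label in [0] * 9 + [1]:
    weights.update(label)
minority_cost = weights.cost(1)
majority_cost = weights.cost(0)
```

    With a decay below 1.0 the multipliers follow the current imbalance ratio rather than the lifetime one, which matters when the ratio itself drifts.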

    Ensemble based on randomised neural networks for online data stream regression in presence of concept drift

    The big data paradigm has posed new challenges for Machine Learning algorithms, such as analysing continuous flows of data, in the form of data streams, and dealing with the evolving nature of the data, which causes a phenomenon often referred to in the literature as concept drift. Concept drift is caused by inconsistencies between the optimal hypotheses in two subsequent chunks of data, whereby the concept underlying a given process evolves over time. This can happen due to several factors, including changes in consumer preference, economic dynamics, or environmental conditions. This thesis explores the problem of data stream regression in the presence of concept drift. This problem requires computationally efficient algorithms that are able to adapt to the various types of drift that may affect the data. The development of effective algorithms for data streams with concept drift requires several steps that are discussed in this research. The first is related to the datasets required to assess the algorithms. In general, it is not possible to determine the occurrence of concept drift in real-world datasets; therefore, synthetic datasets in which the various types of concept drift can be simulated are required. The second issue is related to the choice of algorithm. Ensemble algorithms show many advantages for dealing with concept-drifting data streams, including flexibility, computational efficiency and high accuracy. For the design of an effective ensemble, this research analyses the use of randomised Neural Networks as base models, along with their optimisation. The optimisation of randomised Neural Networks involves designing and tuning hyperparameters that may substantially affect their performance. The optimisation of the base models is an important aspect of building highly accurate and computationally efficient ensembles.
To cope with concept drift, existing methods either require setting fixed updating points, which may result in unnecessary computations or a slow reaction to concept drift, or rely on drift detection mechanisms, which may be ineffective due to the difficulty of detecting drift in real applications. Therefore, the research contributions of this thesis include the development of a new approach for synthetic dataset generation; the development of a new hyperparameter optimisation algorithm that reduces the search effort and the need for prior assumptions compared to existing methods; the analysis of the effects of randomised Neural Network hyperparameters; and the development of a new ensemble algorithm based on a bagging meta-model that reduces the computational effort relative to existing methods and uses an innovative updating mechanism to cope with concept drift. The algorithms have been tested on synthetic datasets and validated on four real-world datasets from various application domains.
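    To make the idea of simulating drift on synthetic data concrete, here is a minimal sketch of a regression stream with one abrupt drift: the linear concept changes at a known time step. It is not the generation approach developed in the thesis; the coefficients, noise level, and function name are assumptions chosen for this example.

```python
import random


def drifting_regression_stream(n, drift_at, seed=0):
    """Yield (t, x, y) where the linear concept changes abruptly at drift_at."""
    rng = random.Random(seed)  # seeded so the stream is reproducible
    for t in range(n):
        x = rng.uniform(-1.0, 1.0)
        # Concept before the drift: y = 2x. Concept after: y = -x + 0.5.
        slope, intercept = (2.0, 0.0) if t < drift_at else (-1.0, 0.5)
        y = slope * x + intercept + rng.gauss(0.0, 0.05)
        yield t, x, y


stream = list(drifting_regression_stream(1000, drift_at=500))
```

    Because the drift point is known, an adaptive learner's error can be compared just before and just after step 500, which is not possible on real-world data where the true drift locations are unknown.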