62 research outputs found
Adaptive random forests for evolving data stream classification
Random forests is currently one of the most used machine learning algorithms in the non-streaming (batch) setting. This preference is attributable to its high learning performance and low demands with respect to input preparation and hyper-parameter tuning. However, in the challenging context of evolving data streams, there is no random forests algorithm that can be considered state-of-the-art in comparison to bagging and boosting based algorithms. In this work, we present the adaptive random forest (ARF) algorithm for classification of evolving data streams. In contrast to previous attempts at replicating random forests for data stream learning, ARF includes an effective resampling method and adaptive operators that can cope with different types of concept drift without complex optimizations for different data sets. We present experiments with a parallel implementation of ARF, which shows no degradation in classification performance in comparison to a serial implementation, since trees and adaptive operators are independent from one another. Finally, we compare ARF with state-of-the-art algorithms in a traditional test-then-train evaluation and a novel delayed labelling evaluation, and show that ARF is accurate and uses a feasible amount of resources.
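The test-then-train evaluation mentioned in this abstract can be illustrated with a minimal sketch. The learner below is a toy majority-class stand-in, not ARF, and the names `MajorityClassLearner`, `prequential_accuracy`, and `toy_stream` are hypothetical helpers introduced only for this example.

```python
import random
from collections import Counter


class MajorityClassLearner:
    """Toy incremental learner (an assumption, not ARF): predicts the most frequent label seen."""

    def __init__(self):
        self.counts = Counter()

    def predict(self, x):
        # Before any training there is nothing to predict; None counts as an error.
        return self.counts.most_common(1)[0][0] if self.counts else None

    def learn(self, x, y):
        self.counts[y] += 1


def prequential_accuracy(stream, learner):
    """Test-then-train: every instance is used for testing first, then for training."""
    correct = total = 0
    for x, y in stream:
        correct += int(learner.predict(x) == y)  # test on the unseen instance
        learner.learn(x, y)                      # then train on it
        total += 1
    return correct / total if total else 0.0


def toy_stream(n=1000, seed=0):
    """Hypothetical synthetic stream: label 1 with probability 0.8; the feature is ignored."""
    rng = random.Random(seed)
    for _ in range(n):
        yield (rng.random(),), int(rng.random() < 0.8)


accuracy = prequential_accuracy(toy_stream(), MajorityClassLearner())
print(f"test-then-train accuracy: {accuracy:.3f}")
```

Because each instance is scored before the learner sees its label, no separate hold-out set is needed, which is why this scheme suits unbounded streams.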
Scikit-Multiflow: A Multi-output Streaming Framework
Scikit-multiflow is a multi-output/multi-label and stream data mining
framework for the Python programming language. Conceived to serve as a platform
to encourage democratization of stream learning research, it provides multiple
state of the art methods for stream learning, stream generators and evaluators.
scikit-multiflow builds upon popular open source frameworks including
scikit-learn, MOA and MEKA. Development follows the FOSS principles and quality
is enforced by complying with PEP8 guidelines and using continuous integration
and automatic testing. The source code is publicly available at
https://github.com/scikit-multiflow/scikit-multiflow.
Comment: 5 pages, Open Source Softwar
Online GentleAdaBoost -- Technical Report
We study the online variant of GentleAdaboost, where we combine weak
learners into a strong learner in an online fashion. We provide an approach to
extend the batch algorithm to the online setting, with theoretical justification
through application of line search. Finally, we compare our online boosting
approach with other online approaches across a variety of benchmark datasets.
Text classification supervised algorithms with term frequency inverse document frequency and global vectors for word representation: a comparative study
Over the past two decades, the quantity of text documents stored digitally has risen. Text categorization is the automated organization of those documents into a set of predefined categories so that they can be preserved and sorted more efficiently. Identifying appropriate structures, architectures, and methods for text classification remains a challenge for researchers, given its significant impact on content management, contextual search, opinion mining, product review analysis, spam filtering, and text sentiment mining. This study analyzes the generic categorization strategy and examines supervised machine learning approaches and their ability to comprehend complex models and nonlinear data interactions. Among these methods are k-nearest neighbors (KNN), support vector machine (SVM), and ensemble learning algorithms employing various evaluation techniques. Finally, the constraints of each technique are evaluated, along with how each can be applied to real-life situations.
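As a rough illustration of the term frequency inverse document frequency weighting named in the title, combined with the simplest KNN case (k = 1), the sketch below uses only the standard library. The weighting variant (relative term frequency times log(N/df)), the four-document corpus, and all function names are assumptions made for this example and are not taken from the study.

```python
import math
from collections import Counter


def fit_idf(docs):
    """Inverse document frequency log(N / df) for every term in the training documents."""
    n_docs = len(docs)
    df = Counter(term for doc in docs for term in set(doc.lower().split()))
    return {term: math.log(n_docs / count) for term, count in df.items()}


def tfidf(doc, idf):
    """Sparse TF-IDF vector as a dict; terms unseen during fitting are ignored."""
    tokens = doc.lower().split()
    tf = Counter(tokens)
    return {t: (c / len(tokens)) * idf[t] for t, c in tf.items() if t in idf}


def cosine(u, v):
    """Cosine similarity between two sparse vectors (0.0 if either is all zeros)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0


def nearest_label(query, train_docs, train_labels, idf):
    """1-nearest-neighbour classification by cosine similarity of TF-IDF vectors."""
    q = tfidf(query, idf)
    sims = [cosine(q, tfidf(doc, idf)) for doc in train_docs]
    best = max(range(len(sims)), key=sims.__getitem__)
    return train_labels[best]


# Hypothetical toy corpus, for illustration only.
train_docs = [
    "cheap pills buy now",
    "buy cheap watches now",
    "meeting agenda for monday",
    "project meeting notes attached",
]
train_labels = ["spam", "spam", "ham", "ham"]
idf = fit_idf(train_docs)

print(nearest_label("cheap watches", train_docs, train_labels, idf))
print(nearest_label("monday meeting", train_docs, train_labels, idf))
```

Terms that occur in every training document receive an IDF of zero under this variant, so they do not influence the similarity.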
Rebalancing Learning on Evolving Data Streams
Nowadays, every device connected to the Internet generates an ever-growing
stream of data (formally, unbounded). Machine learning on unbounded data
streams is a grand challenge due to its resource constraints. In fact, standard
machine learning techniques cannot deal with data whose statistics are
subject to gradual or sudden changes without any warning. Massive Online
Analysis (MOA) is the collective name, as well as a software library, for new
learners that are able to manage data streams. In this paper, we present a
research study on streaming rebalancing. Data streams can be imbalanced
just like static data, but there is no method to rebalance them incrementally, one
element at a time. For this reason, we propose a new streaming approach able to
rebalance data streams online. Our new methodology is evaluated on
synthetically generated datasets using prequential evaluation in order to
demonstrate that it outperforms the existing approaches.
On the performance of deep learning models for time series classification in streaming
Processing data streams arriving at high speed requires the development of
models that can provide fast and accurate predictions. Although deep neural
networks are the state-of-the-art for many machine learning tasks, their
performance in real-time data streaming scenarios is a research area that has
not yet been fully addressed. Nevertheless, there have been recent efforts to
adapt complex deep learning models for streaming tasks by reducing their
processing time. The design of the asynchronous dual-pipeline deep learning
framework allows predicting over incoming instances and updating the model
simultaneously using two separate layers. The aim of this work is to assess the
performance of different types of deep architectures for data streaming
classification using this framework. We evaluate models such as multi-layer
perceptrons, recurrent, convolutional and temporal convolutional neural
networks over several time-series datasets that are simulated as streams. The
obtained results indicate that convolutional architectures achieve a higher
performance in terms of accuracy and efficiency.
Comment: Paper submitted to the 15th International Conference on Soft
Computing Models in Industrial and Environmental Applications (SOCO 2020)
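The idea of predicting and updating simultaneously in two separate layers can be sketched with two threads. This is only an illustration under stated assumptions, not the framework evaluated in the paper: `CentroidModel` is a toy one-dimensional nearest-centroid classifier standing in for a neural network, and `DualPipeline` with its methods is a hypothetical name introduced here.

```python
import copy
import queue
import threading


class CentroidModel:
    """Toy stand-in for a deep model: 1-D nearest-centroid classifier."""

    def __init__(self):
        self.sums = {}
        self.counts = {}

    def update(self, x, y):
        self.sums[y] = self.sums.get(y, 0.0) + x
        self.counts[y] = self.counts.get(y, 0) + 1

    def predict(self, x):
        if not self.counts:
            return None  # nothing learned yet
        return min(self.counts, key=lambda y: abs(x - self.sums[y] / self.counts[y]))


class DualPipeline:
    """Training layer runs in its own thread; prediction layer reads the latest snapshot."""

    def __init__(self):
        self._published = CentroidModel()  # snapshot used for predictions
        self._lock = threading.Lock()
        self._labelled = queue.Queue()
        self._trainer = threading.Thread(target=self._train_loop, daemon=True)
        self._trainer.start()

    def _train_loop(self):
        working = CentroidModel()  # private copy owned by the training layer
        while True:
            item = self._labelled.get()
            if item is None:  # sentinel: stop training
                break
            working.update(*item)
            snapshot = copy.deepcopy(working)
            with self._lock:  # publish a fresh, immutable-by-convention snapshot
                self._published = snapshot

    def predict(self, x):
        with self._lock:
            model = self._published
        return model.predict(x)  # never waits for an in-progress update

    def submit_labelled(self, x, y):
        self._labelled.put((x, y))

    def close(self):
        self._labelled.put(None)
        self._trainer.join()


pipeline = DualPipeline()
for x, y in [(0.0, "a"), (0.1, "a"), (1.0, "b")]:
    pipeline.submit_labelled(x, y)
pipeline.close()  # wait until all submitted instances have been learned
print(pipeline.predict(0.05), pipeline.predict(0.9))
```

Publishing a copy of the model means predictions only hold the lock long enough to read a reference, at the cost of serving slightly stale parameters while an update is in progress.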
Data streams classification using deep learning under different speeds and drifts
Processing data streams arriving at high speed requires the development of models that can provide fast and accurate
predictions. Although deep neural networks are the state-of-the-art for many machine learning tasks, their performance in
real-time data streaming scenarios is a research area that has not yet been fully addressed. Nevertheless, much effort has
been put into the adaption of complex deep learning (DL) models to streaming tasks by reducing the processing time. The
design of the asynchronous dual-pipeline DL framework allows making predictions of incoming instances and updating the
model simultaneously, using two separate layers. The aim of this work is to assess the performance of different types of DL
architectures for data streaming classification using this framework. We evaluate models such as multi-layer perceptrons,
recurrent, convolutional and temporal convolutional neural networks over several time series datasets that are simulated as
streams at different speeds. In addition, we evaluate how the different architectures react to concept drifts typically found in
evolving data streams. The obtained results indicate that convolutional architectures achieve a higher performance in terms
of accuracy and efficiency, but are also the most sensitive to concept drifts.
Ministerio de Ciencia, Innovación y Universidades PID2020-117954RB-C22; Junta de Andalucía US-1263341; Junta de Andalucía P18-RT-277