
    Novel support vector machines for diverse learning paradigms

    This dissertation introduces novel support vector machines (SVM) for the following traditional and non-traditional learning paradigms: online classification, multi-target regression, multiple-instance classification, and data stream classification. Three multi-target support vector regression (SVR) models are first presented. The first involves building independent, single-target SVR models for each target. The second builds an ensemble of randomly chained models using the first single-target method as a base model. The third calculates the targets' correlations and forms a maximum correlation chain, which is used to build a single chained SVR model, improving the model's prediction performance while reducing computational complexity. Under the multi-instance paradigm, a novel SVM multiple-instance formulation and an algorithm with a bag-representative selector, named Multi-Instance Representative SVM (MIRSVM), are presented. The contribution trains the SVM based on bag-level information and is able to identify the instances that highly impact classification, i.e. bag representatives, for both positive and negative bags, while finding the optimal class-separating hyperplane. Unlike other multi-instance SVM methods, this approach eliminates possible class imbalance issues by allowing both positive and negative bags to have at most one representative each, these being the instances that contribute most to the model. Due to the shortcomings of current popular SVM solvers, especially in the context of large-scale learning, the third contribution presents a novel stochastic, i.e. online, learning algorithm for solving the L1-SVM problem in the primal domain, dubbed OnLine Learning Algorithm using Worst-Violators (OLLAWV). Unlike other stochastic methods, this algorithm provides a novel stopping criterion and eliminates the need for a regularization term, relying on early stopping instead. Because of these characteristics, OLLAWV was shown to efficiently produce sparse models while maintaining competitive accuracy. OLLAWV's online nature and success in traditional classification inspired its implementation, along with its predecessor, OnLine Learning Algorithm - List 2 (OLLA-L2), in the batch data stream classification setting. These two algorithms were chosen because, unlike other existing methods, their properties are a natural remedy for the time and memory constraints that arise from the data stream problem. OLLA-L2's low space complexity addresses the memory constraints imposed by the data stream setting, while OLLAWV's fast run time, early self-stopping capability, and ability to produce sparse models satisfy both memory and time constraints. Preliminary results showed OLLAWV to outperform its predecessor, so it was chosen for the final set of experiments against current popular data stream methods. Rigorous experimental studies and statistical analyses over various metrics and datasets were conducted in order to comprehensively compare the proposed solutions against modern, widely used methods from all paradigms. The experimental studies and analyses confirm that the proposals achieve better performance and more scalable solutions than the compared methods, making them competitive in their respective fields.
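    As a rough illustration of the maximum-correlation chain idea described above, the sketch below greedily orders targets by absolute pairwise correlation and feeds that order to a chained SVR. It is a minimal reconstruction under stated assumptions, not the dissertation's implementation: the greedy ordering heuristic and the use of scikit-learn's RegressorChain and SVR are assumptions.

```python
import numpy as np
from sklearn.multioutput import RegressorChain
from sklearn.svm import SVR

def max_correlation_order(Y):
    """Greedy heuristic (an assumption, not the thesis's exact rule):
    start from a target in the most-correlated pair, then repeatedly
    append the remaining target most correlated with the last one."""
    corr = np.abs(np.corrcoef(Y, rowvar=False))
    np.fill_diagonal(corr, -1.0)  # ignore self-correlation
    start = int(np.unravel_index(np.argmax(corr), corr.shape)[0])
    order, remaining = [start], set(range(Y.shape[1])) - {start}
    while remaining:
        nxt = max(remaining, key=lambda j: corr[order[-1], j])
        order.append(nxt)
        remaining.remove(nxt)
    return order

# Placeholder data; a single chained SVR is built along the chain order.
X, Y = np.random.rand(200, 10), np.random.rand(200, 4)
chain = RegressorChain(SVR(kernel="rbf"), order=max_correlation_order(Y))
chain.fit(X, Y)
```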

    Incremental Market Behavior Classification in Presence of Recurring Concepts

    In recent years, the problem of concept drift has gained importance in the financial domain. The succession of manias, panics and crashes has stressed the non-stationary nature and the likelihood of drastic structural or concept changes in the markets. Traditional systems are unable or slow to adapt to these changes. Ensemble-based systems are widely known for their good results predicting both cyclic and non-stationary data such as stock prices. In this work, we propose RCARF (Recurring Concepts Adaptive Random Forests), an ensemble tree-based online classifier that handles recurring concepts explicitly. The algorithm extends the capabilities of a version of Random Forest for evolving data streams, adding on top a mechanism to store and handle a shared collection of inactive trees, called the concept history, which holds memories of the way market operators reacted in similar circumstances. This works in conjunction with a decision strategy that reacts to drift by replacing active trees with the best available alternative: either a previously stored tree from the concept history or a newly trained background tree. Both mechanisms are designed to provide fast reaction times and are thus applicable to high-frequency data. The experimental validation of the algorithm is based on the prediction of price movement directions one second ahead in the SPDR (Standard & Poor's Depositary Receipts) S&P 500 Exchange-Traded Fund. RCARF is benchmarked against other popular methods from the incremental online machine learning literature and is able to achieve competitive results. This research was funded by the Spanish Ministry of Economy and Competitiveness under grant number ENE2014-56126-C2-2-R.
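    A minimal sketch of the drift-handling step described above: on a drift signal, the active tree is swapped for whichever alternative, the freshly trained background tree or a tree recalled from the concept history, scores best on recent data. The function names and the accuracy-on-a-recent-window criterion are assumptions for illustration, not RCARF's actual API.

```python
def window_accuracy(tree, window):
    """Fraction of correct predictions on a recent (x, y) window."""
    return sum(tree.predict(x) == y for x, y in window) / len(window)

def on_drift(active, background, concept_history, recent_window):
    """Hypothetical RCARF-style drift reaction: pick the best available
    alternative and retire the current active tree into the history."""
    candidates = [background] + list(concept_history)
    best = max(candidates, key=lambda t: window_accuracy(t, recent_window))
    if best in concept_history:
        concept_history.remove(best)  # recalled tree becomes active again
    concept_history.append(active)    # store the drifted tree for reuse
    return best
```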

    Online transfer learning for concept drifting data streams


    Forecasting COVID-19 Cases Using Dynamic Time Warping and Incremental Machine Learning Methods

    The investment of time and resources in developing better strategies is key to dealing with future pandemics. In this work, we recreated the situation of COVID-19 across the year 2020, when the pandemic started spreading worldwide. We conducted experiments to predict the coronavirus cases for the 50 countries with the most cases during 2020. We compared the performance of state-of-the-art machine learning algorithms, such as long short-term memory networks, against that of online incremental machine learning algorithms. To find the best strategy, we performed experiments to test three different approaches. In the first approach (single-country), we trained each model using data only from the country we were predicting. In the second one (multiple-country), we trained a model using the data from the 50 countries and used that model to predict each of the 50 countries. In the third experiment, we first applied clustering to find the nine countries most similar to the country we were predicting. We consider two countries to be similar if the differences between their COVID-19 case curves are small. To do so, we used time series similarity measures (TSSM) such as Euclidean Distance (ED) and Dynamic Time Warping (DTW). TSSM return a real value representing the distance between two time series, which can be interpreted as how similar they are. Then, we trained the models with the data from the nine countries most similar to the one being predicted, plus the predicted country itself. We used the ARIMA model as a baseline for our results. Results show that the idea of using TSSM is a very effective approach. Using ED, the RMSE obtained in the single-country and multiple-country approaches was reduced by 74.21% and 74.70%, respectively; using DTW, the RMSE was reduced by 74.89% and 75.36%. The main advantage of our methodology is that it is very simple and fast to apply, since it is based only on time series data, as opposed to more complex methodologies that require a deep and thorough study of the number of parameters involved in the spread of the virus and their corresponding values. We made our code public to allow other researchers to explore our proposed methodology.
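    The country-similarity step lends itself to a compact illustration. Below is a minimal sketch: a textbook O(nm) dynamic-programming DTW distance and a helper that picks the nine most similar countries. The helper's name and the absence of any curve preprocessing (e.g., normalisation) are assumptions; the paper's released code may differ.

```python
import numpy as np

def dtw(a, b):
    """Classic dynamic time warping distance with absolute-difference cost."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

def nine_most_similar(target_curve, other_curves):
    """other_curves: dict mapping country name -> daily case curve.
    Returns the nine countries whose curves are nearest by DTW."""
    return sorted(other_curves, key=lambda c: dtw(target_curve, other_curves[c]))[:9]
```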

    Adaptive Algorithms For Classification On High-Frequency Data Streams: Application To Finance

    In recent years, the problem of concept drift has gained importance in the financial domain. The succession of manias, panics and crashes has stressed the non-stationary nature and the likelihood of drastic structural changes in financial markets. The most recent literature suggests the use of conventional machine learning and statistical approaches for this. However, these techniques are unable or slow to adapt to non-stationarities and may require re-training over time, which is computationally expensive and brings financial risks. This thesis proposes a set of adaptive algorithms to deal with high-frequency data streams and applies them to the financial domain. We present approaches to handle different types of concept drift and perform predictions using up-to-date models. These mechanisms are designed to provide fast reaction times and are thus applicable to high-frequency data. The core experiments of this thesis are based on the prediction of the price movement direction at different intraday resolutions in the SPDR S&P 500 exchange-traded fund. The proposed algorithms are benchmarked against other popular methods from the data stream mining literature and achieve competitive results. We believe that this thesis opens good research prospects for financial forecasting during market instability and structural breaks. Results have shown that our proposed methods can improve prediction accuracy in many of these scenarios. Indeed, the results obtained are consistent with arguments against the efficient market hypothesis. However, we cannot claim to consistently beat buy-and-hold; therefore, we cannot reject it.

    Ensemble based on randomised neural networks for online data stream regression in presence of concept drift

    The big data paradigm has posed new challenges for machine learning algorithms, such as analysing continuous flows of data, in the form of data streams, and dealing with the evolving nature of the data, a phenomenon often referred to in the literature as concept drift. Concept drift is caused by inconsistencies between the optimal hypotheses in two successive chunks of data, whereby the concept underlying a given process evolves over time; this can happen due to several factors, including changes in consumer preferences, economic dynamics, or environmental conditions. This thesis explores the problem of data stream regression in the presence of concept drift. This problem requires computationally efficient algorithms that are able to adapt to the various types of drift that may affect the data. The development of effective algorithms for data streams with concept drift requires several steps that are discussed in this research. The first is related to the datasets required to assess the algorithms. In general, it is not possible to determine the occurrence of concept drift in real-world datasets; therefore, synthetic datasets in which the various types of concept drift can be simulated are required. The second issue is related to the choice of the algorithm. Ensemble algorithms show many advantages for dealing with concept-drifting data streams, including flexibility, computational efficiency and high accuracy. For the design of an effective ensemble, this research analyses the use of randomised Neural Networks as base models, along with their optimisation. The optimisation of randomised Neural Networks involves designing and tuning hyperparameters that may substantially affect their performance; optimising the base models is an important aspect of building highly accurate and computationally efficient ensembles. To cope with concept drift, existing methods either require setting fixed updating points, which may result in unnecessary computations or slow reaction to drift, or rely on a drift detection mechanism, which may be ineffective given the difficulty of detecting drift in real applications. The research contributions of this thesis therefore include a new approach for synthetic dataset generation, a new hyperparameter optimisation algorithm that reduces the search effort and the need for prior assumptions compared to existing methods, an analysis of the effects of randomised Neural Network hyperparameters, and a new ensemble algorithm based on a bagging meta-model that reduces the computational effort over existing methods and uses an innovative updating mechanism to cope with concept drift. The algorithms have been tested on synthetic datasets and validated on four real-world datasets from various application domains.
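    For readers unfamiliar with randomised Neural Networks as base models, here is a hedged, ELM-style sketch: the hidden layer is random and untrained, and only the output weights are fitted, in closed form, by regularised least squares. This is a generic stand-in under stated assumptions; the thesis's actual base model, hyperparameters, and ensemble wiring may differ.

```python
import numpy as np

class RandomisedNN:
    """Randomised network regression: random hidden weights, output
    weights fitted by ridge-regularised least squares (a sketch)."""

    def __init__(self, n_hidden=100, reg=1e-3, seed=0):
        self.n_hidden, self.reg = n_hidden, reg
        self.rng = np.random.default_rng(seed)

    def fit(self, X, y):
        # Hidden weights are drawn once and never trained.
        self.W = self.rng.normal(size=(X.shape[1], self.n_hidden))
        self.b = self.rng.normal(size=self.n_hidden)
        H = np.tanh(X @ self.W + self.b)
        A = H.T @ H + self.reg * np.eye(self.n_hidden)
        self.beta = np.linalg.solve(A, H.T @ y)  # closed-form output weights
        return self

    def predict(self, X):
        return np.tanh(X @ self.W + self.b) @ self.beta
```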

    Genetic programming-based regression for temporal data

    Various machine learning techniques exist to perform regression on temporal data in which concept drift occurs. However, there are numerous non-stationary environments where these techniques may fail to either track or detect the changes. This study develops a genetic programming-based predictive model for temporal data with a numerical target that tracks changes in a dataset due to concept drift. When an environmental change is evident, the proposed algorithm reacts to the change by clustering the data and then inducing nonlinear models that describe the generated clusters. The nonlinear models become terminal nodes of genetic programming model trees. Experiments were carried out using seven non-stationary datasets, and the obtained results suggest that the proposed model yields high adaptation rates and accuracy under several types of concept drift. Future work will consider strengthening the adaptation to concept drift and a fast implementation of genetic programming on GPUs to provide fast learning for high-speed temporal data.
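    One way to picture the cluster-then-model step: when drift is flagged, the recent data are clustered and a nonlinear model is fitted per cluster, each of which would then serve as a terminal node of a GP model tree. The sketch below uses k-means and polynomial ridge regression purely as stand-ins; the paper's clustering method and nonlinear model family are not specified here.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

def induce_cluster_models(X, y, n_clusters=3):
    """On detected drift: cluster the data, then fit one nonlinear
    model per cluster. The fitted models are candidate terminal
    nodes for GP model trees (stand-in components, see lead-in)."""
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(X)
    models = {}
    for k in range(n_clusters):
        mask = labels == k
        models[k] = make_pipeline(PolynomialFeatures(2), Ridge()).fit(X[mask], y[mask])
    return models

models = induce_cluster_models(np.random.rand(300, 5), np.random.rand(300))
```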

    A Review of Meta-level Learning in the Context of Multi-component, Multi-level Evolving Prediction Systems.

    The exponential growth of the volume, variety and velocity of data is raising the need to investigate automated or semi-automated ways of extracting useful patterns from the data. It requires deep expert knowledge and extensive computational resources to find the most appropriate mapping of learning methods for a given problem, and this becomes a challenge in the presence of numerous configurations of learning algorithms on massive amounts of data. There is thus a need for an intelligent recommendation engine that can advise on the best learning algorithm for a given dataset. The techniques commonly used by experts are based on a trial-and-error approach: evaluating and comparing a number of possible solutions against each other, drawing on prior experience in a specific domain, and so on. The trial-and-error approach, combined with the expert's prior knowledge, though computationally expensive and time-consuming, has often been shown to work for stationary problems where the processing is usually performed off-line. However, this approach would not normally be feasible for non-stationary problems where streams of data are continuously arriving. Furthermore, in a non-stationary environment, manually analysing data and testing various methods every time the underlying data distribution changes would be very difficult or simply infeasible. In that scenario, within an on-line predictive system, there are several tasks where Meta-learning can be used to effectively facilitate recommendations, including: 1) pre-processing steps, 2) learning algorithms or their combination, 3) adaptivity mechanisms and their parameters, 4) recurring concept extraction, and 5) concept drift detection. However, while conceptually very attractive and promising, Meta-learning leads to several challenges, with the appropriate representation of the problem at the meta-level being one of the key ones. The goal of this review and our research is, therefore, to investigate Meta-learning in general, and the associated challenges, in the context of automating the building, deployment and adaptation of multi-level, multi-component predictive systems that evolve over time.

    Large-Scale Traffic Flow Prediction Using Deep Learning in the Context of Smart Mobility

    Designing and developing a new generation of cities around the world (termed smart cities) is fast becoming one of the ultimate solutions to overcome cities' problems such as population growth, pollution, energy crises, and demand pressure on existing transportation infrastructure. One of the major aspects of a smart city is smart mobility. Smart mobility aims at improving transportation systems in several respects: city logistics, info-mobility, and people-mobility. The emergence of the Internet of Car (IoC) phenomenon, alongside the development of Intelligent Transportation Systems (ITSs), opens up opportunities for improving traffic management systems and assisting travelers and authorities in their decision-making. However, this has given rise to huge amounts of data originating from human-device and device-device interaction. This is both an opportunity and a challenge, and smart mobility will not meet its full potential unless valuable insights are extracted from these big data. Although the smart city environment and the IoC allow for the generation and exchange of large amounts of data, there are not yet well-defined and mature approaches for mining this wealth of information to benefit drivers and traffic authorities. The main reason is most likely related to fundamental challenges in dealing with big data of various types and uncertain frequency coming from diverse sources. In particular, the issues of data types and of uncertainty analysis in the predictions are indicated as the most challenging areas of study that have not yet been tackled. Important issues such as the nature of the data, i.e., stationary or non-stationary, and the prediction task, i.e., short-term or long-term, should also be taken into consideration. Based on this observation, a data-driven traffic flow prediction framework within the context of a big data environment is proposed in this thesis. The main goal of this framework is to enhance the quality of traffic flow predictions, which can be used to assist travelers and traffic authorities in the decision-making process (whether for travel or management purposes). The proposed framework is focused on four main aspects that tackle major data-driven traffic flow prediction problems: the fusion of hard data for traffic flow prediction; the fusion of soft data for traffic flow prediction; prediction of non-stationary traffic flow; and prediction of multi-step traffic flow. All these aspects are investigated and formulated as computational tools/algorithms/approaches adequately tailored to the nature of the data at hand. The first tool tackles the inherent big data problems and deals with the uncertainty in the prediction. It relies on the ability of deep learning approaches to handle huge amounts of data generated by a large-scale and complex transportation system with limited prior knowledge. Furthermore, motivated by the close correlation between road traffic and weather conditions, a novel deep-learning-based approach that predicts traffic flow by fusing traffic history and weather data is proposed. The second tool fuses streams of data (hard data) and event-based data (soft data) using Dempster-Shafer Evidence Theory (DSET). One of the main features of DSET is its ability to capture uncertainties in probabilities. Subsequently, an extension of DSET, namely Dempster's conditional rules for updating belief, is used to fuse traffic prediction beliefs coming from streams of data and event-based data sources. The third tool consists of a method to detect non-stationarities in the traffic flow and an algorithm to perform online adaptations of the traffic prediction model. The proposed detection approach works by monitoring the evolution of the spectral contents of the traffic flow. Furthermore, the approach is specifically developed to work in conjunction with state-of-the-art machine learning methods such as Deep Neural Networks (DNNs). By combining the power of frequency-domain features with the known generalization capability and scalability of DNNs in handling real-world data, high prediction performance is expected. The last tool is developed to improve multi-step traffic flow prediction in the recursive and multi-output settings. In the recursive setting, an algorithm that augments the information about the current time-step is proposed. This algorithm, called Conditional Data as Demonstrator (C-DaD), is an extension of an algorithm called Data as Demonstrator (DaD). Furthermore, in the multi-output setting, a novel approach is developed that generates new history-future pairs of data, aggregated with the original training data using a Conditional Generative Adversarial Network (C-GAN). To demonstrate the capabilities of the proposed approaches, a series of experiments using artificial and real-world data are conducted. Each of the proposed approaches is compared with state-of-the-art or currently existing approaches.
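    The DSET fusion step rests on Dempster's rule of combination, which the sketch below implements for two mass functions whose focal elements are frozensets; the normalisation by 1 - K redistributes the conflict mass K. This is the standard textbook rule, not the thesis's specific conditional-update variant, and the example hypotheses are invented for illustration.

```python
def dempster_combine(m1, m2):
    """Dempster's rule: combine two mass functions (dict: frozenset -> mass)."""
    combined, conflict = {}, 0.0
    for A, w1 in m1.items():
        for B, w2 in m2.items():
            inter = A & B
            if inter:
                combined[inter] = combined.get(inter, 0.0) + w1 * w2
            else:
                conflict += w1 * w2  # mass assigned to contradictory pairs
    if conflict >= 1.0:
        raise ValueError("total conflict; sources cannot be combined")
    return {A: w / (1.0 - conflict) for A, w in combined.items()}

# Hypothetical traffic-state beliefs from a hard and a soft source:
m_hard = {frozenset({"congested"}): 0.7, frozenset({"congested", "free"}): 0.3}
m_soft = {frozenset({"free"}): 0.4, frozenset({"congested", "free"}): 0.6}
fused = dempster_combine(m_hard, m_soft)
```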