75 research outputs found

    Expressive and modular rule-based classifier for data streams

    The advances in computing software, hardware, connected devices and wireless communication infrastructure in recent years have led to the desire to work with streaming data sources. Yet the number of techniques, approaches and algorithms that can work with data from a streaming source is still very limited compared with those for batched data. Although data mining has been a well-studied topic of knowledge discovery for decades, many unique properties of, and challenges in, learning from a data stream have not been considered properly, despite the growing presence of streaming data sources and the real need to mine information from them. This thesis aims to contribute to the field by developing a rule-based algorithm that learns classification rules directly from data streams, such that the learned rules are expressive and a human user can easily interpret the concept and rationale behind the model’s predictions. There are two main structures for representing a classification model: the ‘tree-based’ structure and the ‘rule-based’ structure. Even though both forms of representation are popular and well known in traditional data mining, they differ in interpretability and model quality in certain circumstances. The first part of this thesis analyses background work and relevant topics in learning classification rules from data streams. This study identifies the essential requirements for producing high-quality classification rules from data streams and shows why many systems, algorithms and techniques designed for classifying a static dataset are not applicable in a streaming environment. The second part of the thesis investigates a new technique to improve the efficiency and accuracy of learning heuristics from numeric features in a streaming data source.
Computational cost is one of the most important factors for an effective and practical learning algorithm, because of the need to learn from continuous arrivals of data examples sequentially and to discard examples once seen. If the computational cost is too high, the learner may not be able to keep pace with high-velocity and possibly unbounded data streams. The proposed technique is first discussed in the context of using Gaussian distributions as heuristics for building rule terms on numeric features. An empirical evaluation then shows the successful integration of the proposed technique into eRules, an existing rule-based algorithm for data streams. Continuing the topic of rule-based algorithms for classifying data streams, the use of Hoeffding’s Inequality addresses another problem in learning from a data stream: how much data should be seen before learning starts, and how to keep the model updated over time. By incorporating Hoeffding’s Inequality, this study presents the Hoeffding Rules algorithm, which induces modular rules directly from a streaming data source with dynamic window sizes throughout the learning period, ensuring efficiency and robustness towards concept drift. Concept drift is another unique challenge in mining data streams, in which the underlying concept of the data can change either gradually or abruptly over time, and the learner should adapt to these changes as quickly as possible. This research focuses on the development of a rule-based algorithm, Hoeffding Rules, which treats streaming environments as primary data sources and addresses several unique challenges in learning rules from data streams, such as concept drift and computational efficiency.
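The Gaussian heuristic mentioned above lends itself to constant-time-per-example updates: per class, the mean and variance of a numeric feature can be maintained incrementally (e.g. with Welford's method), so no seen examples need to be stored. A minimal sketch of that idea; the class and method names are illustrative, not the thesis's exact formulation:

```python
import math

class RunningGaussian:
    """Incrementally tracks the mean and variance of a numeric feature
    (Welford's method), so no stream examples need to be stored."""

    def __init__(self):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0

    def update(self, x: float) -> None:
        # O(1) update per arriving example -- suited to streaming.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def pdf(self, x: float) -> float:
        """Class-conditional density, usable as a heuristic score for
        candidate rule terms on this feature."""
        var = self.m2 / (self.n - 1) if self.n > 1 else 1.0
        return math.exp(-(x - self.mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)
```

One such tracker per (class, numeric feature) pair is enough to score candidate rule terms without revisiting past examples.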
More broadly, this research underlines the need for, and importance of, interpretable machine learning models, applying new techniques to mine useful insights from potentially high-velocity, high-volume and unbounded data streams. It complements existing work on learning classification rules from data streams, addressing some of the challenges unique to data streams compared with conventional batch data, and provides the knowledge needed to systematically and effectively learn expressive and modular classification rules from data streams.
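The Hoeffding bound referenced in this abstract states that after n independent observations of a random variable with range R, the true mean lies within ε = sqrt(R² ln(1/δ) / (2n)) of the observed mean with probability at least 1 − δ. A short sketch of how that yields a data-driven window size (the function names are illustrative, not from the thesis):

```python
import math

def hoeffding_bound(value_range: float, delta: float, n: int) -> float:
    """Error bound epsilon after n observations of a variable with the
    given range: the true mean is within epsilon of the observed mean
    with probability at least 1 - delta."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))

def samples_needed(value_range: float, delta: float, epsilon: float) -> int:
    """Smallest window size n for which the bound drops below epsilon --
    one way to decide how much of a stream to see before inducing rules."""
    return math.ceil(value_range ** 2 * math.log(1.0 / delta) / (2.0 * epsilon ** 2))
```

Because the bound shrinks as n grows, a learner can keep widening its window until the bound is tight enough to commit to a rule.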

    Solving the challenges of concept drift in data stream classification.

    The rise of network-connected devices and applications has led to a significant increase in the volume of data that are continuously generated over time, called data streams. In real-world applications, storing the entirety of a data stream for later analysis is often not practical, due to the stream’s potentially infinite volume. Data stream mining techniques and frameworks have therefore been created to analyze streaming data as they arrive. However, compared to traditional data mining, challenges unique to data stream mining emerge, due to the high arrival rate of data streams and their dynamic nature. In this dissertation, an array of techniques and frameworks is presented to improve solutions to some of these challenges. First, this dissertation acknowledges that a “no free lunch” theorem exists for data stream mining: no silver-bullet solution can solve all of its problems. The dissertation focuses on detecting changes in data distribution, called concept drift. Concept drift can be categorized into many types, and a detection algorithm often works only on some types of drift, not all of them. Because of this, the dissertation develops specific techniques for specific challenges instead of looking for a general solution. Next, this dissertation considers the challenges posed by the high arrival rate of data streams. Data stream mining frameworks often need to process vast amounts of data samples in limited time, and some data mining activities, notably labeling data samples for classification, are too costly or too slow at such a scale. This dissertation presents two techniques that reduce the amount of labeling needed for data stream classification. The first is a grid-based label selection process for highly imbalanced data streams, in which one class of data samples vastly outnumbers another.
Because of this imbalance, many majority-class samples must be labeled before a minority-class sample is found. The presented technique divides the data samples into groups, called grids, and actively searches for nearby minority-class samples within a grid. Experiment results show the technique reduces the total number of data samples that need to be labeled. The second technique is a smart preprocessing step that reduces the number of times a new learning model must be trained because of concept drift. Less model training means fewer labels are required, and thus lower cost. Experiment results show that in some cases the reduced performance of learning models results from improper preprocessing of the data, not from concept drift; by adapting preprocessing to changes in the data stream, models can retain high performance without retraining. Acknowledging the high cost of labeling, the dissertation then considers the scenario where labels are unavailable when needed. The framework Sliding Reservoir Approach for Delayed Labeling (SRADL) is presented to explore solutions to this problem: concept drift occurs, but no labels are immediately available. SRADL uses semi-supervised learning, employing a sliding-window approach to store historical data, which is combined with new unlabeled data to train models. Experiments show that SRADL performs well in some cases of delayed labeling. Next, the dissertation considers the challenge of dynamism within data streams, most notably concept drift. The complex nature of concept drift means that most existing detection algorithms can detect only limited types of it. To detect more types of concept drift, an ensemble approach that employs various algorithms, called the Heuristic Ensemble Framework for Concept Drift Detection (HEFDD), is presented.
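The grid-based selection idea described above can be sketched roughly as follows. This is an illustrative reconstruction under stated assumptions, not the dissertation's exact procedure: samples are hashed into grid cells by quantizing feature values, and cells already known to contain minority-class samples are queried first, so fewer majority-class labels are spent:

```python
from collections import defaultdict

def assign_grid(sample, cell_size=1.0):
    """Map a feature vector to a grid cell by quantizing each feature."""
    return tuple(int(x // cell_size) for x in sample)

def select_for_labeling(unlabeled, minority_cells, cell_size=1.0, budget=10):
    """Prefer unlabeled samples in cells where minority samples were already
    seen, so fewer majority samples are labeled before a minority one appears."""
    by_cell = defaultdict(list)
    for sample in unlabeled:
        by_cell[assign_grid(sample, cell_size)].append(sample)
    selected = []
    # Query cells known to contain minority-class samples first.
    for cell in minority_cells:
        selected.extend(by_cell.get(cell, []))
    # Fall back to remaining cells until the labeling budget is spent.
    for cell, samples in by_cell.items():
        if cell not in minority_cells:
            selected.extend(samples)
    return selected[:budget]
```

The payoff is that the labeling budget concentrates where minority samples are likely, instead of being spread uniformly over a stream dominated by the majority class.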
The occurrence of each type of concept drift is voted on by each detection algorithm in the ensemble, and a drift type is declared detected when it receives a majority of votes. Experiment results show that HEFDD improves detection accuracy significantly while reducing false positives. With the ability to detect various types of concept drift provided by HEFDD, the dissertation then improves the delayed labeling framework SRADL. A new combined framework, SRADL-HEFDD, is presented, which produces synthetic labels to handle the unavailability of labels from human experts; it employs different synthetic labeling techniques depending on the type of drift detected by HEFDD. Experimental results show that, compared to the default SRADL, the combined framework improves prediction performance when only a small number of labeled samples is available. Finally, as machine learning applications are increasingly used in critical domains such as medical diagnostics, the accountability, explainability and interpretability of machine learning algorithms need to be considered. Explainable machine learning takes a white-box approach to data analytics, enabling learning models to be explained and interpreted by human users. However, few studies have examined explaining what has changed in a dynamic data stream environment. This dissertation therefore presents the Data Stream Explainability (DSE) framework, which visualizes changes in data distribution and model classification boundaries between chunks of streaming data. These visualizations can then be used by a data mining researcher to generate explanations of what has changed within the data stream. To show that DSE can help average users better understand data stream mining, a survey was conducted with an expert group and a non-expert group. Results show that DSE can reduce the gap in understanding between the two groups.
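The majority-voting scheme described above can be illustrated with a minimal sketch; the detector interface and names here are assumptions for illustration, not taken from the dissertation:

```python
from collections import Counter

def ensemble_drift_vote(detector_outputs):
    """detector_outputs: one drift-type label (or None for 'no drift')
    per detector in the ensemble, for the current window.

    A drift type is declared detected only when a strict majority of
    all detectors voted for it, which suppresses single-detector
    false alarms."""
    votes = Counter(d for d in detector_outputs if d is not None)
    majority = len(detector_outputs) // 2 + 1
    return {drift_type for drift_type, n in votes.items() if n >= majority}
```

For example, with five detectors, a drift type needs at least three votes before it is reported.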

    New perspectives and methods for stream learning in the presence of concept drift.

    Applications that generate data in the form of fast streams from non-stationary environments, that is, those where the underlying phenomena change over time, are becoming increasingly prevalent. In this kind of environment the probability density function of the data-generating process may change over time, producing a drift. This causes predictive models trained over these stream data to become obsolete and fail to adapt to the new distribution. Especially in online learning scenarios, there is a pressing need for new algorithms that adapt to this change as fast as possible while maintaining good performance scores. Examples of these applications include making inferences or predictions based on financial data, energy demand and climate data analysis, web usage or sensor network monitoring, and malware/spam detection, among many others. Online learning and concept drift are two of the hottest topics in the recent literature due to their relevance for the so-called Big Data paradigm, where we now find an increasing number of applications based on continuously available training data, known as data streams. Thus, learning in non-stationary environments requires adaptive or evolving approaches that can monitor and track the underlying changes and adapt a model to accommodate them accordingly. In this effort, I provide in this thesis a comprehensive review of state-of-the-art approaches, identify the most relevant open challenges in the literature, and focus on addressing three of them through innovative perspectives and methods. This thesis gives a complete overview of several related fields and tackles several open challenges identified in the very recent state of the art.
Concretely, it presents an innovative way to generate artificial diversity in ensembles, a set of necessary adaptations and improvements for spiking neural networks so they can be used in online learning scenarios, and finally, a drift detector based on this algorithm. Together, these approaches constitute an innovative work aimed at presenting new perspectives and methods for the field.

    Novel support vector machines for diverse learning paradigms

    This dissertation introduces novel support vector machines (SVM) for the following traditional and non-traditional learning paradigms: online classification, multi-target regression, multiple-instance classification, and data stream classification. Three multi-target support vector regression (SVR) models are first presented. The first builds independent, single-target SVR models for each target. The second builds an ensemble of randomly chained models using the first single-target method as a base model. The third calculates the targets' correlations and forms a maximum-correlation chain, which is used to build a single chained SVR model, improving prediction performance while reducing computational complexity. Under the multi-instance paradigm, a novel SVM multiple-instance formulation and an algorithm with a bag-representative selector, named Multi-Instance Representative SVM (MIRSVM), are presented. The contribution trains the SVM on bag-level information and identifies the instances that most influence classification, i.e. bag representatives, for both positive and negative bags, while finding the optimal class-separation hyperplane. Unlike other multi-instance SVM methods, this approach eliminates possible class imbalance issues by allowing both positive and negative bags to have at most one representative, which constitutes the most contributing instance to the model. Due to the shortcomings of current popular SVM solvers, especially in the context of large-scale learning, the third contribution presents a novel stochastic, i.e. online, learning algorithm for solving the L1-SVM problem in the primal domain, dubbed OnLine Learning Algorithm using Worst-Violators (OLLAWV). Unlike other stochastic methods, this algorithm provides a novel stopping criterion and eliminates the need for a regularization term, using early stopping instead.
Because of these characteristics, OLLAWV was shown to efficiently produce sparse models while maintaining competitive accuracy. OLLAWV's online nature and its success in traditional classification inspired its implementation, along with that of its predecessor, OnLine Learning Algorithm - List 2 (OLLA-L2), in the batch data stream classification setting. These two algorithms were chosen because their properties are a natural remedy for the time and memory constraints that arise from the data stream problem: OLLA-L2's low spatial complexity deals with the memory constraints imposed by the data stream setting, while OLLAWV's fast run time, early self-stopping capability, and ability to produce sparse models address both memory and time constraints. Preliminary results for OLLAWV showed superior performance to its predecessor, so it was used in the final set of experiments against current popular data stream methods. Rigorous experimental studies and statistical analyses over various metrics and datasets were conducted to comprehensively compare the proposed solutions against modern, widely used methods from all paradigms. These studies and analyses confirm that the proposals achieve better performance and more scalable solutions than the compared methods, making them competitive in their respective fields.
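OLLAWV itself is the dissertation's contribution and is not reproduced here. As a generic illustration of the stochastic primal SVM training it builds on, the standard Pegasos-style sub-gradient update on the hinge loss looks like this (a textbook sketch, not OLLAWV's worst-violator selection or stopping rule):

```python
import random

def sgd_linear_svm(data, labels, lam=0.01, epochs=20, seed=0):
    """Stochastic sub-gradient descent for a linear SVM in the primal:
    minimize lam/2 * ||w||^2 + mean hinge loss. Labels must be +1/-1."""
    rng = random.Random(seed)
    w = [0.0] * len(data[0])
    t = 0
    for _ in range(epochs):
        for i in rng.sample(range(len(data)), len(data)):
            t += 1
            eta = 1.0 / (lam * t)  # decaying step size
            x, y = data[i], labels[i]
            margin = y * sum(wj * xj for wj, xj in zip(w, x))
            if margin < 1:  # hinge loss active: shrink w and step toward y*x
                w = [(1 - eta * lam) * wj + eta * y * xj for wj, xj in zip(w, x)]
            else:           # hinge loss inactive: only the shrinkage applies
                w = [(1 - eta * lam) * wj for wj in w]
    return w

def predict(w, x):
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) >= 0 else -1
```

Each update touches one example and then discards it, which is the property that makes this family of solvers attractive under data stream memory and time constraints.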

    Ensembles for Time Series Forecasting


    Machine Learning based Restaurant Sales Forecasting

    To support proper employee scheduling for managing crew load, restaurants need accurate sales forecasting. We predict partitions of sales days: each day is broken into three sales periods, 10:00 AM-1:59 PM, 2:00 PM-5:59 PM, and 6:00 PM-10:00 PM. This study focuses on the middle timeslot, where sales forecasts should extend for one week. We gather three years of sales (2016-2019) from a local restaurant to generate a new dataset for researching sales forecasting methods. The methodologies used to go from raw data to a workable dataset are outlined. We test many machine learning models on the dataset, including recurrent neural network models, and extend the test domain with methods that remove trend and seasonality. The best model for one-day forecasting is ridge regression, with an MAE of 214; the best for one-week forecasting is the temporal fusion transformer, with an MAE of 216.
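Two of the steps mentioned above, removing trend or seasonality by differencing and scoring forecasts with mean absolute error, can be sketched as follows (the restaurant dataset and models themselves are not reproduced; function names are illustrative):

```python
def difference(series, lag=1):
    """Remove trend (lag=1) or, e.g., weekly seasonality (lag=7 for
    daily data) by subtracting the value `lag` steps earlier."""
    return [series[i] - series[i - lag] for i in range(lag, len(series))]

def invert_difference(last_values, diffs, lag=1):
    """Rebuild forecasts on the original scale from differenced ones,
    given the last `lag` observed values of the original series."""
    out = list(last_values)
    for d in diffs:
        out.append(out[-lag] + d)
    return out[len(last_values):]

def mae(actual, predicted):
    """Mean absolute error, the metric reported in the study."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)
```

A model is trained on the differenced series, and its predictions are inverted back to the original scale before the MAE is computed.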