4 research outputs found

    DetectA: abrupt concept drift detection in non-stationary environments

    Almost all drift detection mechanisms designed for classification problems work reactively: after receiving the complete data set (input patterns and class labels), they apply a sequence of procedures to identify a change in the class-conditional distribution, that is, a concept drift. However, detecting a change only after it has occurred can, in some situations, harm the process under analysis. This paper proposes a proactive approach to abrupt drift detection, called DetectA (Detect Abrupt Drift). Briefly, the method has three steps: (i) label the patterns in the test set (an unlabelled data block) using an unsupervised method; (ii) compute statistics from the train and test sets, conditioned on the class labels given for the train set; and (iii) compare the training and testing statistics using a multivariate hypothesis test. Based on the results of the hypothesis tests, we attempt to detect drift in the test set before the real labels are obtained. A procedure for creating datasets with abrupt drift is proposed in order to perform a sensitivity analysis of the DetectA model. The sensitivity analysis suggests that the detector is efficient and suitable for high-dimensional datasets, blocks with any proportion of drifts, and datasets with class imbalance. The performance of DetectA, under different configurations, was also evaluated on real and artificial datasets, using an MLP as the classifier. The best results were obtained using one of the detection methods, with the proactive approach being a top contender for improving the accuracy of the underlying base classifier.
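
The three steps in the abstract above can be illustrated with a minimal sketch. It is not the DetectA implementation: the abstract does not name the unsupervised method or the hypothesis test, so this sketch assumes nearest-centroid pseudo-labelling for step (i), per-class mean vectors for step (ii), and a permutation test on the distance between mean vectors (with a Bonferroni correction) for step (iii). All function names and parameters are hypothetical.

```python
import random
from statistics import fmean


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((u - v) ** 2 for u, v in zip(a, b))


def mean_vector(rows):
    """Per-feature mean of a list of patterns."""
    return [fmean(col) for col in zip(*rows)]


def class_means(X, y):
    """Step (ii), train side: mean vector of each labelled class."""
    groups = {}
    for row, label in zip(X, y):
        groups.setdefault(label, []).append(row)
    return {label: mean_vector(rows) for label, rows in groups.items()}


def pseudo_label(X_test, centroids):
    """Step (i), assumed method: assign each pattern to the nearest train centroid."""
    return [min(centroids, key=lambda c: sq_dist(row, centroids[c])) for row in X_test]


def permutation_pvalue(A, B, n_perm, rng):
    """Step (iii), assumed test: permutation test on the distance between means."""
    observed = sq_dist(mean_vector(A), mean_vector(B))
    pooled = list(A) + list(B)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if sq_dist(mean_vector(pooled[:len(A)]), mean_vector(pooled[len(A):])) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


def detect_abrupt_drift(X_train, y_train, X_test, alpha=0.05, n_perm=500, seed=0):
    """Flag drift in an unlabelled block before its real labels arrive."""
    rng = random.Random(seed)
    centroids = class_means(X_train, y_train)
    y_pseudo = pseudo_label(X_test, centroids)
    pvalues = {}
    for label in centroids:
        train_rows = [r for r, l in zip(X_train, y_train) if l == label]
        test_rows = [r for r, l in zip(X_test, y_pseudo) if l == label]
        if len(test_rows) < 2:
            continue  # too few pseudo-labelled patterns to test this class
        pvalues[label] = permutation_pvalue(train_rows, test_rows, n_perm, rng)
    # Bonferroni correction across the per-class tests (an assumption of this sketch).
    drift = any(p < alpha / max(len(pvalues), 1) for p in pvalues.values())
    return drift, pvalues
```

Calling `detect_abrupt_drift(X_train, y_train, X_block)` returns a drift flag and the per-class p-values. A class that receives fewer than two pseudo-labelled patterns is skipped here, although in practice an empty class could itself be a sign of drift.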

    Improvement on concept drift detection for online data streams

    Advisors: Ana Estela Antunes da Silva, André Leon Sampaio Gradvohl. Dissertation (master's), Universidade Estadual de Campinas, Faculdade de Tecnologia. Abstract: Classic data mining algorithms can show limited capacity when applied to online data streams. This happens because an online data stream does not behave statically: the amount of data that will arrive, its arrival rate, and the duration of the stream are usually unknown and can change over time. In addition, in real application environments the data pattern can also change over time. This change in the data pattern is called Concept Drift, and it makes classic data mining algorithms inadvisable for this task. It is therefore important to develop algorithms capable of handling situations in which classic data mining algorithms do not perform satisfactorily. Based on these challenges, researchers have sought to develop algorithms that identify Concept Drifts quickly, since this prevents the large loss of accuracy caused by failing to identify a new pattern in the data instances. It is also important that the algorithm be fast, so that data instances that have not yet been processed do not need to be held in temporary memory. Motivated by these challenges, this work proposes three improvements to Concept Drift detection in online data streams: Fading, Reduced Boundary, and improved management of the data window of the base algorithm used in this work, EDIST2 (KHAMASSI, SAYED-MOUCHAWEH et al., 2015). With these improvements it was possible, in some execution scenarios, to reduce CPU time and RAM consumption and to improve average accuracy relative to EDIST2. The results can be considered promising, since EDIST2 itself outperformed well-known data mining algorithms such as DDM, EDDM and ADWIN in terms of average accuracy, CPU time and RAM consumption. Degree: Mestrado (master's), Sistemas de Informação e Comunicação; Mestre em Tecnologi

    Solving the challenges of concept drift in data stream classification.

    The rise of network-connected devices and applications has led to a significant increase in the volume of data that are continuously generated over time, called data streams. In real-world applications, storing the entirety of a data stream for later analysis is often impractical because of the data stream’s potentially infinite volume. Data stream mining techniques and frameworks are therefore created to analyze streaming data as they arrive. However, compared to traditional data mining, data stream mining also faces unique challenges, due to the high arrival rate of data streams and their dynamic nature. In this dissertation, an array of techniques and frameworks is presented to improve the solutions to some of these challenges. First, this dissertation acknowledges that a “no free lunch” theorem exists for data stream mining: no silver-bullet solution can solve all of its problems. The dissertation focuses on detecting changes in data distribution in data stream mining. These changes are called concept drift. Concept drift can be categorized into many types, and a detection algorithm often works on only some of them. Because of this, the dissertation seeks specific techniques to solve specific challenges, instead of looking for a general solution. Then, this dissertation considers improving solutions for the challenges posed by the high arrival rate of data streams. Data stream mining frameworks often need to process vast amounts of data samples in limited time. Some data mining activities, notably data sample labeling for classification, are too costly or too slow at such a large scale. This dissertation presents two techniques that reduce the amount of labeling needed for data stream classification. The first technique is a grid-based label selection process that applies to highly imbalanced data streams, in which one class of data samples vastly outnumbers another. 
Because of the imbalance, many majority class samples need to be labeled before a minority class sample can be found. The presented technique divides the data samples into groups, called grids, and actively searches for minority class samples that are close by within a grid. Experimental results show that the technique can reduce the total number of data samples that need to be labeled. The second technique is a smart preprocessing step that reduces the number of times a new learning model needs to be trained because of concept drift. Less model training means fewer data labels are required, and thus lower cost. Experimental results show that in some cases the reduced performance of learning models is the result of improper preprocessing of the data, not of concept drift. By adapting preprocessing to the changes in data streams, models can retain high performance without retraining. Acknowledging the high cost of labeling, the dissertation then considers the scenario where labels are unavailable when needed. The framework Sliding Reservoir Approach for Delayed Labeling (SRADL) is presented to explore solutions to this problem. SRADL addresses the delayed labeling problem, in which concept drift occurs and no labels are immediately available. SRADL uses semi-supervised learning, employing a sliding-window approach to store historical data, which is combined with new unlabeled data to train new models. Experiments show that SRADL performs well in some cases of delayed labeling. Next, the dissertation considers improving solutions for the challenge of dynamism within data streams, most notably concept drift. The complex nature of concept drift means that most existing detection algorithms can detect only a limited number of drift types. To detect more types of concept drift, an ensemble approach that employs various algorithms, called Heuristic Ensemble Framework for Concept Drift Detection (HEFDD), is presented. 
The occurrence of each type of concept drift is voted on using the detection results of each algorithm in the ensemble. Types of concept drift whose votes exceed a majority are then declared detected. Experimental results show that HEFDD is able to improve detection accuracy significantly while reducing false positives. With the ability to detect various types of concept drift provided by HEFDD, the dissertation then tries to improve the delayed labeling framework SRADL. A new combined framework, SRADL-HEFDD, is presented, which produces synthetic labels to handle the unavailability of labels from human experts. SRADL-HEFDD employs different synthetic labeling techniques depending on the type of drift detected by HEFDD. Experimental results show that, compared with the default SRADL, the combined framework improves prediction performance when only a small number of labeled samples is available. Finally, as machine learning applications are increasingly used in critical domains such as medical diagnostics, the accountability, explainability and interpretability of machine learning algorithms need to be considered. Explainable machine learning aims to use a white-box approach for data analytics, which enables learning models to be explained and interpreted by human users. However, few studies have addressed explaining what has changed in a dynamic data stream environment. This dissertation thus presents the Data Stream Explainability (DSE) framework. DSE visualizes changes in data distribution and model classification boundaries between chunks of streaming data. The visualizations can then be used by a data mining researcher to generate explanations of what has changed within the data stream. To show that DSE can help average users understand data stream mining better, a survey was conducted with an expert group and a non-expert group of users. Results show that DSE can narrow the gap between the two groups in understanding what has changed in a data stream.
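
The voting rule described for HEFDD (each algorithm in the ensemble votes on each drift type, and a type is declared detected once its votes pass a majority) can be sketched as follows. This is an illustration only, not the dissertation's code: the function name, the representation of each detector's output as a set of drift-type names, and the type names in the usage example are all assumptions.

```python
from collections import Counter


def majority_vote(detector_outputs):
    """Return the drift types reported by more than half of the detectors.

    detector_outputs: one collection of drift-type names per detector
    (a hypothetical representation chosen for this sketch).
    """
    votes = Counter()
    for detected_types in detector_outputs:
        votes.update(set(detected_types))  # one vote per detector per type
    threshold = len(detector_outputs) / 2
    return {drift_type for drift_type, n in votes.items() if n > threshold}
```

For example, with five detectors reporting `{"abrupt"}`, `{"abrupt", "gradual"}`, `set()`, `{"abrupt"}` and `{"gradual"}`, only `"abrupt"` (three votes out of five) passes the majority and is declared detected.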