9,849 research outputs found

K-means for Evolving Data Streams

Author: Bidaurrazaga A.
Capó M.
Pérez A.
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2021

Nowadays, streaming data analysis has become a relevant area of research in machine learning. Most of the data streams available are unlabeled, and thus it is necessary to develop specific clustering techniques that take into account the particularities of streaming data. In streaming data scenarios, the data is composed of an increasing sequence of batches of samples in which the concept drift phenomenon may occur. In this work, we formally define the streaming K-means (SKM) problem, which implies a restart of the error function when a concept drift occurs. An approximated error function that does not rely on concept drift detection is proposed. We prove that such a surrogate is a good approximation of the SKM error. Then, we introduce an algorithm to deal with the SKM problem by minimizing the surrogate error function each time a new batch arrives. Alternative initialization criteria are presented and theoretically analyzed for streaming data scenarios. Among them, we develop and theoretically analyze two initialization methods that search for the best trade-off between the importance given to the past batches and to the current batch. The experiments show that the proposed algorithm, with the proposed initialization criteria, obtains the best results when dealing with the SKM problem, without needing to detect when concept drift takes place.
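The idea of minimizing an error function that favors recent batches can be sketched as a weighted K-means fit. This is only an illustration under stated assumptions, not the authors' surrogate: the exponential forgetting factor `rho`, the helper names `weighted_lloyd` and `fit_stream`, and the choice of Lloyd's iterations are all hypothetical.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def weighted_lloyd(points, weights, centroids, iters=20):
    """Lloyd's algorithm in which each point contributes with its weight."""
    centroids = [tuple(c) for c in centroids]
    dim = len(centroids[0])
    for _ in range(iters):
        sums = [[0.0] * dim for _ in centroids]
        mass = [0.0] * len(centroids)
        for p, w in zip(points, weights):
            # Assign the point to its nearest centroid.
            j = min(range(len(centroids)), key=lambda k: sq_dist(p, centroids[k]))
            mass[j] += w
            for d, v in enumerate(p):
                sums[j][d] += w * v
        new = []
        for j, c in enumerate(centroids):
            # A cluster with zero weighted mass keeps its previous centroid.
            new.append(tuple(s / mass[j] for s in sums[j]) if mass[j] > 0 else c)
        if new == centroids:
            break
        centroids = new
    return centroids


def fit_stream(batches, rho, centroids):
    """Fit centroids on all batches (oldest first).

    Assumption: a batch that is `age` steps old gets weight rho ** age,
    so rho = 1 treats all batches equally and rho = 0 uses only the newest.
    """
    points, weights = [], []
    newest = len(batches) - 1
    for i, batch in enumerate(batches):
        for p in batch:
            points.append(p)
            weights.append(rho ** (newest - i))
    return weighted_lloyd(points, weights, centroids)
```

With `rho` close to 1 the fit behaves like K-means over the whole history; with `rho` close to 0 it follows the newest batch, which is the trade-off between past and current batches that the abstract refers to.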

BCAM's Institutional Repository Data

Request-and-Reverify: Hierarchical Hypothesis Testing for Concept Drift Detection with Expensive Labels

Author: Principe Jose C.
Wang Xiaoyang
Yu Shujian
Publication venue: 'International Joint Conferences on Artificial Intelligence'
Publication date: 28/06/2018

One important assumption underlying common classification models is the stationarity of the data. However, in real-world streaming applications, the data concept indicated by the joint distribution of feature and label is not stationary but drifts over time. Concept drift detection aims to detect such drifts and adapt the model so as to mitigate any deterioration in the model's predictive performance. Unfortunately, most existing concept drift detection methods rely on a strong and over-optimistic condition that the true labels are available immediately for all already classified instances. In this paper, a novel Hierarchical Hypothesis Testing framework with a Request-and-Reverify strategy is developed to detect concept drifts by requesting labels only when necessary. Two methods, namely Hierarchical Hypothesis Testing with Classification Uncertainty (HHT-CU) and Hierarchical Hypothesis Testing with Attribute-wise "Goodness-of-fit" (HHT-AG), are proposed under the novel framework. In experiments with benchmark datasets, our methods demonstrate overwhelming advantages over state-of-the-art unsupervised drift detectors. More importantly, our methods even outperform DDM (the widely used supervised drift detector) while using significantly fewer labels. Comment: Published as a conference paper at IJCAI 201

arXiv.org e-Print Archive

Crossref

Mining multi-dimensional concept-drifting data streams using Bayesian network classifiers

Author: Bielza Lozoya María Concepción
Borchani Hanen
Gama João
Larrañaga Múgica Pedro María
Publication venue: 'IOS Press'
Publication date: 01/01/2016

In recent years, a plethora of approaches have been proposed to deal with the increasingly challenging task of mining concept-drifting data streams. However, most of these approaches can only be applied to uni-dimensional classification problems, where each input instance has to be assigned to a single output class variable. The problem of mining multi-dimensional data streams, which include multiple output class variables, is largely unexplored, and only a few streaming multi-dimensional approaches have been introduced recently. In this paper, we propose a novel adaptive method, named Locally Adaptive-MB-MBC (LA-MB-MBC), for mining streaming multi-dimensional data. To this end, we use multi-dimensional Bayesian network classifiers (MBCs) as models. LA-MB-MBC monitors concept drift over time using the average log-likelihood score and the Page-Hinkley test. Then, if a concept drift is detected, LA-MB-MBC adapts the current MBC network locally around each changed node. An experimental study carried out using synthetic multi-dimensional data streams shows the merits of the proposed method in terms of concept drift detection as well as classification performance.
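For readers unfamiliar with the Page-Hinkley test mentioned above, the following is a minimal sketch of its standard form for detecting an upward shift in the mean of a monitored score. It is not the LA-MB-MBC implementation: the class name, the default values of `delta` and `threshold`, and the choice to feed it the negated average log-likelihood (so that a drop in likelihood shows up as a rise) are assumptions made for illustration.

```python
class PageHinkley:
    """Page-Hinkley test for an upward shift in the mean of a stream."""

    def __init__(self, delta=0.005, threshold=5.0):
        self.delta = delta          # magnitude of change that is tolerated
        self.threshold = threshold  # alarm threshold (often written lambda)
        self.n = 0
        self.mean = 0.0     # running mean of the observations
        self.cum = 0.0      # cumulative deviation m_t
        self.cum_min = 0.0  # running minimum M_t of the cumulative deviation

    def update(self, x):
        """Consume one observation; return True if drift is signalled."""
        self.n += 1
        self.mean += (x - self.mean) / self.n
        self.cum += x - self.mean - self.delta
        self.cum_min = min(self.cum_min, self.cum)
        return self.cum - self.cum_min > self.threshold


def first_alarm(stream, **kwargs):
    """Index of the first observation that triggers an alarm, or None."""
    detector = PageHinkley(**kwargs)
    for i, x in enumerate(stream):
        if detector.update(x):
            return i
    return None
```

A stream whose mean stays constant never raises an alarm, while a sustained jump in the monitored score raises one a few observations after the change; `threshold` trades detection delay against false alarms.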

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

Archivo Digital UPM

Concept drift and machine learning model for detecting fraudulent transactions in streaming environment

Author: Patil Rudragoud
Shahapurkar Arati
Publication venue: 'Institute of Advanced Engineering and Science'
Publication date: 01/10/2023

In a streaming environment, data is generated and processed continuously, and fraudulent transactions must be detected quickly to prevent significant financial losses. Hence, this paper proposes a machine learning-based approach for detecting fraudulent transactions in a streaming environment, with a focus on addressing concept drift. The approach utilizes the extreme gradient boosting (XGBoost) algorithm. Additionally, the approach employs four algorithms for detecting continuous stream drift. To evaluate the effectiveness of the approach, two datasets are used: a credit card dataset and a Twitter dataset containing financial fraud-related social media data. The approach is evaluated using cross-validation, and the results demonstrate that it outperforms traditional machine learning models in terms of accuracy, precision, and recall, and is more robust to concept drift. The proposed approach can be utilized as a real-time fraud detection system in various industries, including finance, insurance, and e-commerce.

Institute of Advanced Engineering and Science

Don’t Pay for Validation: Detecting Drifts from Unlabeled data Using Margin Density

Author: Kantardzic Mehmed
Sethi Tegjyot Singh
Publication venue: The Authors. Published by Elsevier B.V.
Publication date: 31/12/2015

Validating online stream classifiers has traditionally assumed the availability of labeled samples, which can be monitored over time to detect concept drift. However, labeling in streaming domains is expensive, time-consuming and, in certain applications such as land mine detection, not possible at all. In this paper, the Margin Density Drift Detection (MD3) approach is proposed, which can signal change using unlabeled samples and requires labeling only for retraining in the event of a drift. The MD3 approach, when evaluated on 5 synthetic and 5 real-world drifting data streams, produced classification accuracy statistically equivalent to that of a fully labeled, accuracy-tracking drift detector, and required only a third of the samples to be labeled, on average.
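The notion of margin density can be illustrated with a short sketch: the fraction of unlabeled samples whose classifier score falls inside the margin is tracked, and a large change relative to a reference window is treated as a sign of possible drift. This is a simplified illustration, not the MD3 procedure itself; the function names, the fixed `tolerance` rule, and the assumption that `scores` are signed decision values of a margin-based classifier (with the margin at `|score| <= 1`) are all hypothetical.

```python
def margin_density(scores, margin=1.0):
    """Fraction of samples whose decision score lies inside the margin."""
    return sum(1 for s in scores if abs(s) <= margin) / len(scores)


def drift_suspected(reference_scores, window_scores, tolerance=0.1, margin=1.0):
    """Flag possible drift when margin density moves away from the reference.

    Assumption: a fixed absolute tolerance is used as the decision rule;
    no labels are needed, only the classifier's scores.
    """
    reference = margin_density(reference_scores, margin)
    current = margin_density(window_scores, margin)
    return abs(current - reference) > tolerance
```

Labels would be requested only after such a flag is raised, which is how the approach described above avoids paying for continuous validation.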

Elsevier - Publisher Connector

Mining Butterflies in Streaming Graphs

Author: Sheshbolouki Aida
Publication venue: 'University of Waterloo'
Publication date: 15/05/2023

This thesis introduces two main-memory systems, sGrapp and sGradd, for performing the fundamental analytic tasks of biclique counting and concept drift detection over a streaming graph. A data-driven heuristic is used to architect the systems. To this end, initially, the growth patterns of bipartite streaming graphs are mined and the emergence principles of streaming motifs are discovered. Next, the discovered principles are (a) explained by a graph generator called sGrow; and (b) utilized to establish the requirements for efficient, effective, explainable, and interpretable management and processing of streams. sGrow is used to benchmark stream analytics, particularly in the case of concept drift detection. sGrow displays robust realization of streaming growth patterns independent of initial conditions, scale and temporal characteristics, and model configurations. Extensive evaluations confirm the simultaneous effectiveness and efficiency of sGrapp and sGradd. sGrapp achieves a mean absolute percentage error of up to 0.05/0.14 for the cumulative butterfly count in streaming graphs with uniform/non-uniform temporal distribution and a processing throughput of 1.5 million data records per second. The throughput and estimation error of sGrapp are 160x higher and 0.02x lower than baselines. sGradd demonstrates improving performance over time, achieves zero false detection rates when there is no drift and when drift is already detected, and detects sequential drifts in zero to a few seconds after their occurrence, regardless of drift intervals.

University of Waterloo's Institutional Repository

A survey on detecting healthcare concept drift in AI/ML models from a finance perspective

Author: Abdul Razak M. S.
Hassan Fareed M. Lahza
Husam Lahza
Nirmala C. R.
Sreenivasa B. R.
Publication venue: 'Frontiers Media SA'
Publication date: 01/04/2023

Data is highly significant in today's digital age because it records the facts and figures of our everyday transactions. Data no longer arrives in a static form; it now arrives in a streaming fashion. Data streams are unbounded, continuous, and rapid flows of data. The healthcare industry is a major generator of data streams. Processing data streams is extremely complex due to factors such as volume, velocity, and variety. Data stream classification is difficult owing to concept drift. Concept drift occurs in supervised learning when the statistical properties of the target variable that the model predicts change unexpectedly. In this survey, we focus on the various forms of concept drift in healthcare data streams and outline the existing statistical and machine learning methodologies for dealing with concept drift. We also emphasize the use of deep learning algorithms for concept drift detection and describe the various healthcare datasets used for concept drift detection in data stream classification.

Directory of Open Access Journals