5 research outputs found

Combining univariate approaches for ensemble change detection in multivariate data

Author: Faithfull William
Kuncheva Ludmila
Rodriguez Juan
Publication date: 01/01/2019

Combining univariate approaches for ensemble change detection in multivariate data

Author: Faithfull William J.
Kuncheva Ludmila I.
Rodríguez Diez Juan José
Publication venue: 'Elsevier BV'
Publication date: 01/01/2021

Detecting change in multivariate data is a challenging problem, especially when class labels are not available. There is a large body of research on univariate change detection, notably in control charts developed originally for engineering applications. We evaluate univariate change detection approaches (including those in the MOA framework) built into ensembles where each member observes a feature in the input space of an unsupervised change detection problem. We present a comparison between the ensemble combinations and three established ‘pure’ multivariate approaches over 96 data sets, and a case study on the KDD Cup 1999 network intrusion detection dataset. We found that ensemble combination of univariate methods consistently outperformed multivariate methods on the four experimental metrics. Funding: project RPG-2015-188 funded by The Leverhulme Trust, UK; Spanish Ministry of Economy and Competitiveness through project TIN 2015-67534-P; and the Spanish Ministry of Education, Culture and Sport through Mobility Grant PRX16/00495. The 96 datasets were originally curated for use in the work of Fernández-Delgado et al. [53] and accessed from the personal web page of the author. The KDD Cup 1999 dataset used in the case study was accessed from the UCI Machine Learning Repository [10
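
The idea of an ensemble in which each member observes a single feature can be sketched as follows. This is a minimal illustration, not the authors' method: the Shewhart-style control chart, the 3-sigma limit, the majority-vote rule, and all class and parameter names (`ShewhartDetector`, `UnivariateEnsemble`, `vote_threshold`) are assumptions chosen for brevity.

```python
from statistics import mean, stdev


class ShewhartDetector:
    """Univariate control chart (assumed detector type): signals when a value
    falls outside mean +/- k * sigma of a reference window."""

    def __init__(self, reference, k=3.0):
        self.mu = mean(reference)
        self.sigma = stdev(reference)
        self.k = k

    def signals(self, x):
        return abs(x - self.mu) > self.k * self.sigma


class UnivariateEnsemble:
    """One detector per feature; change is declared when the fraction of
    signalling members exceeds vote_threshold (an assumed combination rule)."""

    def __init__(self, reference_rows, k=3.0, vote_threshold=0.5):
        columns = list(zip(*reference_rows))  # transpose rows into per-feature columns
        self.members = [ShewhartDetector(col, k) for col in columns]
        self.vote_threshold = vote_threshold

    def detect(self, row):
        votes = sum(d.signals(x) for d, x in zip(self.members, row))
        return votes / len(self.members) > self.vote_threshold


# Hypothetical reference window with three features.
reference = [[0.0, 10.0, 5.0], [0.2, 10.1, 5.1], [-0.1, 9.9, 4.9], [0.1, 10.0, 5.0]]
ensemble = UnivariateEnsemble(reference)
print(ensemble.detect([0.05, 10.0, 5.0]))  # observation close to the reference data
print(ensemble.detect([3.0, 14.0, 5.0]))   # two of the three features shifted
```

No class labels are used anywhere in the sketch, which matches the unsupervised setting the abstract describes; any other univariate detector could be substituted for the control chart.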

Repositorio Institucional de la Universidad de Burgos

Comparison of Imputation and Support Vector Regression Parameters for Weather Forecasting (PERBANDINGAN IMPUTASI DAN PARAMETER SUPPORT VECTOR REGRESSION UNTUK PERAMALAN CUACA)

Author: Cholidhazia Putri
Priyatno Arif Mudi
Syuhada Fahmi
Wiratmo Agung
Publication venue: 'Universitas Muria Kudus'
Publication date: 29/11/2019

Rainfall is important information for transportation, agriculture, industry, and other fields. Knowing the rainfall allows appropriate action to be taken in these fields, so that no losses arise from errors in rainfall information. This paper aims to find a suitable method for rainfall forecasting with respect to the imputation method used in data preprocessing and the parameter values of Support Vector Regression (SVR). The experiments identify the best imputation preprocessing method for use with SVR according to Mean Squared Error (MSE) and Mean Absolute Error (MAE). By MSE, k-nearest neighbor is the best imputation method; the data preprocessed this way was used in experiments with a polynomial SVR with C 1000, tolerance 0.001, epsilon 0.01, and unlimited iterations. By MAE, on the other hand, an Artificial Neural Network (ANN) is the best imputation method; the ANN-imputed data was tested with a radial basis function (RBF) kernel SVR with gamma 0.001, C 1000, tolerance 0.001, and unlimited iterations
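
As a rough illustration of two ingredients named in the abstract above, the sketch below shows k-nearest-neighbor imputation and the MSE and MAE error measures. It does not reproduce the paper's pipeline or its SVR models; the function names, the distance definition (root mean squared difference over jointly observed features), the choice of k, and the toy data are all assumptions.

```python
import math


def knn_impute(rows, k=2):
    """Fill None entries with the mean of that feature over the k nearest rows.

    Distance is computed only over features observed in both rows (an assumption).
    """
    def distance(a, b):
        shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
        if not shared:
            return math.inf
        return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

    filled = [list(r) for r in rows]  # copy so the input is left untouched
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Donor rows must have feature j observed.
            donors = [r for n, r in enumerate(rows) if n != i and r[j] is not None]
            donors.sort(key=lambda r: distance(row, r))
            nearest = donors[:k]
            if nearest:  # leave the gap if no donor exists
                filled[i][j] = sum(r[j] for r in nearest) / len(nearest)
    return filled


def mse(y_true, y_pred):
    """Mean Squared Error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mae(y_true, y_pred):
    """Mean Absolute Error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical data: the last row is missing its second feature.
rows = [[1.0, 2.0], [1.1, 2.2], [5.0, 10.0], [1.05, None]]
print(knn_impute(rows, k=2)[3])
print(mse([1, 2, 3], [1, 2, 5]), mae([1, 2, 3], [1, 2, 5]))
```

Because MSE squares each error, it penalizes large deviations more heavily than MAE does, which is one plausible reason the two measures can favor different imputation methods.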

E-Journal Universitas Muria Kudus

Solving the challenges of concept drift in data stream classification.

Author: Hu Hanqing
Publication venue: ThinkIR: The University of Louisville's Institutional Repository
Publication date: 01/08/2022

The rise of network-connected devices and applications has led to a significant increase in the volume of data that are continuously generated over time, called data streams. In real-world applications, storing the entirety of a data stream for later analysis is often impractical because of the stream’s potentially infinite volume. Data stream mining techniques and frameworks are therefore created to analyze streaming data as they arrive. However, compared to traditional data mining techniques, challenges unique to data stream mining also emerge, due to the high arrival rate of data streams and their dynamic nature. In this dissertation, an array of techniques and frameworks is presented to improve the solutions to some of these challenges. First, this dissertation acknowledges that a “no free lunch” theorem exists for data stream mining: no silver-bullet solution can solve all problems of data stream mining. The dissertation focuses on detecting changes of data distribution in data stream mining. These changes are called concept drift. Concept drift can be categorized into many types. A detection algorithm often works only on some types of drift, but not all of them. Because of this, the dissertation finds specific techniques to solve specific challenges, instead of looking for a general solution. Then, this dissertation considers improving solutions for the challenges posed by the high arrival rate of data streams. Data stream mining frameworks often need to process a vast amount of data samples in limited time. Some data mining activities, notably data sample labeling for classification, are too costly or too slow at such a large scale. This dissertation presents two techniques that reduce the amount of labeling needed for data stream classification. The first is a grid-based label selection process that applies to highly imbalanced data streams. In such data streams, samples of one class vastly outnumber those of another.
Because of the imbalance, many majority class samples need to be labeled before a minority class sample can be found. The presented technique divides the data samples into groups, called grids, and actively searches for minority class samples that are close by within a grid. Experimental results show that the technique can reduce the total number of data samples that need to be labeled. The second is a smart preprocessing technique that reduces the number of times a new learning model needs to be trained because of concept drift. Less model training means fewer data labels are required, and thus lower cost. Experimental results show that in some cases the reduced performance of learning models is the result of improper preprocessing of the data, not of concept drift. By adapting preprocessing to the changes in data streams, models can retain high performance without retraining. Acknowledging the high cost of labeling, the dissertation then considers the scenario where labels are unavailable when needed. The framework Sliding Reservoir Approach for Delayed Labeling (SRADL) is presented to explore solutions to this problem. SRADL addresses the delayed labeling problem, where concept drift occurs and no labels are immediately available. SRADL uses semi-supervised learning, employing a sliding-window approach to store historical data, which is combined with new unlabeled data to train new models. Experiments show that SRADL performs well in some cases of delayed labeling. Next, the dissertation considers improving solutions for the challenge of dynamism within data streams, most notably concept drift. The complex nature of concept drift means that most existing detection algorithms can detect only limited types of concept drift. To detect more types of concept drift, an ensemble approach that employs various algorithms, called Heuristic Ensemble Framework for Concept Drift Detection (HEFDD), is presented.
The occurrence of each type of concept drift is voted on using the detection results of each algorithm in the ensemble. Types of concept drift whose votes exceed a majority are then declared detected. Experimental results show that HEFDD is able to improve detection accuracy significantly while reducing false positives. With the ability to detect various types of concept drift provided by HEFDD, the dissertation then tries to improve the delayed labeling framework SRADL. A new combined framework, SRADL-HEFDD, is presented, which produces synthetic labels to handle the unavailability of labels from human experts. SRADL-HEFDD employs different synthetic labeling techniques based on the types of drift detected by HEFDD. Experimental results show that, compared to the default SRADL, the combined framework improves prediction performance when only a small number of labeled samples is available. Finally, as machine learning applications are increasingly used in critical domains such as medical diagnostics, the accountability, explainability, and interpretability of machine learning algorithms need to be considered. Explainable machine learning aims to use a white-box approach for data analytics, which enables learning models to be explained and interpreted by human users. However, few studies have been done on explaining what has changed in a dynamic data stream environment. This dissertation thus presents the Data Stream Explainability (DSE) framework. DSE visualizes changes in data distribution and model classification boundaries between chunks of streaming data. The visualizations can then be used by a data mining researcher to generate explanations of what has changed within the data stream. To show that DSE can help average users understand data stream mining better, a survey was conducted with an expert group and a non-expert group of users. Results show that DSE can reduce the gap between the two groups in understanding what changed in data stream mining
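
The per-type majority vote described for HEFDD can be sketched as below. This is only an illustration of the voting rule stated in the abstract, not the dissertation's implementation: the function name `vote_on_drift`, the representation of each detector's output as a set of labels, and the drift-type labels themselves are hypothetical.

```python
from collections import Counter


def vote_on_drift(detector_outputs):
    """Return the drift types reported by a strict majority of detectors.

    detector_outputs: one set of drift-type labels per detector in the ensemble
    (an assumed representation; an empty set means the detector reported nothing).
    """
    votes = Counter(label for output in detector_outputs for label in set(output))
    majority = len(detector_outputs) / 2
    return {label for label, count in votes.items() if count > majority}


# Hypothetical outputs from five detectors.
outputs = [{"sudden"}, {"sudden", "gradual"}, set(), {"sudden"}, {"gradual"}]
print(vote_on_drift(outputs))  # "sudden" has 3 of 5 votes, "gradual" only 2
```

Requiring a strict majority per drift type means a type reported by only a minority of detectors is ignored, which is consistent with the abstract's claim of fewer false positives.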

University of Louisville