Incremental learning of concept drift from imbalanced data
Learning from data sampled from a nonstationary distribution has been shown to be a very challenging problem in machine learning, because the joint probability distribution between the data and classes evolves over time. Learners must therefore adapt their knowledge base, including their structure or parameters, to remain strong predictors. This phenomenon of learning from an evolving data source is akin to learning how to play a game while the rules of the game are being changed, and it is traditionally referred to as learning concept drift. Climate data, financial data, epidemiological data, and spam detection are examples of applications that give rise to concept drift problems. An additional challenge arises when the classes to be learned are not represented (approximately) equally in the training data, as most machine learning algorithms work well only when the class distributions are balanced. However, rare categories are common in real-world applications, which leads to skewed or imbalanced datasets. Fraud detection, rare disease diagnosis, and anomaly detection are examples of applications that feature imbalanced datasets, where data from one category are severely underrepresented. Concept drift and class imbalance are traditionally addressed separately in machine learning, yet data streams can experience both phenomena. This work introduces Learn++.NIE (nonstationary & imbalanced environments) and Learn++.CDS (concept drift with SMOTE) as two new members of the Learn++ family of incremental learning algorithms that explicitly and simultaneously address both phenomena. The former addresses concept drift and class imbalance through modified bagging-based sampling and by replacing a class-independent error weighting mechanism, which normally favors the majority class, with a set of measures that emphasize good predictive accuracy on all classes. The latter integrates Learn++.NSE, an algorithm for concept drift, with the synthetic sampling method known as SMOTE, to cope with class imbalance. This research also includes a thorough evaluation of Learn++.CDS and Learn++.NIE on several real and synthetic datasets and on several figures of merit, showing that both algorithms are able to learn in some of the most difficult learning environments.
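To make the synthetic sampling idea concrete, the sketch below generates new minority-class points by interpolating between a minority sample and one of its nearest minority neighbours, which is the core step of SMOTE. It is a minimal illustration only, not the Learn++.CDS implementation; the function name `smote_like_oversample`, its parameters, and the toy data are assumptions made for this example.

```python
import math
import random


def smote_like_oversample(minority, n_new, k=3, seed=0):
    """Return n_new synthetic points interpolated between minority samples.

    Hypothetical helper: each synthetic point lies on the segment joining a
    randomly chosen minority sample and one of its k nearest minority
    neighbours.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by Euclidean distance to the base.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Toy minority class: the four corners of the unit square (assumed data).
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote_like_oversample(minority, n_new=6)
```

Because every synthetic point is a convex combination of two existing minority samples, it stays inside the region those samples already span, which is why the method adds density to the minority class without copying points verbatim.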
Covariate shift detection-based nonstationary adaptation in motor-imagery-based brain–computer interface
Nonstationary learning refers to a process that learns patterns from data, adapts to shifts, and improves the performance of the system with experience while operating in nonstationary environments (NSEs). Covariate shift (CS) presents a major challenge for data processing within NSEs, wherein the input-data distribution shifts during the transition from the training phase to the testing phase. CS is one of the fundamental issues in electroencephalogram (EEG)-based brain-computer interface (BCI) systems and can often be observed across multiple trials of EEG data recorded over different sessions. Conventional learning algorithms struggle to accommodate these CSs in streaming EEG data, resulting in low performance (in terms of classification accuracy) of motor imagery (MI)-related BCI systems. This chapter introduces a novel framework for nonstationary adaptation in MI-related BCI systems based on CS detection applied to the temporally and spatially filtered features extracted from raw EEG signals. Taken together, the chapter provides an efficient method for accounting for nonstationarity in EEG data during learning in NSEs.
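The abstract does not say which statistical test the framework uses, so the following is only a sketch of one common way to flag a shift in an input feature: an exponentially weighted moving average (EWMA) control chart whose limits are estimated from training data. The function name `ewma_shift_detector`, the parameters `lam` and `width`, and the numbers are all assumptions for illustration.

```python
import math
import statistics


def ewma_shift_detector(train_values, stream_values, lam=0.2, width=3.0):
    """Return indices of stream samples at which the EWMA leaves its limits.

    Hypothetical helper: the mean and standard deviation come from the
    training feature values; the stream is then smoothed with factor `lam`
    and compared against time-varying control limits.
    """
    mu = statistics.fmean(train_values)
    sigma = statistics.stdev(train_values)
    z = mu  # the EWMA starts at the training mean
    alarms = []
    for t, x in enumerate(stream_values, start=1):
        z = lam * x + (1 - lam) * z
        # Standard deviation of the EWMA statistic after t observations.
        spread = sigma * math.sqrt(lam / (2 - lam) * (1 - (1 - lam) ** (2 * t)))
        if abs(z - mu) > width * spread:
            alarms.append(t - 1)
    return alarms


# Assumed toy feature: centred on 0 in training, jumps to 1.0 mid-stream.
train = [0.0, 0.1, -0.1, 0.05, -0.05, 0.0, 0.1, -0.1]
stream = [0.0, 0.05, -0.05, 0.0, 0.05, -0.05, 0.0, 0.05, -0.05, 0.0] + [1.0] * 10
alarms = ewma_shift_detector(train, stream)
```

In this toy run the first alarm falls on the first shifted sample (index 10). A real BCI pipeline would apply such a test to filtered EEG features and would need to choose `lam` and `width` to trade detection delay against false alarms.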
A Survey on Semi-Supervised Learning for Delayed Partially Labelled Data Streams
Unlabelled data appear in many domains and are particularly relevant to
streaming applications, where even though data is abundant, labelled data is
rare. To address the learning problems associated with such data, one can
ignore the unlabelled data and focus only on the labelled data (supervised
learning); use the labelled data and attempt to leverage the unlabelled data
(semi-supervised learning); or assume some labels will be available on request
(active learning). The first approach is the simplest, yet the amount of
labelled data available will limit the predictive performance. The second
relies on finding and exploiting the underlying characteristics of the data
distribution. The third depends on an external agent to provide the required
labels in a timely fashion. This survey pays special attention to methods that
leverage unlabelled data in a semi-supervised setting. We also discuss the
delayed labelling issue, which impacts both fully supervised and
semi-supervised methods. We propose a unified problem setting, discuss the
learning guarantees and existing methods, and explain the differences between
related problem settings. Finally, we review the current benchmarking practices
and propose adaptations to enhance them.
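The delayed labelling issue can be illustrated with a small simulation in which each instance is predicted on arrival but its label only becomes available for training a fixed number of steps later. This is a toy sketch, not the survey's evaluation protocol: the function `run_delayed_stream`, the majority-class model (which ignores the features), and the stream are assumptions.

```python
from collections import Counter, deque


def run_delayed_stream(stream, delay):
    """Return accuracy when the label of the instance seen at step t is
    revealed at step t + delay (delay >= 1).

    Hypothetical helper: the "model" is a majority-class predictor that is
    updated only with labels that have already been revealed.
    """
    if delay < 1:
        raise ValueError("delay must be at least 1")
    pending = deque()  # (release_step, label) pairs awaiting their label
    counts = Counter()  # toy model state: label frequencies seen so far
    correct = 0
    total = 0
    for t, (_x, y) in enumerate(stream):
        # Train on every label whose delay has elapsed.
        while pending and pending[0][0] <= t:
            _, released_label = pending.popleft()
            counts[released_label] += 1
        # Predict before the true label of this instance is known.
        prediction = counts.most_common(1)[0][0] if counts else None
        correct += int(prediction == y)
        total += 1
        pending.append((t + delay, y))
    return correct / total


# Assumed toy stream: ten instances that all belong to class "a".
stream = [(i, "a") for i in range(10)]
acc_short = run_delayed_stream(stream, delay=1)
acc_long = run_delayed_stream(stream, delay=3)
```

Even on this trivial stream the longer delay lowers accuracy (0.7 versus 0.9), because the model spends more steps with no label information at all; under concept drift the same lag also means the model is trained on labels that may already be outdated.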
Semi-Supervised Learning for Diagnosing Faults in Electromechanical Systems
Safe and reliable operation of electromechanical systems relies on online condition monitoring and diagnostic systems that aim to take immediate action when a fault occurs. Machine learning techniques are widely used for designing data-driven diagnostic models. Training a data-driven model usually requires a large amount of labeled data, which may not always be practical to obtain. This problem can be addressed with semi-supervised learning approaches, which enable decision making using only a small number of labeled samples coupled with a large number of unlabeled samples. Thus, it is crucial to conduct a critical study on the use of semi-supervised learning for fault diagnosis. Another issue of concern is fault diagnosis in non-stationary environments, where data streams evolve over time and, as a result, model-based models and most data-driven models are impractical. In this work, this issue is addressed by means of an adaptive data-driven diagnostic model.
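As an illustration of learning from a few labeled samples plus many unlabeled ones, the sketch below uses self-training with a nearest-centroid classifier: unlabeled points are pseudo-labeled only when one class centroid is clearly closer than the next. The abstract does not name the semi-supervised method used, so this technique, the function names, the `margin` threshold, and the "healthy"/"faulty" labels are all assumptions.

```python
import math


def centroids(points, labels):
    """Mean point of each class."""
    sums, counts = {}, {}
    for p, y in zip(points, labels):
        acc = sums.setdefault(y, [0.0] * len(p))
        for d, v in enumerate(p):
            acc[d] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: tuple(s / counts[y] for s in acc) for y, acc in sums.items()}


def self_train(labeled, unlabeled, margin=0.5, max_rounds=10):
    """Hypothetical self-training loop; needs at least two labeled classes.

    A point is pseudo-labeled when its second-nearest centroid is at least
    `margin` farther away than its nearest one.
    """
    points = [p for p, _ in labeled]
    labels = [y for _, y in labeled]
    pool = list(unlabeled)
    for _ in range(max_rounds):
        cents = centroids(points, labels)
        remaining = []
        added = 0
        for p in pool:
            ranked = sorted(cents, key=lambda y: math.dist(p, cents[y]))
            d_best = math.dist(p, cents[ranked[0]])
            d_second = math.dist(p, cents[ranked[1]])
            if d_second - d_best >= margin:
                points.append(p)
                labels.append(ranked[0])
                added += 1
            else:
                remaining.append(p)  # too ambiguous to pseudo-label
        pool = remaining
        if added == 0:
            break
    return centroids(points, labels), pool


# Assumed 1-D sensor feature with one labeled sample per condition.
labeled = [((0.0,), "healthy"), ((10.0,), "faulty")]
unlabeled = [(1.0,), (2.0,), (9.0,), (8.0,), (5.0,)]
cents, leftover = self_train(labeled, unlabeled)
```

Here the four clear-cut readings are absorbed into the two classes, while the ambiguous reading at 5.0 stays unlabeled. Keeping a confidence threshold of this kind is one simple guard against reinforcing early mistakes, which matters when only a handful of labels are available.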