
    SMOClust: Synthetic Minority Oversampling based on Stream Clustering for Evolving Data Streams

    Many real-world data stream applications suffer not only from concept drift but also from class imbalance. Yet, very few existing studies have investigated this joint challenge. Data difficulty factors, which have been shown to be key challenges in class-imbalanced data streams, are not taken into account by existing approaches when learning from class-imbalanced data streams. In this work, we propose a drift-adaptable oversampling strategy that synthesises minority-class examples based on stream clustering. The motivation is that stream clustering methods continuously update themselves to reflect the characteristics of the current underlying concept, including data difficulty factors. This property can potentially be used to compress past information without explicitly caching data in memory. Based on the compressed information, synthetic examples can be created within the regions that recently generated new minority-class examples. Experiments with artificial and real-world data streams show that the proposed approach handles concept drift involving different minority-class decompositions better than existing approaches, especially when the data stream is severely class imbalanced and presents high proportions of safe and borderline minority-class examples. Comment: 59 pages, 85 figures.
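    To make the idea concrete, here is a minimal, hypothetical sketch in Python of the synthesis step the abstract describes: minority-class examples are generated inside regions summarised by micro-clusters. The `MicroCluster` structure and `synthesise_minority` function are illustrative assumptions, not the paper's SMOClust implementation, which relies on a full stream-clustering method.

```python
import random

class MicroCluster:
    """Illustrative summary of a stream cluster: a centroid plus a rough
    radius of recently absorbed points (assumed structure, not SMOClust's)."""
    def __init__(self, centroid, radius):
        self.centroid = centroid  # list[float], feature-space centre
        self.radius = radius      # spread of recently absorbed points

def synthesise_minority(clusters, n_samples, rng=None):
    """Generate synthetic minority-class examples by perturbing the
    centroids of clusters that recently absorbed minority examples."""
    rng = rng or random.Random(0)
    samples = []
    for _ in range(n_samples):
        c = rng.choice(clusters)
        # Stay within the region the cluster currently represents.
        samples.append([x + rng.uniform(-c.radius, c.radius) for x in c.centroid])
    return samples

# Usage: two micro-clusters tracking recent minority examples in 2-D.
recent = [MicroCluster([0.2, 0.8], 0.05), MicroCluster([0.7, 0.3], 0.10)]
print(synthesise_minority(recent, 3))
```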

    Resampling-Based Ensemble Methods for Online Class Imbalance Learning

    Online class imbalance learning is a new learning problem that combines the challenges of both online learning and class imbalance learning. It deals with data streams having very skewed class distributions. This type of problem commonly exists in real-world applications, such as fault diagnosis of real-time control monitoring systems and intrusion detection in computer networks. In our earlier work, we defined class imbalance online, and proposed two learning algorithms, OOB and UOB, that build an ensemble model overcoming class imbalance in real time through resampling and time-decayed metrics. In this paper, we further improve the resampling strategy inside OOB and UOB, and look into their performance in both static and dynamic data streams. We give the first comprehensive analysis of class imbalance in data streams, in terms of data distributions, imbalance rates and changes in class imbalance status. We find that UOB is better at recognizing minority-class examples in static data streams, and OOB is more robust against dynamic changes in class imbalance status. The data distribution is a major factor affecting their performance. Based on the insight gained, we then propose two new ensemble methods that maintain both OOB and UOB with adaptive weights for final predictions, called WEOB1 and WEOB2. They are shown to possess the strengths of OOB and UOB, with good accuracy and robustness.
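    A minimal sketch of the resampling idea behind OOB and UOB, assuming base learners expose an incremental `train_on(x, y)` hook (an invented name): class sizes are tracked with a time-decayed average, and the Poisson rate of online bagging is scaled up for minority examples (OOB) or down for majority examples (UOB). This follows the abstract's description, not the published pseudocode.

```python
import math
import random

def poisson(lam, rng):
    """Knuth's method; an illustrative stand-in for numpy.random.poisson."""
    threshold, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1

class ImbalanceAwareOnlineBagging:
    """Hypothetical OOB/UOB-style wrapper (names are ours, not the paper's)."""
    def __init__(self, learners, mode="oob", theta=0.9, seed=0):
        self.learners = learners  # base models with a train_on(x, y) method
        self.mode = mode          # "oob" oversamples, "uob" undersamples
        self.theta = theta        # decay factor for class-size estimates
        self.w = {}               # time-decayed per-class sizes
        self.rng = random.Random(seed)

    def update(self, x, y):
        # Time-decayed class size: w_k <- theta * w_k + (1 - theta) * [y == k]
        for k in set(self.w) | {y}:
            self.w[k] = self.theta * self.w.get(k, 0.0) + (1 - self.theta) * (k == y)
        if self.mode == "oob":
            lam = max(self.w.values()) / self.w[y]  # boost minority rate
        else:
            lam = min(self.w.values()) / self.w[y]  # damp majority rate
        for model in self.learners:
            for _ in range(poisson(lam, self.rng)):
                model.train_on(x, y)  # assumed incremental-update hook
```

    Going by the abstract, WEOB1 and WEOB2 would then maintain both an OOB and a UOB instance and weight their votes adaptively for the final prediction; that layer is omitted from this sketch.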

    Next challenges for adaptive learning systems

    Learning from evolving streaming data has become a 'hot' research topic in the last decade, and many adaptive learning algorithms have been developed. This research was stimulated by rapidly growing amounts of industrial, transactional, sensor and other business data that arrive in real time and need to be mined in real time. Under such circumstances, constant manual adjustment of models is inefficient and, with increasing amounts of data, is becoming infeasible. Nevertheless, adaptive learning models are still rarely employed in business applications in practice. In the light of rapidly growing, structurally rich 'big data', a new generation of parallel computing solutions and cloud computing services, as well as recent advances in portable computing devices, this article aims to identify the key research directions to be taken to bring adaptive learning closer to application needs. We identify six forthcoming challenges in designing and building adaptive learning (prediction) systems: making adaptive systems scalable, dealing with realistic data, improving usability and trust, integrating expert knowledge, taking into account various application needs, and moving from adaptive algorithms towards adaptive tools. These challenges are critical for evolving stream settings, as the process of model building needs to be fully automated and continuous.

    Are 20% of files responsible for 80% of defects?

    Background: Over the past two decades, a mixture of anecdote from industry and empirical studies from academia has suggested that the 80:20 rule (otherwise known as the Pareto Principle) applies to the relationship between source code files and the number of defects in a system: a small minority of files (roughly 20%) are responsible for a majority of defects (roughly 80%). Aims: This paper aims to establish how widespread the phenomenon is by analysing 100 systems (previous studies have focussed on between one and three systems), with the goal of determining whether and under what circumstances this relationship holds, and whether the key files can be readily identified from basic metrics. Method: We devised a search criterion to identify defect fixes from commit messages and used this to analyse 100 active GitHub repositories, spanning a variety of languages and domains. We then studied the relationship between files, basic metrics (churn and LOC), and defect fixes. Results: We found that the Pareto Principle does hold, but only if defects that incur fixes to multiple files count as multiple defects. When we investigated multi-file fixes, we found that key files (belonging to the top 20%) are commonly fixed alongside other, much less frequently fixed files. We found LOC to be poorly correlated with defect proneness; code churn was a more reliable indicator, but only for extremely high values of churn. Conclusions: It is difficult to reliably identify the "most fixed" 20% of files from basic metrics. However, even if they could be reliably predicted, focussing on them would probably be misguided. Although fixes naturally involve files that are often involved in other fixes, they also tend to include other, less frequently fixed files.
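    As a rough illustration of the method, the sketch below (hypothetical; the paper's actual search criterion and analysis are more involved) counts, per file, how many defect-fixing commits touched it in a Git repository, then computes the share of fixes covered by the most-fixed 20% of files. As in the paper's per-file counting, a fix spanning several files contributes once to each file it touches.

```python
import re
import subprocess
from collections import Counter

# Assumed search criterion: commit subjects mentioning fix/bug/defect.
FIX_PATTERN = re.compile(r"\b(fix(e[sd])?|bug|defect)\b", re.IGNORECASE)

def defect_fix_counts(repo_path="."):
    """Count, per file, how many defect-fixing commits touched it."""
    log = subprocess.run(
        ["git", "-C", repo_path, "log", "--name-only", "--pretty=format:%x00%s"],
        capture_output=True, text=True, check=True).stdout
    counts = Counter()
    for entry in log.split("\x00")[1:]:  # one entry per commit
        lines = entry.strip().splitlines()
        if not lines:
            continue
        subject, files = lines[0], lines[1:]
        if FIX_PATTERN.search(subject):
            counts.update(f for f in files if f)  # skip blank separator lines
    return counts

def pareto_share(counts, file_fraction=0.2):
    """Fraction of all fixes covered by the most-fixed `file_fraction` of files."""
    ranked = sorted(counts.values(), reverse=True)
    top = ranked[:max(1, int(len(ranked) * file_fraction))]
    return sum(top) / sum(ranked)

counts = defect_fix_counts()
print(f"Top 20% of files account for {pareto_share(counts):.0%} of fixes")
```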

    A novel automated approach for software effort estimation based on data augmentation


    Diversity-based pool of models for dealing with recurring concepts
