Techniques to deal with imbalanced data in multi-class problems: A review of existing methods

Author: Vitor Miguel Saraiva Esteves
Publication date: 11/02/2020

Repositório Aberto da Universidade do Porto

Oversampling technique in student performance classification from engineering course

Author: Punlumjeak Wattana, Rachburee Nachirat
Publication venue: Institute of Advanced Engineering and Science
Publication date: 01/08/2021

The first year of an engineering degree is important for proper academic planning, and all first-year subjects are essential as an engineering foundation. Student performance prediction helps academics improve students' performance. Students can also check their performance themselves: if they are aware that their performance is low, they can work to improve it. This research focused on combining oversampling of minority-class data with various classifier models. The oversampling techniques were SMOTE, Borderline-SMOTE, SVMSMOTE, and ADASYN, and four classifiers were applied: MLP, gradient boosting, AdaBoost, and random forest. The results showed that Borderline-SMOTE gave the best minority-class prediction with several classifiers.
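
The abstract above names SMOTE-family oversampling without describing it. As a rough illustration only, the basic SMOTE idea is to create synthetic minority samples by interpolating between a minority sample and one of its nearest minority-class neighbours. The sketch below assumes plain Python, Euclidean distance, and made-up data; the function name `smote_like` and its parameters are hypothetical, and it is neither the authors' implementation nor the Borderline-SMOTE, SVMSMOTE, or ADASYN variants.

```python
import random

def smote_like(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority-class neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority samples by squared Euclidean distance to base.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, neighbour)))
    return synthetic

# Hypothetical minority-class samples (two features each).
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote_like(minority, n_new=4)
```

Each synthetic point lies on a segment between two existing minority samples, so it stays inside the region the minority class already occupies.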

ZENODO

Institute of Advanced Engineering and Science

Filter-GA Based Approach to Feature Selection for Classification

Author: Amit Saxena, Madan Madhaw Shrivas
Publication venue: Auricle Global Society of Education and Research
Publication date: 30/11/2017

This paper presents a new approach to selecting a reduced number of features in databases. Every database has a given number of features, but some of these features can be redundant or even harmful and can confuse the classification process. The proposed method applies a filter attribute measure and a binary-coded Genetic Algorithm to select a small subset of features. The importance of these features is judged by applying the K-nearest neighbor (KNN) classification method. The best reduced subset of features, the one with high classification accuracy on the given databases, is adopted. The classification accuracy obtained by the proposed method is compared with that reported in recent publications on twenty-eight databases. The proposed method performs satisfactorily on these databases and achieves higher classification accuracy with a smaller number of features.

International Journal on Future Revolution in Computer Science & Communication Engineering

Software Defect Prediction Using AWEIG+ADACOST Bayesian Algorithm for Handling High Dimensional Data and Class Imbalance Problem

Author: Christanto Febrian Wahyu, Indriyawati Henny, Suntoro Joko
Publication venue: Universitas Kristen Satya Wacana
Publication date: 31/10/2018

Software defect prediction is one of the most important parts of software engineering. It is defined as the process of predicting errors, failures, and system errors in software. Researchers use machine learning methods to predict software defects, including estimation, association, classification, clustering, and dataset analysis. The NASA Metrics Data Program (NASA MDP) datasets are among the software metrics datasets that researchers use to predict software defects. NASA MDP datasets contain unbalanced classes and high-dimensional data, which lower the classification evaluation results. In this research, unbalanced classes are handled with the AdaCost method and high-dimensional data with the Average Weight Information Gain (AWEIG) method, while the classification method is the Naïve Bayes algorithm. The proposed method is named AWEIG + AdaCost Bayesian. In the experiment, the AWEIG + AdaCost Bayesian algorithm is compared with the Naïve Bayes algorithm. The results showed that AWEIG + AdaCost Bayesian achieved a higher mean Area Under the Curve (AUC) than the plain Naïve Bayes algorithm: 0.752 versus 0.696.

Portal Jurnal Elektronik Universitas Kristen Satya Wacana (UKSW)

Gearbox fault diagnosis based on VMD-MSE and adaboost classifier

Author: Chen Lu, Dengwei Song, Jian Ma
Publication venue: JVE International Ltd.
Publication date: 21/10/2017

Accurate and efficient fault diagnosis is of great importance for gearboxes. This study proposed a fault diagnosis method based on variational mode decomposition (VMD) – multiscale entropy (MSE) and the AdaBoost algorithm. First, VMD is employed to decompose the raw signal in the time-frequency domain. Then, MSE is computed to generate the feature vectors. Finally, the AdaBoost-based classifier is trained, with several weak classifiers forming a strong classifier that performs the fault diagnosis. The feasibility and accuracy of the method are validated using data from the Prognostics and Health Management Society 2009 data challenge competition.

Empowering One-vs-One Decomposition with Ensemble Learning for Multi-Class Imbalanced Data

Author: García López Salvador, Herrera Triguero Francisco, Krawczyk Bartosz, Rosales-Pérez Alejandro, Zhang Zhongliang
Publication venue: Elsevier BV
Publication date: 01/04/2016

Zhongliang Zhang was supported by the National Science Foundation of China (NSFC Proj. 61273204) and the CSC Scholarship Program (CSC NO. 201406080059). Bartosz Krawczyk was supported by the Polish National Science Center under grant no. UMO-2015/19/B/ST6/01597. Salvador Garcia and Francisco Herrera were partially supported by the Spanish Ministry of Education and Science under Project TIN2014-57251-P and the Andalusian Research Plan P10-TIC-6858, P11-TIC-7765. Alejandro Rosales-Perez was supported by the CONACyT grant 329013.

Multi-class imbalance classification problems occur in many real-world applications and suffer from very different class distributions. Decomposition strategies are well-known techniques for addressing classification problems involving multiple classes. Among them, binary approaches using one-vs-one and one-vs-all have gained significant attention from the research community. They make it possible to divide multi-class problems into several easier-to-solve two-class sub-problems. In this study we develop an exhaustive empirical analysis to explore the possibility of empowering the one-vs-one scheme for multi-class imbalance classification problems by applying binary ensemble learning approaches. We examine several state-of-the-art ensemble learning methods proposed for addressing imbalance problems to solve the pairwise tasks derived from the multi-class data set. An aggregation strategy is then employed to combine the binary ensemble outputs and reconstruct the original multi-class task. We present a detailed experimental study of the proposed approach, supported by statistical analysis. The results indicate the high effectiveness of ensemble learning with the one-vs-one scheme in dealing with multi-class imbalance classification problems.
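
The one-vs-one scheme described in this abstract can be illustrated with a minimal sketch: train one binary model per pair of classes, then aggregate the pairwise decisions by majority vote. The code below is an assumption-laden toy, not the authors' method. It swaps in a nearest-centroid rule for each pairwise task where the paper uses imbalance-aware binary ensembles, and the data and the helper names (`train_ovo`, `predict_ovo`) are hypothetical.

```python
from collections import Counter
from itertools import combinations

def centroid(points):
    """Component-wise mean of a list of equal-length tuples."""
    n = len(points)
    return tuple(sum(c) / n for c in zip(*points))

def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def train_ovo(data):
    """data maps class label -> list of points.
    Returns one (centroid_a, centroid_b) binary model per class pair."""
    models = {}
    for a, b in combinations(sorted(data), 2):
        models[(a, b)] = (centroid(data[a]), centroid(data[b]))
    return models

def predict_ovo(models, x):
    """Each pairwise model votes for one of its two classes; most votes wins."""
    votes = Counter()
    for (a, b), (ca, cb) in models.items():
        votes[a if sq_dist(x, ca) <= sq_dist(x, cb) else b] += 1
    return votes.most_common(1)[0][0]

# Hypothetical imbalanced three-class data set (4, 2, and 1 samples).
data = {
    "A": [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (0.3, 0.2)],
    "B": [(5.0, 5.0), (5.2, 4.9)],
    "C": [(0.0, 5.0)],
}
models = train_ovo(data)
```

With three classes this yields three pairwise models. Simple voting is only one possible aggregation strategy, and ties here fall to whichever class received its first vote earliest.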

Repositorio Institucional Universidad de Granada

Light Gradient Boosting with Hyper Parameter Tuning Optimization for COVID-19 Prediction

Author: Ferda Ernawan, Kartika Handayani, Mohammad Fakhreldin, Yagoub Abbker
Publication venue: The Science and Information (SAI) Organization Limited
Publication date: 01/01/2022

The 2019 coronavirus disease (COVID-19) caused a pandemic and a huge number of deaths worldwide. COVID-19 screening is needed to identify whether suspected cases are positive, and it can reduce the spread of COVID-19. The polymerase chain reaction (PCR) test for COVID-19 analyzes a respiratory specimen. A blood test can also be used to show which people have been infected with SARS-CoV-2. In addition, age contributes to susceptibility to COVID-19 transmission. This paper presents extra trees classification with random over-sampling, considering blood and age parameters for COVID-19 screening. This research proposes enhanced data preprocessing using KNN Imputer to handle large numbers of missing values. The experiments evaluated existing classification methods such as Random Forest, Extra Trees, AdaBoost, and Gradient Boosting, along with the proposed Light Gradient Boosting with hyperparameter tuning, to measure the predictions for patients infected with SARS-CoV-2. The experiments used Albert Einstein Hospital test data from Brazil, consisting of 5,644 samples from 559 patients infected with SARS-CoV-2. The experimental results show that the proposed scheme achieves an accuracy of about 98.58%, a recall of 98.58%, a precision of 98.61%, an F1-score of 98.61%, and an AUC of 0.9682.

UMP Institutional Repository

A Powerful Paradigm for Cardiovascular Risk Stratification Using Multiclass, Multi-Label, and Ensemble-Based Machine Learning Paradigms: A Narrative Review

Author: Bhagawati Mrinalini, Faa Gavino, Johri Amer M., Kalra Manudeep, Khanna Narendra N., Kitas George D., Laird John R., Paraskevas Kosmas I., Paul Sudip, Protogerou Athanasios D., Ruzsa Zoltán, Saba Luca, Saksena Sanjay, Sfikakis Petros P., Sharma Aditya M., Suri Jasjit S.
Publication venue: MDPI AG
Publication date: 01/01/2022

Background and Motivation: Cardiovascular disease (CVD) causes the highest mortality globally. With escalating healthcare costs, early non-invasive CVD risk assessment is vital. Conventional methods have shown poor performance compared to more recent and fast-evolving Artificial Intelligence (AI) methods. The proposed study reviews the three most recent paradigms for CVD risk assessment, namely multiclass, multi-label, and ensemble-based methods in (i) office-based and (ii) stress-test laboratories. Methods: A total of 265 CVD-based studies were selected using the preferred reporting items for systematic reviews and meta-analyses (PRISMA) model. Given their popularity and recent development, the study analyzed the above three paradigms using machine learning (ML) frameworks. We comprehensively review these three methods using attributes such as architecture, applications, pros and cons, scientific validation, clinical evaluation, and AI risk-of-bias (RoB) in the CVD framework. These ML techniques were then extended under mobile and cloud-based infrastructure. Findings: The most popular biomarkers used were office-based, laboratory-based, image-based phenotypes, and medication usage. Surrogate carotid scanning for coronary artery risk prediction has shown promising results. Ground truth (GT) selection for AI-based training, along with scientific and clinical validation, is very important for CVD stratification to avoid RoB. The most popular classification paradigm was multiclass, followed by ensemble and multi-label. The use of deep learning techniques in CVD risk stratification is at a very early stage of development. Mobile and cloud-based AI technologies are more likely to be the future. Conclusions: AI-based methods for CVD risk assessment are most promising and successful. The choice of GT is most vital in AI-based models to prevent RoB. The amalgamation of image-based strategies with conventional risk factors provides the highest stability when using the three CVD paradigms in non-cloud and cloud-based frameworks.

SZTE Publicatio Repozitórium - SZTE - Repository of Publications

Archivio istituzionale della ricerca - Università di Cagliari