
464 research outputs found

ROSEFW-RF: the winner algorithm for the ECBDL’14 big data competition: an extremely imbalanced big data bioinformatics problem

Author: Alpaydin
Bacardit
Batista
Blagus
Breiman
Caruana
Chen
Cheng
Cover
Das
Dean
del Río
Fernández
Francisco Herrera
Galar
Isaac Triguero
Jaume Bacardit
Jolliffe
Jones
José M. Benítez
Kambatla
Krawczyk
Larrañaga
López
Mamitsuka
Monastyrskyy
Neri
Palit
Punta
Qi
Saeys
Sara del Río
Stout
Triguero
Victoria López
White
Wu
Zhang
Publication venue: 'Elsevier BV'
Publication date: 01/01/2015

The application of data mining and machine learning techniques to biological and biomedical data continues to be a ubiquitous research theme in current bioinformatics. The rapid advances in biotechnology are allowing us to obtain and store large quantities of data about cells, proteins, genes, etc., that should be processed. Moreover, in many of these problems, such as contact map prediction, the problem tackled in this paper, it is difficult to collect representative positive examples. Learning under these circumstances, known as imbalanced big data classification, may not be straightforward for most standard machine learning methods. In this work we describe the methodology that won the ECBDL’14 big data challenge for a bioinformatics big data problem. This algorithm, named ROSEFW-RF, is based on several MapReduce approaches to (1) balance the class distribution through random oversampling, (2) detect the most relevant features via an evolutionary feature weighting process and a threshold to choose them, (3) build an appropriate Random Forest model from the pre-processed data and finally (4) classify the test data. Throughout the paper, we detail and analyze the decisions made during the competition, presenting an extensive experimental study that characterizes how our methodology works. From this analysis we can conclude that this approach is very suitable for tackling large-scale bioinformatics classification problems.
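Steps (1) and (2) of the abstract above can be illustrated with a minimal single-machine sketch. It is not the authors' MapReduce implementation: the function names, the toy data, and the feature weights are all hypothetical, and the weights are simply given here rather than produced by an evolutionary process.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen rows of smaller classes until all classes
    are as large as the largest one (assumed single-machine stand-in)."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        pool = [r for r, y in zip(rows, labels) if y == cls]
        for _ in range(target - n):
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels


def select_features(weights, threshold):
    """Keep the indices of features whose weight reaches the threshold."""
    return [i for i, w in enumerate(weights) if w >= threshold]


# Toy data: three negative examples, one positive example.
rows = [[0.1, 5.0], [0.2, 4.0], [0.3, 6.0], [0.9, 5.5]]
labels = [0, 0, 0, 1]
balanced_rows, balanced_labels = random_oversample(rows, labels)

# Hypothetical feature weights; only feature 0 passes the threshold.
kept = select_features([0.8, 0.1], threshold=0.5)
```

After oversampling, both classes have three examples; the balanced data restricted to the kept features would then be passed to a Random Forest learner, which is omitted here.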

Nottingham ePrints

Nottingham eTheses

Crossref

Repository@Nottingham

Ghent University Academic Bibliography

Repositorio Institucional Universidad de Granada

CUSBoost: Cluster-based Under-sampling with Boosting for Imbalanced Classification

Author: Ahmed Sajid
Farid Dewan Md.
Jani Md. Rafsan
Mahbub Asif
Rayhan Farshid
Shatabda Swakkhar
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 12/12/2017

Class imbalance classification is a challenging research problem in data mining and machine learning, as most real-life datasets are imbalanced in nature. Existing learning algorithms maximise classification accuracy by correctly classifying the majority class, but misclassify the minority class. However, in real-life applications the minority class instances represent the concept of greater interest than the majority class instances. Recently, several techniques based on sampling methods (under-sampling the majority class and over-sampling the minority class), cost-sensitive learning methods, and ensemble learning have been used in the literature for classifying imbalanced datasets. In this paper, we introduce a new clustering-based under-sampling approach with a boosting (AdaBoost) algorithm, called CUSBoost, for effective imbalanced classification. The proposed algorithm provides an alternative to the RUSBoost (random under-sampling with AdaBoost) and SMOTEBoost (synthetic minority over-sampling with AdaBoost) algorithms. We evaluated the performance of the CUSBoost algorithm against state-of-the-art ensemble learning methods such as AdaBoost, RUSBoost, and SMOTEBoost on 13 imbalanced binary and multi-class datasets with various imbalance ratios. The experimental results show that CUSBoost is a promising and effective approach for dealing with highly imbalanced datasets. Comment: CSITSS-201
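The clustering-based under-sampling idea in the abstract above can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not say which clustering algorithm or sampling fraction is used, so cluster assignments are taken as given (they could come from k-means, for instance), the `fraction` parameter is hypothetical, and the boosting step is omitted.

```python
import random
from collections import defaultdict


def cluster_undersample(majority_rows, cluster_ids, fraction=0.5, seed=0):
    """Sample a fixed fraction of majority rows from every cluster so that
    each region of the majority class stays represented (assumed scheme)."""
    rng = random.Random(seed)
    clusters = defaultdict(list)
    for row, cid in zip(majority_rows, cluster_ids):
        clusters[cid].append(row)
    sampled = []
    for cid in sorted(clusters):
        members = clusters[cid]
        k = max(1, round(len(members) * fraction))  # keep at least one row
        sampled.extend(rng.sample(members, k))
    return sampled


# Toy majority class with two well-separated groups (hypothetical data).
majority = [[0.0], [0.1], [0.2], [0.3], [5.0], [5.1]]
cluster_ids = [0, 0, 0, 0, 1, 1]
reduced = cluster_undersample(majority, cluster_ids, fraction=0.5)
```

Unlike purely random under-sampling, sampling per cluster guarantees that the small group near 5.0 is not dropped entirely.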

arXiv.org e-Print Archive

Crossref

On the relevance of preprocessing in predictive maintenance for dynamic systems

Author: A Chuang
A Graves
A Savitzky
AJ Smola
AP Bradley
B Schölkopf
BS Yang
BW Silverman
C Cernuda
C Phua
C Wang
Carlos Cernuda
CE Shannon
D Cabrera
D Freedman
D Li
D Lin
D Wolpert
D Wu
DB Rubin
DL Wilson
E Lughofer
F Fleuret
F Serdio
G Brown
G Qiu
G Weiss
GEAPA Batista
GEP Box
H Peng
H Yang
H Zou
HB Mann
HJ Weaver
I Daubechies
I Guyon
I Jolliffe
I Tomek
J Gerretzen
J Ville
JB Tenenbaum
Jorma Laurikkala
K Greff
K Tschumitschew
K Varmuza
KV Branden
L Breiman
L Maaten
L Tan
L Zhang
M Bartlett
M Frigo
M Hubert
M Jung
M Li
MA Oliveira
MR Smith
N Friedman
N Kwak
NE Huang
NV Chawla
O Troyanskaya
P Duhamel
P Mahalanobis
P Welch
PE Hart
R Battiti
R Kohavi
R Nikzad-Langerodi
R Nunkesser
R Tibshirani
RC Sharpley
RD Maesschalck
RM Sakia
RN Bracewell
S García
S Gelper
S Hochreiter
S Kadambe
S Oba
S Roweis
SA Dudani
SE Said
SG Mallat
Sudipto Guha
T Benkedjouh
T Hastie
T Hofmann
T Jo
T Loutas
TY Wu
V Vapnik
W Pedrycz
Y Saeys
Publication date: 01/01/2018

The complexity involved in real-time, data-driven monitoring of dynamic systems for predictive maintenance is usually huge. To a greater or lesser extent, any data-driven approach is sensitive to data preprocessing, understood as any data treatment prior to the application of the monitoring model, and preprocessing is sometimes crucial for the final performance of the employed monitoring technique. The aim of this work is to quantify exhaustively the sensitivity of data-driven predictive maintenance models in dynamic systems. We consider a couple of predictive maintenance scenarios, each defined by publicly available data. For each scenario, we consider its properties and apply several techniques for each of the successive preprocessing steps, e.g. data cleaning, missing values treatment, outlier detection, feature selection, or imbalance compensation. The pretreatment configurations, i.e. sequential combinations of techniques from different preprocessing steps, are considered together with different monitoring approaches, in order to determine the relevance of data preprocessing for predictive maintenance in dynamic systems.
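The notion of pretreatment configurations as sequential combinations of techniques can be made concrete with a small enumeration. The step names follow the abstract, but the technique names listed under each step are hypothetical placeholders, not the ones evaluated in the paper.

```python
from itertools import product

# Hypothetical techniques per preprocessing step, in the order they are applied.
steps = {
    "missing_values": ["drop_rows", "mean_impute"],
    "outlier_detection": ["none", "z_score"],
    "imbalance_compensation": ["none", "oversample", "undersample"],
}


def pretreatment_configurations(steps):
    """Return every combination that picks one technique per step."""
    names = list(steps)
    return [
        dict(zip(names, combo))
        for combo in product(*(steps[name] for name in names))
    ]


configs = pretreatment_configurations(steps)
```

With 2, 2, and 3 options per step, this yields 12 configurations, each of which would be paired with every monitoring approach under study. The count grows multiplicatively with each added step, which is why an exhaustive comparison is expensive.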

Crossref

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

BCAM's Institutional Repository Data

A Comprehensive Survey on Rare Event Prediction

Author: Sheth Amit
Shyalika Chathurangi
Wickramarachchi Ruwan
Publication date: 20/09/2023

Rare event prediction involves identifying and forecasting events with a low probability using machine learning and data analysis. Due to the imbalanced data distributions, where the frequency of common events vastly outweighs that of rare events, it requires using specialized methods within each step of the machine learning pipeline, i.e., from data processing to algorithms to evaluation protocols. Predicting the occurrences of rare events is important for real-world applications, such as Industry 4.0, and is an active research area in statistical and machine learning. This paper comprehensively reviews the current approaches for rare event prediction along four dimensions: rare event data, data processing, algorithmic approaches, and evaluation approaches. Specifically, we consider 73 datasets from different modalities (i.e., numerical, image, text, and audio), four major categories of data processing, five major algorithmic groupings, and two broader evaluation approaches. This paper aims to identify gaps in the current literature and highlight the challenges of predicting rare events. It also suggests potential research directions, which can help guide practitioners and researchers. Comment: 44 page

arXiv.org e-Print Archive

JPPRED: Prediction of Types of J-Proteins from Imbalanced Data Using an Ensemble Learning Method

Publication venue: 'Hindawi Limited'
Publication date: 01/01/2015

Crossref

Evaluating Sampling Techniques for Healthcare Insurance Fraud Detection in Imbalanced Dataset

Author: Hartomo Kristoko Dwi
Lopo Joanito Agili
Publication venue: 'Universitas Ahmad Dahlan, Kampus 3'
Publication date: 18/04/2023

Detecting fraud in healthcare insurance data is challenging due to severe class imbalance, where fraud cases are rare compared to non-fraud cases. Various techniques have been applied to address this problem, such as oversampling and undersampling methods. However, there is a lack of comparison and evaluation of these sampling methods. Therefore, the research contribution of this study is a comprehensive evaluation of the different sampling methods under different class distributions, using multiple evaluation metrics, including Precision and Recall. In addition, a model evaluation approach is proposed to address the issue of inconsistent scores across metrics. This study employs a real-world dataset and the XGBoost algorithm alongside widely used data sampling techniques such as Random Oversampling and Undersampling, SMOTE, and Instance Hardness Threshold. Results indicate that Random Oversampling and Undersampling perform well in the 50% distribution, while the SMOTE and Instance Hardness Threshold methods are more effective in the 70% distribution. Instance Hardness Threshold performs best in the 90% distribution. The 70% distribution is more robust with SMOTE and Instance Hardness Threshold, particularly in the consistency of scores across metrics, although these methods have longer computation times. These models consistently performed well across all evaluation metrics, indicating their ability to generalize to new unseen data in both the minority and majority classes. The study also identifies key features such as costs, diagnosis codes, type of healthcare service, gender, and severity level of diseases, which are important for accurate healthcare insurance fraud detection. These findings could be valuable for healthcare providers to make informed decisions with lower risks. A well-performing fraud detection model ensures the accurate classification of fraud and non-fraud cases. The findings can also be used by healthcare insurance providers to develop more effective fraud detection and prevention strategies.
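Precision and Recall, two of the metrics named in the abstract above, can be computed directly from predicted and true labels. The following is a minimal sketch with hypothetical toy labels, where 1 marks a fraud case; it does not reproduce the study's evaluation protocol.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Compute precision and recall for the given positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical labels: two fraud cases among eight claims.
y_true = [1, 0, 0, 0, 1, 0, 0, 0]
y_pred = [1, 0, 0, 1, 0, 0, 0, 0]
precision, recall = precision_recall(y_true, y_pred)
```

Here one fraud case is caught, one is missed, and one legitimate claim is flagged, so precision and recall are both 0.5 even though six of the eight predictions are correct. This gap between accuracy and minority-class metrics is why imbalanced problems are usually scored with more than one metric.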

Journal of Education and Learning (EduLearn)

UAD Journal Management System

Advanced Data Analytics for Systematic Review Creation and Update

Author: Timsina Prem
Publication venue: Beadle Scholar
Publication date: 01/03/2016

Beadle Scholar at Dakota State University

STREAM-EVOLVING BOT DETECTION FRAMEWORK USING GRAPH-BASED AND FEATURE-BASED APPROACHES FOR IDENTIFYING SOCIAL BOTS ON TWITTER

Author: Alothali Eiman
Publication venue: Scholarworks@UAEU
Publication date: 01/06/2023

This dissertation focuses on the problem of evolving social bots in online social networks, particularly Twitter. Such accounts spread misinformation and inflate social network content to mislead the masses. The main objective of this dissertation is to propose a stream-based evolving bot detection framework (SEBD), which was constructed using both graph- and feature-based models. It was built using Python, a real-time streaming engine (Apache Kafka version 3.2), and our pretrained model (bot multi-view graph attention network (Bot-MGAT)). The feature-based model was used to identify predictive features for bot detection and evaluate the SEBD predictions. The graph-based model was used to facilitate multi-view graph attention networks (GATs) with fellowship links to build our framework for predicting account labels from streams. A probably approximately correct learning framework was applied to confirm the accuracy and confidence levels of SEBD. The results showed that SEBD can effectively identify bots from streams and that profile features are sufficient for detecting social bots. The pretrained Bot-MGAT model uses fellowship links to reveal hidden information that can aid in identifying bot accounts. The significant contributions of this study are the development of a stream-based bot detection framework for detecting social bots based on a given hashtag and the proposal of a hybrid approach for feature selection to identify predictive features for identifying bot accounts. Our findings indicate that Twitter has a higher percentage of active bots than humans in hashtags. The results indicated that stream-based detection is more effective than offline detection, achieving an accuracy score of 96.9%. Finally, semi-supervised learning (SSL) can solve the issue of labeled data in bot detection tasks.

United Arab Emirates University: Scholarworks@UAEU / جامعة الامارات