4 research outputs found

Surrounding neighborhood-based SMOTE for learning from imbalanced data sets

Author: García Jiménez Vicente
Martín Félez Raúl
Mollineda Ramón A.
Sánchez Garreta Josep Salvador
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2012

Many traditional approaches to pattern classification assume that the problem classes share similar prior probabilities. However, in many real-life applications, this assumption is grossly violated. Often, the ratios of prior probabilities between classes are extremely skewed. This situation is known as the class imbalance problem. One of the strategies to tackle this problem consists of balancing the classes by resampling the original data set. The SMOTE algorithm is probably the most popular technique to increase the size of the minority class by generating synthetic instances. From the idea of the original SMOTE, we here propose the use of three approaches to surrounding neighborhood with the aim of generating artificial minority instances, but taking into account both the proximity and the spatial distribution of the examples. Experiments over a large collection of databases and using three different classifiers demonstrate that the new surrounding neighborhood-based SMOTE procedures significantly outperform other existing over-sampling algorithms.
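
As a rough illustration of the baseline the abstract builds on, the sketch below shows the interpolation step commonly associated with the original SMOTE: pick a minority instance, pick one of its k nearest minority neighbours, and place a synthetic point somewhere on the segment between them. It is not the surrounding-neighborhood variant proposed in the paper, whose details the abstract does not give. The function name `smote_sketch` and its parameters are hypothetical, and Euclidean distance is an assumption.

```python
import math
import random


def smote_sketch(minority, n_new, k=5, seed=0):
    """Generate n_new synthetic points from a list of minority instances.

    Hypothetical helper for illustration only: plain k-nearest-neighbour
    interpolation, not the surrounding-neighborhood variants of the paper.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority instances")
    rng = random.Random(seed)
    # Clamp k so every instance has at least one and at most n-1 neighbours.
    k = max(1, min(k, len(minority) - 1))
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Indices of the other minority points, nearest first (Euclidean).
        others = [j for j in range(len(minority)) if j != i]
        others.sort(key=lambda j: math.dist(base, minority[j]))
        neighbour = minority[rng.choice(others[:k])]
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (q - b) for b, q in zip(base, neighbour)])
    return synthetic


minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote_sketch(minority, n_new=10, k=2)
```

Because each synthetic point is a convex combination of two existing minority instances, it always lies inside the region already spanned by the minority class; the choice of which neighbours to interpolate towards is the part the paper's proposals change.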

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

Crossref

Repositori Institucional de la Universitat Jaume I

Values and inductive risk in machine learning modelling: the case of binary classification models

Author: Karaca Koray
Publication date: 01/12/2021

University of Twente Research Information

Surrounding neighborhood-based SMOTE for learning from imbalanced data sets

Author: A Orriols-Puig
BB Chaudhuri
CS Hilas
D Zhang
D-C Li
E Chen
G Cohen
GEAPA Batista
H He
I Brown
J Alcalá-Fdez
J Demšar
J Derrac
J Huang
J Zhang
JW Jaromczyk
M Hall
M Kubat
M Sokolova
N Japkowicz
NV Chawla
R Barandela
S Daskalaki
S García
S Tan
S-H Oh
T Fawcett
T Fawcett
V Ganganwar
V García
Y Jiang
Y Sun
Y-M Huang
Publication venue: 'Springer Science and Business Media LLC'

Crossref

Statistical methods for NHS incident reporting data

Author: Mainey Christopher Paul
Publication venue: UCL (University College London)
Publication date: 28/04/2020

The National Reporting and Learning System (NRLS) is the English and Welsh NHS’ national repository of incident reports from healthcare. It aims to capture details of incident reports, at national level, and facilitate clinical review and learning to improve patient safety. These incident reports range from minor ‘near-misses’ to critical incidents that may lead to severe harm or death. NRLS data are currently reported as crude counts and proportions, but their major use is clinical review of the free-text descriptions of incidents. There are few well-developed quantitative analysis approaches for NRLS, and this thesis investigates these methods. A literature review revealed a wealth of clinical detail, but also systematic constraints of NRLS’ structure, including non-mandatory reporting, missing data and misclassification. Summary statistics for reports from 2010/11 – 2016/17 supported this and suggest NRLS was not suitable for statistical modelling in isolation. Modelling methods were advanced by creating a hybrid dataset using other sources of hospital casemix data from Hospital Episode Statistics (HES). A theoretical model was established, based on ‘exposure’ variables (using casemix proxies), and ‘culture’ as a random-effect. The initial modelling approach examined Poisson regression, mixture and multilevel models. Overdispersion was significant, generated mainly by clustering and aggregation in the hybrid dataset, but models were chosen to reflect these structures. Further modelling approaches were examined, using Generalized Additive Models to smooth predictor variables, regression tree-based models including Random Forests, and Artificial Neural Networks. Models were also extended to examine a subset of death and severe harm incidents, exploring how sparse counts affect models. Text mining techniques were examined for analysis of incident descriptions and showed how term frequency might be used. 
Terms were used to generate latent topic models, which were in turn used to predict the harm level of incidents. Model outputs were used to create a ‘Standardised Incident Reporting Ratio’ (SIRR) and cast this in the mould of current regulatory frameworks, using process control techniques such as funnel plots and cusum charts. A prototype online reporting tool was developed to allow NHS organisations to examine their SIRRs, provide supporting analyses, and link data points back to individual incident reports.
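
The abstract does not define the SIRR or its funnel-plot limits, so the following is only a sketch under stated assumptions: that the ratio is observed reports divided by model-expected reports (by analogy with other standardised ratios), and that control limits around a target of 1 use a simple Poisson normal approximation with no overdispersion adjustment. The function name `sirr_with_limits` is hypothetical.

```python
import math


def sirr_with_limits(observed, expected, z=1.96):
    """Return (ratio, lower, upper) for one organisation.

    Assumptions (not taken from the thesis): ratio = observed / expected,
    and funnel limits are 1 +/- z * sqrt(1 / expected), i.e. a normal
    approximation to a Poisson count with mean `expected`.
    """
    if expected <= 0:
        raise ValueError("expected count must be positive")
    ratio = observed / expected
    half_width = z * math.sqrt(1.0 / expected)
    # A ratio cannot be negative, so the lower limit is floored at zero.
    return ratio, max(0.0, 1.0 - half_width), 1.0 + half_width


ratio, lower, upper = sirr_with_limits(observed=120, expected=100)
outside_limits = ratio < lower or ratio > upper
```

Plotting each organisation's ratio against its expected count, with the two limits drawn as curves, gives the familiar funnel shape: the limits narrow as the expected count grows. Since the abstract reports significant overdispersion, limits this simple would likely be too tight for the real data.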

UCL Discovery