
    Surrounding neighborhood-based SMOTE for learning from imbalanced data sets

    Many traditional approaches to pattern classification assume that the problem classes share similar prior probabilities. However, in many real-life applications, this assumption is grossly violated. Often, the ratios of prior probabilities between classes are extremely skewed. This situation is known as the class imbalance problem. One of the strategies to tackle this problem consists of balancing the classes by resampling the original data set. The SMOTE algorithm is probably the most popular technique to increase the size of the minority class by generating synthetic instances. From the idea of the original SMOTE, we here propose the use of three approaches to surrounding neighborhood with the aim of generating artificial minority instances, but taking into account both the proximity and the spatial distribution of the examples. Experiments over a large collection of databases and using three different classifiers demonstrate that the new surrounding neighborhood-based SMOTE procedures significantly outperform other existing over-sampling algorithms.
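
For readers unfamiliar with the baseline, the following is a minimal sketch of the original SMOTE interpolation step that the abstract builds on. It is not the surrounding neighborhood variant proposed in the paper: it selects neighbours by plain Euclidean distance only. The function name `smote_sketch` and its parameters are hypothetical, chosen for illustration.

```python
import numpy as np

def smote_sketch(minority, n_synthetic, k=5, seed=0):
    """Illustrative SMOTE-style over-sampling (hypothetical helper).

    Each synthetic point lies on the line segment between a randomly
    chosen minority sample and one of its k nearest minority neighbours.
    Assumes at least two minority samples.
    """
    rng = np.random.default_rng(seed)
    minority = np.asarray(minority, dtype=float)
    n = len(minority)
    k = min(k, n - 1)  # cannot have more neighbours than other samples

    # Pairwise Euclidean distances within the minority class.
    diff = minority[:, None, :] - minority[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    np.fill_diagonal(dist, np.inf)  # a sample is not its own neighbour

    # Indices of the k nearest neighbours of every minority sample.
    neighbours = np.argsort(dist, axis=1)[:, :k]

    # Pick a base sample and one of its neighbours for each synthetic point.
    base = rng.integers(0, n, size=n_synthetic)
    pick = neighbours[base, rng.integers(0, k, size=n_synthetic)]

    # Interpolate at a random position along the connecting segment.
    gap = rng.random((n_synthetic, 1))
    return minority[base] + gap * (minority[pick] - minority[base])


minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_sketch(minority, n_synthetic=10, k=2)
```

Because every synthetic point is a convex combination of two existing minority samples, it always falls inside the bounding box of the minority class. A surrounding neighborhood approach would replace only the neighbour-selection step.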

    Statistical methods for NHS incident reporting data

    The National Reporting and Learning System (NRLS) is the English and Welsh NHS’ national repository of incident reports from healthcare. It aims to capture details of incident reports, at national level, and facilitate clinical review and learning to improve patient safety. These incident reports range from minor ‘near-misses’ to critical incidents that may lead to severe harm or death. NRLS data are currently reported as crude counts and proportions, but their major use is clinical review of the free-text descriptions of incidents. There are few well-developed quantitative analysis approaches for NRLS, and this thesis investigates these methods. A literature review revealed a wealth of clinical detail, but also systematic constraints of NRLS’ structure, including non-mandatory reporting, missing data and misclassification. Summary statistics for reports from 2010/11 – 2016/17 supported this and suggest NRLS was not suitable for statistical modelling in isolation. Modelling methods were advanced by creating a hybrid dataset using other sources of hospital casemix data from Hospital Episode Statistics (HES). A theoretical model was established, based on ‘exposure’ variables (using casemix proxies), and ‘culture’ as a random-effect. The initial modelling approach examined Poisson regression, mixture and multilevel models. Overdispersion was significant, generated mainly by clustering and aggregation in the hybrid dataset, but models were chosen to reflect these structures. Further modelling approaches were examined, using Generalized Additive Models to smooth predictor variables, regression tree-based models including Random Forests, and Artificial Neural Networks. Models were also extended to examine a subset of death and severe harm incidents, exploring how sparse counts affect models. Text mining techniques were examined for analysis of incident descriptions and showed how term frequency might be used. 
Terms were used to generate latent topic models, which were in turn used to predict the harm level of incidents. Model outputs were used to create a ‘Standardised Incident Reporting Ratio’ (SIRR), cast in the mould of current regulatory frameworks using process control techniques such as funnel plots and cusum charts. A prototype online reporting tool was developed to allow NHS organisations to examine their SIRRs, provide supporting analyses, and link data points back to individual incident reports.
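
The abstract does not define the SIRR formula. As a rough sketch, assuming it follows the usual indirectly standardised form (observed count divided by model-expected count, as with standardised mortality ratios), a ratio and approximate funnel-plot control limits could be computed as below. The function names and the plain Poisson, normal-approximation limits are assumptions for illustration, not the thesis's method.

```python
import math

def sirr(observed, expected):
    """Assumed form of the ratio: observed incidents / model-expected incidents."""
    return observed / expected

def funnel_limits(expected, z=1.96):
    """Approximate control limits around a target ratio of 1.

    Assumes observed ~ Poisson(expected), so the ratio has standard
    error 1 / sqrt(expected) under a normal approximation.
    """
    half_width = z / math.sqrt(expected)
    return max(0.0, 1.0 - half_width), 1.0 + half_width


ratio = sirr(observed=120, expected=100)
lower, upper = funnel_limits(expected=100)
outside = not (lower <= ratio <= upper)  # flags a point beyond the limits
```

Since the abstract reports significant overdispersion, plain Poisson limits like these would likely be too narrow in practice; some inflation of the limits would be needed, and the form it takes is not specified here.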