11 research outputs found

    Company bankruptcy prediction framework based on the most influential features using XGBoost and stacking ensemble learning

    Company bankruptcy can inflict heavy losses on a company's stakeholders, including owners, investors, employees, and customers. One way to mitigate this risk is to predict the likelihood of bankruptcy from the company's financial data. This study therefore aims to find the best model for predicting company bankruptcy using the Polish companies bankruptcy dataset. The prediction pipeline combines feature selection with ensemble learning: the most influential features are selected using XGBoost feature importance with a weight-value filter of 10, and the ensemble method used is stacking. The stacking model is composed of base models (K-nearest neighbors, decision tree, SVM, and random forest) and a meta-learner (LightGBM). The stacking model outperforms every base model, reaching an accuracy of 97%.
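    The stacking setup described above can be sketched with scikit-learn's StackingClassifier. This is a minimal illustration, not the paper's implementation: the data is a hypothetical synthetic stand-in for the Polish bankruptcy dataset, and a logistic regression stands in for the LightGBM meta-learner so the sketch depends only on scikit-learn.

```python
# Sketch of a stacking ensemble with the paper's base models
# (KNN, decision tree, SVM, random forest). The meta-learner here is a
# logistic regression; the paper uses LightGBM.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Hypothetical imbalanced stand-in for the bankruptcy data.
X, y = make_classification(n_samples=600, n_features=10,
                           weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

stack = StackingClassifier(
    estimators=[
        ("knn", KNeighborsClassifier()),
        ("dt", DecisionTreeClassifier(random_state=0)),
        ("svm", SVC(probability=True, random_state=0)),
        ("rf", RandomForestClassifier(random_state=0)),
    ],
    final_estimator=LogisticRegression(),  # stand-in for LightGBM
)
stack.fit(X_tr, y_tr)
accuracy = stack.score(X_te, y_te)
```

    The meta-learner is trained on out-of-fold predictions of the base models, which is what lets the stack correct systematic errors of any single base classifier.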

    Prediction of Covid-19 Using Fuzzy-Rough Nearest Neighbor Classification

    Prediction is the process of using statistical or machine-learning techniques to estimate future events or outcomes from patterns and trends observed in historical data. The coronavirus created a global pandemic that significantly disrupted daily schedules and behaviour patterns. Individuals who contract COVID-19 experience a range of symptoms of varying severity, so it is crucial to promptly assess their health condition by evaluating their symptoms and obtaining essential information. To assist in this task, physicians rely on rapid and precise Artificial Intelligence (AI) techniques that help predict patients' mortality risk and the severity of their conditions. Early identification of a patient's severity can conserve hospital resources and prevent fatalities by facilitating immediate medical intervention. This paper introduces an approach that employs the fuzzy-rough nearest neighbour (FRNN) technique to train a classifier capable of predicting the survival outcomes of people affected by COVID-19 with remarkable accuracy. The model is trained on 11 attributes: eight primary clinical symptoms of the virus (nasal congestion, cough, tiredness, runny nose, fever, sore throat, diarrhea, and shortness of breath) and three further features (test indication, age, and gender). The proposed approach employs the ENN-SMOTE algorithm to tackle the issue of imbalanced data, and the experimental results demonstrate its remarkable effectiveness.
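    The FRNN rule can be sketched in a few lines of NumPy. This is a simplified illustration on toy data (not the COVID-19 dataset): each class is scored by the mean of its fuzzy-rough lower and upper approximations over the k nearest neighbours, using the Łukasiewicz implicator and t-norm.

```python
import numpy as np

def frnn_predict(X_train, y_train, x, k=5):
    """Simplified fuzzy-rough nearest-neighbour classification: score each
    class by the average of its lower and upper approximation memberships
    over the k nearest neighbours of x."""
    d = np.linalg.norm(X_train - x, axis=1)
    nn = np.argsort(d)[:k]
    sim = 1.0 - d[nn] / (d[nn].max() + 1e-12)       # similarity in [0, 1]
    scores = {}
    for c in np.unique(y_train):
        member = (y_train[nn] == c).astype(float)    # 1 if neighbour in class c
        lower = np.min(np.minimum(1.0, 1.0 - sim + member))  # Lukasiewicz implicator
        upper = np.max(np.maximum(0.0, sim + member - 1.0))  # Lukasiewicz t-norm
        scores[c] = (lower + upper) / 2.0
    return max(scores, key=scores.get)

# Toy two-class demo.
rng = np.random.default_rng(0)
X_train = np.vstack([rng.normal(0, 0.5, (30, 2)), rng.normal(3, 0.5, (30, 2))])
y_train = np.array([0] * 30 + [1] * 30)
pred = frnn_predict(X_train, y_train, np.array([2.9, 3.1]))
```

    A query point deep inside one class gets a lower-approximation membership near 1 for that class and near 0 for the other, so the decision is driven by how certainly the neighbourhood belongs to each class rather than by a simple vote.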

    Handling Imbalanced Data through Re-sampling: Systematic Review

    Handling imbalanced data is an important issue that can affect the validity and reliability of results. One common approach is re-sampling: balancing the class distribution of a dataset by over-sampling the minority class, under-sampling the majority class, or combining both (hybrid sampling). Over-sampling adds copies of minority-class examples to the dataset; under-sampling removes some majority-class examples. Re-sampling can itself affect model performance, so it is essential to evaluate models with several metrics and to consider alternative techniques such as cost-sensitive learning and anomaly detection. When feasible, collecting a larger sample is another way to improve model performance. In this systematic review, we provide an overview of existing methods for re-sampling imbalanced data, focusing on methods proposed in the literature and evaluating their effectiveness through a thorough examination of experimental results. The goal is to give practitioners a comprehensive understanding of the available re-sampling methods and their strengths and weaknesses, helping them make informed decisions when dealing with imbalanced data.
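    The two basic re-sampling strategies described above can be sketched directly in NumPy. This is a minimal illustration of random over- and under-sampling on a synthetic 90/10 split; libraries such as imbalanced-learn provide production-grade versions.

```python
import numpy as np

rng = np.random.default_rng(42)

def random_oversample(X, y, minority=1):
    """Duplicate randomly chosen minority rows until classes are balanced."""
    idx_min = np.flatnonzero(y == minority)
    idx_maj = np.flatnonzero(y != minority)
    extra = rng.choice(idx_min, size=len(idx_maj) - len(idx_min), replace=True)
    keep = np.concatenate([idx_maj, idx_min, extra])
    return X[keep], y[keep]

def random_undersample(X, y, minority=1):
    """Drop randomly chosen majority rows until classes are balanced."""
    idx_min = np.flatnonzero(y == minority)
    idx_maj = np.flatnonzero(y != minority)
    keep = np.concatenate(
        [idx_min, rng.choice(idx_maj, size=len(idx_min), replace=False)])
    return X[keep], y[keep]

X = rng.normal(size=(100, 3))
y = np.array([0] * 90 + [1] * 10)        # 90/10 class imbalance
X_over, y_over = random_oversample(X, y)   # 180 rows, 90/90
X_under, y_under = random_undersample(X, y)  # 20 rows, 10/10
```

    The trade-off the review discusses is visible here: over-sampling keeps all information but repeats minority rows (risking overfitting), while under-sampling discards majority information.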

    Generating synthetic data for credit card fraud detection using GANs

    Deep learning-based classifiers for object classification and recognition are used across many sectors. However, research shows that deep neural networks perform better on balanced datasets than on imbalanced ones, and production datasets are often imbalanced because fraud cases are rare. Deep generative approaches such as GANs have been applied as an efficient method to augment high-dimensional data. In this study, classifiers based on random forest, nearest neighbour, logistic regression, MLP, and AdaBoost were trained using our novel K-CGAN approach and compared against other oversampling approaches, achieving higher F1-score performance. Experiments demonstrate that the classifiers trained on the augmented set performed far better than the same classifiers trained on the original data, producing an effective fraud-detection mechanism. Furthermore, this research illustrates the data-imbalance problem and introduces a novel model able to generate high-quality synthetic data.

    Selective oversampling approach for strongly imbalanced data

    Imbalanced data poses challenges in many real-world applications. One possible way to improve classifier performance on imbalanced data is oversampling. In this paper, we propose a new selective oversampling approach (SOA) that first isolates the most representative samples of the minority class using an outlier-detection technique and then uses those samples for synthetic oversampling. We show that the proposed approach improves the performance of two state-of-the-art oversampling methods, the synthetic minority oversampling technique and adaptive synthetic sampling. Prediction performance is evaluated on four synthetic and four real-world datasets, and the proposed SOA methods always matched or outperformed the other oversampling methods considered.
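    The SOA idea, filter the minority class with an outlier detector, then oversample only from the retained samples, can be sketched as follows. This is an illustrative reconstruction, not the paper's code: LocalOutlierFactor stands in for the paper's outlier-detection step, and a plain SMOTE-style interpolation stands in for its synthetic oversampling.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor, NearestNeighbors

rng = np.random.default_rng(0)

def selective_oversample(X_min, n_new, k=5):
    """Keep only minority samples an outlier detector marks as inliers,
    then interpolate new points between each kept sample and one of its
    k nearest kept neighbours (SMOTE-style)."""
    inlier = LocalOutlierFactor(
        n_neighbors=min(k, len(X_min) - 1)).fit_predict(X_min) == 1
    core = X_min[inlier]                          # representative samples
    nn = NearestNeighbors(n_neighbors=min(k + 1, len(core))).fit(core)
    _, idx = nn.kneighbors(core)
    new = []
    for _ in range(n_new):
        i = rng.integers(len(core))
        j = idx[i][rng.integers(1, idx.shape[1])]  # column 0 is the point itself
        lam = rng.random()
        new.append(core[i] + lam * (core[j] - core[i]))
    return np.asarray(new)

# Minority class with one obvious outlier at (5, 5).
X_min = np.vstack([rng.normal(0, 0.3, (25, 2)), [[5.0, 5.0]]])
synthetic = selective_oversample(X_min, n_new=40)
```

    Filtering first matters because plain SMOTE interpolating toward a noisy outlier would place synthetic minority points inside majority territory.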

    A bibliometric analysis of business bankruptcy prediction with Machine Learning

    The aim of this article is to present a bibliometric analysis of the use of Machine Learning (ML) techniques in predicting business bankruptcy, based on a review of the Web of Science database. This exercise provides information on how such techniques were introduced and adapted. To that end, the different ML techniques applied in bankruptcy-prediction models are identified. The search yields 327 documents, which are classified by performance-evaluation measure, the area under the curve (AUC) and accuracy (ACC) being the most used in the classification process. In addition, the researchers, institutions, and countries with the largest number of applications of this kind are identified. The results show that the XGBoost, SVM, SMOTE, RF, and DT algorithms have far greater predictive capacity than traditional methodologies, focused on a time horizon before the event given their greater precision. Likewise, financial and non-financial variables contribute favourably to the estimate.

    A K-means geometric SMOTE with data complexity analysis for imbalanced datasets

    Many binary-class datasets in real-life applications are affected by the class-imbalance problem. Data complexities such as noisy examples, class overlap, and small disjuncts play a key role in producing poor classification performance, and they tend to exist in tandem with class imbalance. The Synthetic Minority Oversampling Technique (SMOTE) is a well-known method for re-balancing imbalanced datasets, but it cannot effectively tackle these data complexities and can even magnify them, which is why various SMOTE variants have been proposed. Furthermore, no existing study has identified the correlation between the N1 complexity measure and classification measures such as geometric mean (G-Mean) and F1-Score. This study aims (i) to identify complexity measures that correlate with performance measures, (ii) to propose a new SMOTE variant, K-Means Geometric SMOTE (KM-GSMOTE), that incorporates complexity measures during synthetic data generation, and (iii) to evaluate KM-GSMOTE in terms of classification performance. A series of experiments evaluated the G-Mean and F1-Score classification performance and the N1 complexity of benchmark SMOTE variants and KM-GSMOTE on six benchmark binary datasets from the UCI repository. KM-GSMOTE records the highest average improvement in G-Mean (22.76%) and F1-Score (15.13%) for the SVM classifier, and a correlation between classification measures and the N1 complexity measure was observed in the experimental results.
    The contributions of this study are (i) KM-GSMOTE, which combines complexity measurement with model selection to pick models with the best classification performance and the lowest complexity value, and (ii) the observed connection between classification performance and complexity: as the N1 complexity measure decreases, the likelihood of obtaining substantial classification performance increases.
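    The clustered-oversampling core of the method can be sketched as follows. Note the assumptions: this is plain K-means-clustered SMOTE-style interpolation on synthetic data; the actual KM-GSMOTE additionally uses geometric sampling regions and N1-guided model selection, which are not reproduced here.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)

def kmeans_smote(X_min, n_new, n_clusters=3):
    """Cluster the minority class with k-means, then generate synthetic
    points by interpolating between pairs drawn from the SAME cluster,
    so new points stay inside dense minority regions."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=1).fit(X_min)
    new = []
    while len(new) < n_new:
        c = rng.integers(n_clusters)
        members = X_min[km.labels_ == c]
        if len(members) < 2:
            continue                      # cannot interpolate in a singleton
        a, b = members[rng.choice(len(members), 2, replace=False)]
        new.append(a + rng.random() * (b - a))
    return np.asarray(new)

X_min = rng.normal(size=(30, 4))          # toy minority class
synthetic = kmeans_smote(X_min, n_new=50)
```

    Restricting interpolation to within-cluster pairs is what lets the variant avoid the SMOTE failure mode of generating points between disjoint minority sub-concepts, i.e. in majority territory.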

    Predictive analytics framework for electronic health records with machine learning advancements : optimising hospital resources utilisation with predictive and epidemiological models

    The primary aim of this thesis was to investigate the feasibility and robustness of predictive machine-learning models for improving hospital resource utilisation with data-driven approaches and for predicting hospitalisation with hospital quality-assessment metrics such as length of stay. The length-of-stay work includes validating the proposed methodological predictive framework on each hospital's electronic health records. In this thesis, we relied on electronic health records (EHRs) to drive a data-driven framework for predicting inpatient length of stay (LOS) suited to the most demanding hospital facilities, focusing on dynamic settings such as intensive care units and emergency departments. While hospital LOS predictions are (internal) assessments of inpatient outcomes from admission to discharge, the thesis also considered (external) factors outside hospital control, such as forecasting future hospitalisations from the spread of infectious communicable disease during pandemics. This internal/external split is the thesis' main contribution. The thesis therefore evaluated public-health measures during events of uncertainty (e.g. pandemics) and measured the effect of non-pharmaceutical interventions during outbreaks on future hospitalised cases. To the best of our knowledge, this is the first work in the literature to use simulation models of epidemiological curves to project future hospitalisations and their strong potential to impact hospital-bed availability and stress hospital workflows and workers.
    The main research commonality between chapters is the usefulness of ensemble learning models for LOS-driven hospital resource utilisation. Ensemble models achieve better predictive performance by combining several base models into an optimal predictive model. These predictive models explored internal LOS for various chronic and acute conditions using data-driven approaches to determine the most accurate and powerful predicted outcomes, ultimately helping hospital professionals achieve desired outcomes in hospital settings.
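    An ensemble LOS regressor of the kind described can be sketched with scikit-learn's VotingRegressor, which averages several base models. This is an illustrative example on hypothetical synthetic features standing in for EHR variables, not the thesis framework or its data.

```python
import numpy as np
from sklearn.ensemble import (GradientBoostingRegressor,
                              RandomForestRegressor, VotingRegressor)
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Hypothetical stand-in for EHR features (e.g. age, vitals, comorbidities).
X = rng.normal(size=(400, 5))
# Simulated LOS in days: nonlinear in one feature, linear in another.
los = 3 + X[:, 0] ** 2 + 0.5 * X[:, 1] + rng.normal(0, 0.3, 400)

X_tr, X_te, y_tr, y_te = train_test_split(X, los, random_state=0)
ensemble = VotingRegressor([
    ("lr", LinearRegression()),
    ("rf", RandomForestRegressor(random_state=0)),
    ("gb", GradientBoostingRegressor(random_state=0)),
]).fit(X_tr, y_tr)
r2 = ensemble.score(X_te, y_te)   # R^2 on held-out admissions
```

    Averaging a linear model with two tree ensembles is a simple concrete instance of the claim that combining base models yields a stronger predictor than most of its members alone.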

    Application of data analytics for predictive maintenance in aerospace: an approach to imbalanced learning.

    The use of aircraft operational logs to predict potential failures that may lead to disruption poses many challenges and has yet to be fully explored. These logs are captured during each flight and contain streamed data from various aircraft subsystems relating to status and warning indicators; they may therefore be regarded as complex multivariate time-series data. Because aircraft are high-integrity assets, failures are extremely rare, so the distribution of relevant data containing prior indicators is highly skewed towards the normal (healthy) case. This presents a significant challenge for data-driven techniques learning relationships and patterns that depict fault scenarios, since models become biased towards the heavily weighted no-fault outcomes. This thesis aims to develop a predictive model for aircraft component failure utilising data from the aircraft central maintenance system (ACMS). The initial objective is to determine the suitability of the ACMS data for predictive-maintenance modelling. An exploratory analysis of the data revealed several inherent irregularities, including an extreme data-imbalance problem, irregular patterns and trends, class overlap, and small class disjuncts, all of which are significant drawbacks for traditional machine-learning algorithms, resulting in low-performance models. Four novel advanced imbalanced-classification techniques are developed to handle the identified irregularities: the first focuses on pattern extraction and uses bootstrapping to oversample the minority class; the second employs a balanced calibrated hybrid ensemble technique to overcome class overlap and small class disjuncts; the third uses a derived loss function and a new network architecture to handle extremely imbalanced ratios in deep neural networks; and the fourth is a deep reinforcement learning approach for imbalanced classification problems in log-based datasets.
    An ACMS dataset and its accompanying maintenance records were used to validate the proposed algorithms. The overall finding indicates that advanced methods for handling extremely imbalanced problems using log-based ACMS datasets are viable for developing robust data-driven predictive-maintenance models for aircraft component failure. When the four implementations were compared, the deep reinforcement learning (DRL) strategy, specifically the proposed double deep State-action-reward-state-action with prioritised experience replay memory (DDSARSA+PER), outperformed the other methods in terms of false-positive and false-negative rates for all the components considered. The validation results further suggest that the DDSARSA+PER model can predict around 90% of aircraft component replacements with a 0.005 false-negative rate in both the A330 and A320 aircraft families studied in this research.
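    The prioritised experience replay (PER) component named in DDSARSA+PER can be sketched as a small proportional-priority buffer. This is a generic textbook-style sketch of PER, not the thesis implementation: transitions with larger temporal-difference (TD) error are replayed more often, which is one way a DRL agent can focus learning on the rare minority-class (fault) cases.

```python
import random

class PrioritizedReplay:
    """Minimal proportional prioritised experience replay buffer."""

    def __init__(self, capacity, alpha=0.6):
        self.capacity, self.alpha = capacity, alpha
        self.data, self.priorities = [], []

    def add(self, transition, td_error):
        if len(self.data) >= self.capacity:   # evict the oldest transition
            self.data.pop(0)
            self.priorities.pop(0)
        self.data.append(transition)
        # Priority grows with TD error; epsilon keeps it strictly positive.
        self.priorities.append((abs(td_error) + 1e-6) ** self.alpha)

    def sample(self, batch_size):
        # High-TD-error transitions are sampled proportionally more often.
        return random.choices(self.data, weights=self.priorities, k=batch_size)

random.seed(0)
buf = PrioritizedReplay(capacity=100)
buf.add(("s0", "a0", -1.0, "s1"), td_error=0.01)  # well-learned transition
buf.add(("s1", "a1", +1.0, "s2"), td_error=5.0)   # surprising (rare) transition
batch = buf.sample(20)
```

    In an imbalanced-classification framing, misclassifying a rare fault yields a large TD error, so PER keeps replaying exactly the minority-class experiences a uniform buffer would drown out.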

    Implementation of machine learning for the evaluation of mastitis and antimicrobial resistance in dairy cows

    Bovine mastitis is one of the biggest concerns in the dairy industry, where it affects sustainable milk production, farm economics, and animal health. Most mastitis pathogens are bacterial in origin, and their accurate diagnosis enables understanding of the epidemiology, outbreak prevention, and rapid cure of the disease. This thesis aimed to provide a diagnostic solution that couples Matrix-Assisted Laser Desorption/Ionization-Time of Flight (MALDI-TOF) mass spectrometry with machine learning (ML) to detect bovine mastitis pathogens at the subspecies level based on their phenotypic characters. In Chapter 3, MALDI-TOF coupled with ML was used to discriminate bovine mastitis-causing Streptococcus uberis by transmission route, contagious versus environmental; S. uberis isolates collected from dairy farms across England and Wales were compared within and between farms, and the findings suggest the proposed methodology has the potential for successful classification at the farm level. In Chapter 4, MALDI-TOF coupled with ML was used to show proteomic differences between bovine mastitis-causing Escherichia coli isolates with different clinical outcomes (clinical and subclinical) and disease phenotypes (persistent and non-persistent); the findings show that phenotypic differences can be detected by the proposed methodology even for genotypically identical isolates. In Chapter 5, MALDI-TOF coupled with ML was used to differentiate benzylpenicillin signatures of bovine mastitis-causing Staphylococcus aureus isolates, demonstrating a fast, affordable, and effective diagnostic solution for targeting resistant bacteria in dairy cows. Having shown that this methodology successfully differentiated benzylpenicillin-resistant and -susceptible S. aureus isolates in Chapter 5, the same technique was applied in Chapter 6 to other mastitis agents, Enterococcus faecalis and Enterococcus faecium, and to profiling antimicrobials other than benzylpenicillin; the findings demonstrate that MALDI-TOF coupled with ML allows monitoring of the disease's epidemiology and provides suggestions for adjusting farm-management strategies. Taken together, this thesis highlights that MALDI-TOF coupled with ML is capable of discriminating bovine mastitis pathogens at the subspecies level by transmission route, clinical outcome, and antimicrobial-resistance profile, and could be used as a diagnostic tool for bovine mastitis on dairy farms.
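    The spectra-plus-ML workflow described can be sketched with a standard scikit-learn pipeline. Everything here is a hypothetical stand-in: the "spectra" are synthetic vectors of binned intensities in which two isolate groups differ at a few bins, and an SVM stands in for whichever classifier the thesis chapters actually used.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Hypothetical binned mass-spectrum intensities: two isolate groups whose
# spectra differ at a handful of m/z bins (group-specific peaks).
n_bins = 200
base = rng.random(n_bins)
X, y = [], []
for label, peak_bins in [(0, [10, 50]), (1, [20, 120])]:
    for _ in range(40):
        spec = base + rng.normal(0, 0.05, n_bins)  # shared profile + noise
        spec[peak_bins] += 0.5                     # group-specific peaks
        X.append(spec)
        y.append(label)
X, y = np.asarray(X), np.asarray(y)

# Scale each bin, then classify; score by 5-fold cross-validation.
clf = make_pipeline(StandardScaler(), SVC())
scores = cross_val_score(clf, X, y, cv=5)
```

    The point of the sketch is the shape of the problem: once spectra are reduced to fixed-length intensity vectors, subspecies-level discrimination becomes an ordinary supervised-classification task.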