
    Credit risk modeling: A comparative analysis of artificial and deep neural networks

    Credit risk assessment plays a major role in banks and financial institutions in preventing counterparty risk failure. A robust risk management system must be able to detect risks early, yet many bank systems today lack this key capability, which leads to further losses (MGI, 2017). In search of an improved methodology for earlier credit risk detection, a comparative analysis between a Deep Neural Network (DNN) and machine learning techniques such as Support Vector Machines (SVM), K-Nearest Neighbours (KNN) and Artificial Neural Networks (ANN) was conducted. The Deep Neural Network used in this study consists of six layers of neurons. Further, sampling techniques such as SMOTE, SVM-SMOTE, RUS, and All-KNN were applied to turn the imbalanced dataset into a balanced one. Using supervised learning techniques, the proposed DNN model achieved an accuracy of 82.18% with a ROC score of 0.706 using the RUS sampling technique. The All-KNN sampling technique achieved the maximum number of true positives in two different models. Using the proposed approach, banks and credit check institutions can help prevent major losses caused by counterparty risk failure.
    Keywords: credit risk; deep neural network; artificial neural network; support vector machines; sampling technique
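    The random undersampling (RUS) step mentioned in this abstract can be sketched in a few lines. This is a generic illustration of RUS for a binary label, not the authors' exact pipeline; the function name `random_undersample` is hypothetical:

```python
import random

def random_undersample(X, y, seed=0):
    """Random undersampling (RUS): randomly drop majority-class
    samples until both classes have the same count."""
    rng = random.Random(seed)
    majority = max(set(y), key=y.count)          # most frequent label
    maj_idx = [i for i, lbl in enumerate(y) if lbl == majority]
    min_idx = [i for i, lbl in enumerate(y) if lbl != majority]
    keep = sorted(rng.sample(maj_idx, len(min_idx)) + min_idx)
    return [X[i] for i in keep], [y[i] for i in keep]
```

    In practice a library implementation such as imbalanced-learn's `RandomUnderSampler` would typically be used instead.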

    Predicting breast cancer risk, recurrence and survivability

    This thesis focuses on predicting breast cancer at early stages by using machine learning algorithms on biological datasets. The accuracy of these algorithms has been improved to enable physicians to enhance the success of treatment, thus saving lives and avoiding several further medical tests.

    Handling Class Imbalance Using Swarm Intelligence Techniques, Hybrid Data and Algorithmic Level Solutions

    This research focuses mainly on the binary class imbalance problem in data mining. It investigates the use of combined approaches of data and algorithmic level solutions. Moreover, it examines the use of swarm intelligence and population-based techniques to combat the class imbalance problem at all levels, including the data, algorithmic, and feature levels. It also introduces various solutions to the class imbalance problem in which swarm intelligence techniques like Stochastic Diffusion Search (SDS) and Dispersive Flies Optimisation (DFO) are used. The algorithms were evaluated in experiments on imbalanced datasets, in which the Support Vector Machine (SVM) was used as a classifier. SDS was used to perform informed undersampling of the majority class to balance the dataset. The results indicate that this algorithm improves classifier performance and can be used on imbalanced datasets. Moreover, SDS was extended further to perform feature selection on high dimensional datasets. Experimental results show that SDS can be used to perform feature selection and improve classifier performance on imbalanced datasets. Further experiments evaluated DFO as an algorithmic level solution to optimise the SVM kernel parameters when learning from imbalanced datasets. Based on the promising results of DFO in these experiments, the novel approach was extended further into a hybrid algorithm that simultaneously optimises the kernel parameters and performs feature selection.
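    "Informed undersampling of the majority class" can be illustrated with a much simpler heuristic than the SDS algorithm the abstract describes: keep all minority samples plus the majority samples nearest to them. This is only a stand-in sketch to convey the idea of informed (rather than random) undersampling; the function name `informed_undersample` and the nearest-neighbour criterion are assumptions, not the authors' method:

```python
import math

def informed_undersample(X, y, minority_label=1):
    """Keep every minority sample plus an equal number of majority
    samples that lie closest to the minority class (boundary-focused
    undersampling), producing a balanced dataset."""
    min_pts = [x for x, lbl in zip(X, y) if lbl == minority_label]
    maj = [(x, lbl) for x, lbl in zip(X, y) if lbl != minority_label]
    # score each majority sample by its distance to the nearest minority sample
    def score(x):
        return min(math.dist(x, m) for m in min_pts)
    kept = sorted(maj, key=lambda p: score(p[0]))[: len(min_pts)]
    Xb = min_pts + [x for x, _ in kept]
    yb = [minority_label] * len(min_pts) + [lbl for _, lbl in kept]
    return Xb, yb
```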

    Evolving interval-based representation for multiple classifier fusion.

    Designing an ensemble of classifiers is one of the popular research topics in machine learning, since it can give better results than using each constituent member alone. Furthermore, the performance of an ensemble can be improved using selection or adaptation. In the former, the optimal set of base classifiers, meta-classifier, original features, or meta-data is selected to obtain a better ensemble than using all classifiers and features. In the latter, the base classifiers or the combining algorithms working on the outputs of the base classifiers are made to adapt to a particular problem; adaptation here means that the parameters of these algorithms are trained to be optimal for each problem. In this study, we propose a novel evolving combining algorithm using the adaptation approach for ensemble systems. Instead of using a numerical value when computing the representation for each class, we propose an interval-based representation for the class. The optimal value of the representation is found through Particle Swarm Optimization. During classification, a test instance is assigned to the class whose interval-based representation is closest to the base classifiers' prediction. Experiments conducted on a number of popular datasets confirmed that the proposed method is better than the well-known ensemble systems using Decision Template and Sum Rule as combiner, L2-loss Linear Support Vector Machine, Multiple Layer Neural Network, and the ensemble selection methods based on GA-Meta-data, META-DES, and ACO.
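    The classification rule described here, assigning a test instance to the class whose interval-based representation is closest to the base classifiers' output, can be sketched as follows. The intervals would be produced offline by Particle Swarm Optimization; here they are assumed given, and the names `interval_distance` and `predict` are illustrative, not from the paper:

```python
def interval_distance(p, iv):
    """Distance from scalar support p to interval iv = (lo, hi);
    zero when p falls inside the interval."""
    lo, hi = iv
    return max(lo - p, p - hi, 0.0)

def predict(support, representations):
    """support: averaged base-classifier support, one value per class.
    representations: {label: list of (lo, hi) intervals, one per class
    dimension}.  Returns the label whose intervals are closest to the
    support vector."""
    def dist(ivs):
        return sum(interval_distance(p, iv) for p, iv in zip(support, ivs))
    return min(representations, key=lambda lbl: dist(representations[lbl]))
```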

    Autoencoder-based techniques for improved classification in settings with high dimensional and small sized data

    Neural network models have been widely tested and analysed using large sized high dimensional datasets. In real world application problems, the available datasets are often limited in size due to reasons related to the cost or difficulties encountered while collecting the data. This limitation in the number of examples may challenge the classification algorithms and degrade their performance. A motivating example for this kind of problem is predicting the health status of a tissue given its gene expression, when the number of samples available to learn from is very small. Gene expression data has distinguishing characteristics attracting the machine learning research community. The high dimensionality of the data is one of the integral features that has to be considered when building predictive models. A single sample of the data is expressed by thousands of gene expressions, compared to the benchmark images and texts that only have a few hundred features and are commonly used for analysing existing models. Gene expression data samples are also distributed unequally among the classes; in addition, they include noisy features which degrade the prediction accuracy of the models. These characteristics give rise to the need for effective dimensionality reduction methods that are able to discover the complex relationships between the features, such as autoencoders. This thesis investigates the problem of predicting from small sized high dimensional datasets by introducing novel autoencoder-based techniques to increase the classification accuracy of the data. Two autoencoder-based methods, for generating synthetic data examples and synthetic representations of the data respectively, were introduced in the first stage of the study. Both of these methods are applicable to the testing phase of the autoencoder and proved successful in increasing the predictability of the data. Enhancing the autoencoder's ability to learn from small sized imbalanced data was investigated in the second stage of the project, to come up with techniques that improved the autoencoder's generated representations. Employing the radial basis activation mechanism used in radial-basis function networks, which learn in a supervised manner, was a solution provided by this thesis to enhance the representations learned by unsupervised algorithms. This technique was later applied to stochastic variational autoencoders and showed promising results in learning discriminating representations from the gene expression data. The contributions of this thesis can be described as a number of different methods applicable to different stages (training and testing) and different autoencoder models (deterministic and stochastic) which, individually, allow for enhancing the predictability of small sized high dimensional datasets compared to well known baseline methods.
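    One common way to generate synthetic representations from an autoencoder's latent space is to interpolate between the encodings of real samples. The sketch below shows only that interpolation idea; it is not the thesis's method, and the function name `synthesize` is hypothetical:

```python
import random

def synthesize(latents, n_new, seed=0):
    """Generate n_new synthetic latent vectors by linearly
    interpolating between random pairs of encoded samples."""
    rng = random.Random(seed)
    out = []
    for _ in range(n_new):
        a, b = rng.sample(latents, 2)            # pick two distinct encodings
        t = rng.random()                          # interpolation coefficient
        out.append([ai + t * (bi - ai) for ai, bi in zip(a, b)])
    return out
```

    The synthetic vectors would then be decoded back to the input space, or used directly as extra training representations for a downstream classifier.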

    A Cost-Sensitive Ensemble Method for Class-Imbalanced Datasets

    In imbalanced learning, resampling methods modify an imbalanced dataset to form a balanced one, and many base classifiers perform better on balanced datasets than on imbalanced ones. This paper proposes a cost-sensitive ensemble method based on a cost-sensitive support vector machine (SVM) and query-by-committee (QBC) to solve imbalanced data classification. The proposed method first divides the majority-class dataset into several sub-datasets according to the imbalance ratio and trains subclassifiers using the AdaBoost method. Then, it generates candidate training samples with the QBC active learning method and uses a cost-sensitive SVM to learn the training samples. On 5 class-imbalanced datasets, experimental results show that the proposed method achieves a higher area under the ROC curve (AUC), F-measure, and G-mean than many existing class-imbalanced learning methods.
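    The first step, splitting the majority class into several sub-datasets and pairing each with the minority class, can be sketched as below. This is an EasyEnsemble-style illustration of the partitioning idea only (no AdaBoost, QBC, or cost-sensitive SVM), and the name `partition_majority` is an assumption:

```python
def partition_majority(X, y, minority_label=1):
    """Split the majority class into chunks the size of the minority
    class and pair each chunk with all minority samples, yielding
    several balanced sub-datasets for training subclassifiers."""
    min_set = [(x, lbl) for x, lbl in zip(X, y) if lbl == minority_label]
    maj = [(x, lbl) for x, lbl in zip(X, y) if lbl != minority_label]
    k = len(min_set)
    subs = []
    for i in range(0, len(maj) - k + 1, k):
        subs.append(maj[i:i + k] + min_set)       # balanced sub-dataset
    return subs
```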

    Predicting dental implant failures by integrating multiple classifiers

    The field of data science has made many advances in the application and development of techniques in the health sector, such as disease prediction, image classification, and risk identification and reduction. The objectives of this work were to investigate the benefit of using multiple classification algorithms to predict dental implant failures in patients from Misiones province, Argentina, and to propose a procedure validated by human experts. The model combines the following classifiers: Random Forest, C-Support Vector, K-Nearest Neighbors, Multinomial Naive Bayes, and Multi-layer Perceptron, integrated with the weighted soft voting method. The experimentation was performed with four data sets: a data set of dental implants built for the case study, an artificially generated data set, and two other data sets obtained from different data repositories. The results of the proposed approach on the dental implant data set were validated against the classification performance of human experts. Our approach achieved a success rate of 93% of correctly identified cases, whereas human experts achieved 87% accuracy. Based on this, we can argue that multi-classifier systems are a good approach to predict dental implant failures.
    Authors: Nancy Beatriz Ganz and Alicia Esther Ares (Instituto de Materiales de Misiones, CONICET - Universidad Nacional de Misiones, Argentina); Horacio Daniel Kuna (Instituto de Investigación, Desarrollo e Innovación en Informática, Universidad Nacional de Misiones, Argentina).
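    Weighted soft voting, the integration rule named in this abstract, averages the classifiers' class-probability outputs with per-classifier weights and picks the highest-scoring class. A minimal generic sketch (the function name `weighted_soft_vote` is illustrative, not from the paper):

```python
def weighted_soft_vote(probas, weights):
    """probas: one probability vector per classifier (same class order).
    weights: one weight per classifier.
    Returns the index of the class with the highest weighted
    average probability."""
    n_classes = len(probas[0])
    total = sum(weights)
    avg = [sum(w * p[c] for p, w in zip(probas, weights)) / total
           for c in range(n_classes)]
    return max(range(n_classes), key=avg.__getitem__)
```

    scikit-learn offers the same behaviour via `VotingClassifier(voting='soft', weights=...)`.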

    Improved adaptive semi-unsupervised weighted oversampling (IA-SUWO) using sparsity factor for imbalanced datasets

    The imbalanced data problem is common in data mining nowadays due to the skewed nature of data, which negatively impacts the classification process in machine learning. For preprocessing, oversampling techniques have significantly benefitted the imbalanced domain: artificial data is generated in the minority class to increase the number of samples and balance the distribution of samples across both classes. However, existing oversampling techniques suffer from overfitting and over-generalization problems, which lessen classifier performance. Although many clustering-based oversampling techniques largely overcome these problems, most of them are not able to produce the appropriate number of synthetic samples in minority clusters. This study proposes an improved Adaptive Semi-unsupervised Weighted Oversampling (IA-SUWO) technique using a sparsity factor, which determines the sparse minority samples in each minority cluster. The technique considers the sparse minority samples that lie far from the decision boundary. These samples also carry important information for learning the minority class; if they are considered for oversampling as well, the imbalance ratio is reduced further and the learnability of the classifiers can be enhanced. The outcomes of the proposed approach were compared with existing oversampling techniques such as SMOTE, Borderline-SMOTE, Safe-level SMOTE, and the standard A-SUWO technique in terms of accuracy. The comparative analysis revealed that the proposed approach increased performance on average by 5%, from 85% to 90%, over the existing comparative techniques.
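    The idea of biasing SMOTE-style oversampling toward sparse minority samples can be sketched as follows. The sparsity weight here is simply each sample's mean distance to the other minority samples; this is a simplified stand-in for IA-SUWO's sparsity factor, and the name `sparse_oversample` is hypothetical:

```python
import math
import random

def sparse_oversample(minority, n_new, seed=0):
    """SMOTE-style oversampling biased by a sparsity factor: each
    minority sample is weighted by its mean distance to the others,
    so sparse samples seed proportionally more synthetic points."""
    rng = random.Random(seed)
    weights = []
    for i, a in enumerate(minority):
        dists = [math.dist(a, b) for j, b in enumerate(minority) if j != i]
        weights.append(sum(dists) / len(dists))   # sparsity weight
    synthetic = []
    for _ in range(n_new):
        a = rng.choices(minority, weights=weights)[0]
        b = rng.choice([m for m in minority if m is not a])
        t = rng.random()
        # interpolate between the sparse seed and a random other sample
        synthetic.append([ai + t * (bi - ai) for ai, bi in zip(a, b)])
    return synthetic
```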