9,130 research outputs found

    Linear and Order Statistics Combiners for Pattern Classification

    Several researchers have experimentally shown that substantial improvements can be obtained in difficult pattern recognition problems by combining or integrating the outputs of multiple classifiers. This chapter provides an analytical framework to quantify the improvements in classification results due to combining. The results apply to both linear combiners and order statistics combiners. We first show that, to a first order approximation, the error rate obtained over and above the Bayes error rate is directly proportional to the variance of the actual decision boundaries around the Bayes optimum boundary. Combining classifiers in output space reduces this variance, and hence reduces the "added" error. If N unbiased classifiers are combined by simple averaging, the added error rate can be reduced by a factor of N if the individual errors in approximating the decision boundaries are uncorrelated. Expressions are then derived for linear combiners which are biased or correlated, and the effect of output correlations on ensemble performance is quantified. For order statistics based non-linear combiners, we derive expressions that indicate how much the median, the maximum and in general the ith order statistic can improve classifier performance. The analysis presented here facilitates the understanding of the relationships among error rates, classifier boundary distributions, and combining in output space. Experimental results on several public domain data sets are provided to illustrate the benefits of combining and to support the analytical results. Comment: 31 pages
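
The factor-of-N claim for simple averaging can be illustrated with a small simulation. This is only a sketch and is not taken from the chapter: it assumes each classifier's boundary offset from the Bayes optimum is an independent zero-mean Gaussian, and the function name `boundary_variance` and all parameter values are hypothetical choices made for illustration.

```python
import random
import statistics


def boundary_variance(n_classifiers, n_trials=20000, sigma=1.0, seed=0):
    """Return (single-classifier variance, averaged-combiner variance).

    Assumption (not from the source): each classifier's boundary offset is
    an independent, zero-mean Gaussian with standard deviation sigma, i.e.
    the classifiers are unbiased and their errors are uncorrelated.
    """
    rng = random.Random(seed)
    single = []
    averaged = []
    for _ in range(n_trials):
        # One draw of the boundary offset for every classifier in the ensemble.
        offsets = [rng.gauss(0.0, sigma) for _ in range(n_classifiers)]
        single.append(offsets[0])
        # Simple averaging combiner: mean of the individual offsets.
        averaged.append(sum(offsets) / n_classifiers)
    return statistics.pvariance(single), statistics.pvariance(averaged)


if __name__ == "__main__":
    for n in (1, 5, 10):
        var_single, var_avg = boundary_variance(n)
        # The ratio should be close to n under the independence assumption.
        print(n, round(var_single / var_avg, 2))
```

Under these assumptions the variance ratio is close to N. With correlated or biased classifiers the reduction would be smaller, which is the case the abstract says the chapter treats separately.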

    Application of Machine Learning Techniques in Credit Card Fraud Detection

    Credit card fraud is an ever-growing problem in today’s financial market. There has been a rapid increase in the rate of fraudulent activities in recent years, causing substantial financial losses to many organizations, companies, and government agencies. The numbers are expected to increase in the future, which is why many researchers in this field have focused on detecting fraudulent behavior early using advanced machine learning techniques. However, credit card fraud detection is not a straightforward task, mainly for two reasons: (i) fraudulent behavior usually differs for each attempt and (ii) the dataset is highly imbalanced, i.e., the majority samples (genuine cases) far outnumber the minority samples (fraudulent cases). When a predictive model is given input data with a highly imbalanced class distribution, it tends to be biased towards the majority samples. As a result, it tends to misclassify a fraudulent transaction as a genuine transaction. To tackle this problem, a data-level approach, in which different resampling methods such as undersampling, oversampling, and hybrid strategies are implemented, has been combined with an algorithmic approach, in which ensemble models such as bagging and boosting are applied, on a highly skewed dataset containing 284,807 transactions. Of these transactions, only 492 are labeled as fraudulent. Predictive models such as logistic regression, random forest, and XGBoost, in combination with different resampling techniques, have been applied to predict whether a transaction is fraudulent or genuine. The performance of each model is evaluated based on recall, precision, F1-score, the precision-recall (PR) curve, and the receiver operating characteristic (ROC) curve. The experimental results showed that random forest in combination with a hybrid resampling approach of Synthetic Minority Over-sampling Technique (SMOTE) and Tomek links removal performed better than the other models.
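
The choice of recall, precision, and F1-score over plain accuracy can be motivated with the dataset sizes quoted in the abstract. The sketch below is illustrative only: the helper `precision_recall_f1` is a hypothetical name, and the confusion-matrix counts in the second example are invented for illustration, not results from the paper.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from confusion-matrix counts.

    Returns 0.0 for a metric whose denominator is zero.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Dataset sizes quoted in the abstract.
TOTAL = 284807
FRAUD = 492
GENUINE = TOTAL - FRAUD

# Baseline that labels every transaction as genuine: it is right on every
# genuine case and wrong on every fraudulent one.
accuracy_all_genuine = GENUINE / TOTAL
metrics_all_genuine = precision_recall_f1(tp=0, fp=0, fn=FRAUD)

# Hypothetical detector (counts invented for illustration only).
metrics_hypothetical = precision_recall_f1(tp=400, fp=100, fn=92)

if __name__ == "__main__":
    print(f"accuracy of all-genuine baseline: {accuracy_all_genuine:.4f}")
    print("precision, recall, F1 of baseline:", metrics_all_genuine)
    print("precision, recall, F1 of hypothetical detector:", metrics_hypothetical)
```

The all-genuine baseline scores above 99.8% accuracy while detecting no fraud at all, which is why metrics centred on the minority class are more informative on data this skewed.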

    Methods to Improve the Prediction Accuracy and Performance of Ensemble Models

    The application of ensemble predictive models has been an important research area in medical diagnostics, engineering diagnostics, and related smart devices and technologies. Most current predictive models are complex and unreliable despite numerous past efforts by the research community. The expected accuracy of predictive models has not always been realised, owing to factors such as complexity and class imbalance. There is therefore a need to improve the predictive accuracy of current ensemble models and to enhance their applications, their reliability, and non-visual predictive tools. The research work presented in this thesis adopted a pragmatic phased approach to propose and develop new ensemble models using multiple methods, and validated the methods through rigorous testing and implementation in different phases. The first phase comprises empirical investigations of standalone and ensemble algorithms, carried out to ascertain how the complexity and simplicity of the classifiers affect their performance. The second phase comprises an improved ensemble model based on the integration of the Extended Kalman Filter (EKF), Radial Basis Function Network (RBFN), and AdaBoost algorithms. The third phase comprises an extended model based on early-stopping concepts, the AdaBoost algorithm, and the statistical performance of the training samples, designed to minimize overfitting of the proposed model. The fourth phase comprises an enhanced analytical multivariate logistic regression predictive model developed to minimize the complexity and improve the prediction accuracy of the logistic regression model. To facilitate the practical application of the proposed models, an ensemble non-invasive analytical tool is proposed and developed. The tool bridges the gap between theoretical concepts and their practical application in predicting breast cancer survivability.
The empirical findings suggested that: (1) increasing the complexity and topology of algorithms does not necessarily lead to better algorithmic performance, (2) boosting by resampling performs slightly better than boosting by reweighting, (3) the proposed ensemble EKF-RBFN-AdaBoost model predicted more accurately than several established ensemble models, (4) the proposed early-stopped model converges faster and minimizes overfitting better compared with other models, (5) the proposed multivariate logistic regression concept minimizes model complexity, and (6) the proposed analytical non-invasive tool performed comparatively better than many of the benchmark analytical tools used in predicting breast cancer and diabetes. The research contributions to ensemble practice are: (1) the integration and development of the EKF, RBFN, and AdaBoost algorithms as an ensemble model, (2) the development and validation of an ensemble model based on early-stopping concepts, AdaBoost, and statistical concepts of the training samples, (3) the development and validation of a predictive logistic regression model based on breast cancer, and (4) the development and validation of a non-invasive breast cancer analytic tool based on the predictive models proposed and developed in this thesis. To validate the prediction accuracy of the ensemble models, the proposed models were applied to modelling breast cancer survivability and diabetes diagnostic tasks. In comparison with other established models, the simulation results showed improved predictive accuracy. The research outlines the benefits of the proposed models and proposes new directions for future work that could further extend and improve them.
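
The distinction between boosting by reweighting and boosting by resampling mentioned in the findings can be sketched as follows. The abstract does not specify which AdaBoost variant the thesis uses, so this sketch assumes standard discrete (binary) AdaBoost; the function names `adaboost_weight_update` and `resample_indices` are hypothetical.

```python
import math
import random


def adaboost_weight_update(weights, correct):
    """One round of the discrete AdaBoost weight update (assumed variant).

    weights: current sample weights, summing to 1.
    correct: booleans, True where the weak learner classified the sample
             correctly.
    Returns (alpha, new_weights), with new_weights normalized to sum to 1.
    """
    error = sum(w for w, ok in zip(weights, correct) if not ok)
    # Clamp the weighted error to avoid division by zero or log(0).
    error = min(max(error, 1e-12), 1.0 - 1e-12)
    alpha = 0.5 * math.log((1.0 - error) / error)
    # Misclassified samples are up-weighted, correct ones down-weighted.
    updated = [w * math.exp(-alpha if ok else alpha)
               for w, ok in zip(weights, correct)]
    total = sum(updated)
    return alpha, [w / total for w in updated]


def resample_indices(weights, rng):
    """Boosting by resampling: draw a same-size training set with
    replacement, picking each sample with probability equal to its weight.

    Boosting by reweighting would instead pass `weights` directly to a weak
    learner that supports weighted samples.
    """
    n = len(weights)
    return rng.choices(range(n), weights=weights, k=n)


if __name__ == "__main__":
    start = [0.1] * 10
    correct = [False, False] + [True] * 8  # weak learner misses two samples
    alpha, new_weights = adaboost_weight_update(start, correct)
    print(round(alpha, 4), [round(w, 4) for w in new_weights])
    print(resample_indices(new_weights, random.Random(0)))
```

After the update, the misclassified samples together carry half of the total weight, so the next weak learner, whether trained on the weights or on a resampled set, is pushed towards the cases the previous one got wrong.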

    IMPROVING LABEL PREDICTION IN SOCIAL NETWORKS BY ADDING NOISE

    Social Networks like Facebook and Linkedin have grown tremendously over the las

    QUANTITATIVE IMAGING FOR PRECISION MEDICINE IN HEAD AND NECK CANCER PATIENTS

    The purpose of this work was to determine if prediction models using quantitative imaging measures in head and neck squamous cell carcinoma (HNSCC) patients could be improved when noise due to imaging was reduced. This was investigated separately for salivary gland function using dynamic contrast enhanced magnetic resonance imaging (DCE-MRI), overall survival using computed tomography (CT)-based radiomics, and overall survival using positron emission tomography (PET)-based radiomics. From DCE-MRI, where T1-weighted images are serially acquired after injection of contrast, quantitative measures of diffusion can be obtained from the series of images. Radiomics is the study of the relationship of voxels to one another, providing measures of texture from the area of interest. Quantitative information obtained from imaging could help in radiation treatment planning by providing quantifiable spatial information with computational models for assigning dose to regions to improve patient outcome, both survival and quality of life. By reducing the noise within the quantitative data, the prediction accuracy could improve to move this type of work closer to clinical practice. For each imaging modality, sources of noise that could impact the patient analysis were identified, quantified, and if possible minimized during the patient analysis. In MRI, a large potential source of uncertainty was the image registration. To evaluate this, both physical and synthetic phantoms were used, which showed that the registration accuracy of MR images was high, with all root mean square errors below 3 mm. Then, 15 HNSCC patients with pre-, mid-, and post-treatment DCE-MRI scans were evaluated. However, differences in algorithm output were found to be a large source of noise, as different algorithms could not consistently rank patients as above or below the median for quantitative metrics from DCE-MRI. Therefore, further analysis using this modality was not pursued.
In CT, a large potential source of noise that could impact patient analysis was the inter-scanner variability. To investigate this, a controlled protocol was designed and used to image, along with the local head and chest protocols, a radiomics phantom on 100 CT scanners. This demonstrated that the inter-scanner variability could be reduced by over 50% using a controlled protocol compared to local protocols. Additionally, it was shown that the reconstruction parameters impact feature values while most acquisition parameters do not; therefore, most of this benefit can be achieved using a radiomics reconstruction with no additional dose to the patient. Then, to evaluate this impact in patient studies, 726 HNSCC patients with CT images were used to create and test a Cox proportional hazards model for overall survival. Those patients with the same imaging protocol were subset and a new Cox proportional hazards model was created and tested in order to determine if the reduction in noise due to controlling the imaging protocol translated into improved prediction. However, noise between patient populations from different institutions was shown to be larger than the reduction in noise due to a controlled imaging protocol. In PET, a large potential source of noise that could impact patient analysis was the imaging protocol. A phantom scanned on three different scanners and vendors demonstrated that on a single vendor, imaging parameter choices did not affect radiomics feature values, but inter-scanner variances could be large. Then, 686 HNSCC patients with PET images were used to create and test a Cox proportional hazards model for overall survival. Those patients with the same imaging protocol were subset and a new Cox proportional hazards model was created and tested in order to determine if the reduction in noise due to controlling the imaging protocol on a vendor translated into improved prediction.
However, no predictive radiomics signature could be determined for any subset of the patient cohort that resulted in significant stratification of patients into high and low risk. This study demonstrated that the imaging variability could be quantified and controlled for in each modality. However, for each modality there were larger sources of noise identified that did not allow for improvement in prediction modeling of salivary gland function or overall survival using quantitative imaging metrics for MRI, CT, or PET.