
    Robust Weighted Kernel Logistic Regression in Imbalanced and Rare Events Data

    Recent developments in computing and technology, along with the availability of large amounts of raw data, have contributed to the creation of many effective techniques and algorithms in the fields of pattern recognition and machine learning. The main objectives in developing these algorithms are to identify patterns in the available data, to make predictions, or both. Many classification techniques have achieved great success in real-life applications. Concerning binary data classification in particular, the analysis of data containing rare events or disproportionate class distributions poses a great challenge to industry and to the machine learning community. This study examines rare events (REs) with binary dependent variables containing many times more non-events (zeros) than events (ones); such variables are difficult to predict and to explain, as has been demonstrated in the literature. This research combines rare-events corrections to Logistic Regression (LR) with truncated-Newton methods and applies these techniques to Kernel Logistic Regression (KLR). The resulting model, Rare-Event Weighted Kernel Logistic Regression (RE-WKLR), combines weighting, regularization, approximate numerical methods, kernelization, bias correction, and efficient implementation, all of which enable RE-WKLR to be at once fast, accurate, and robust.
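The two core ingredients the abstract names, kernelization and class weighting, can be illustrated with a minimal sketch. This is not the authors' RE-WKLR algorithm (no truncated-Newton solver, no bias correction); it only shows the generic pattern of fitting a class-weighted, L2-regularized logistic regression on an RBF Gram matrix, with made-up toy data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(0)
# Imbalanced toy data: 200 non-events near the origin, 10 rare events offset from it.
X = np.vstack([rng.normal(0.0, 1.0, (200, 2)), rng.normal(2.5, 0.5, (10, 2))])
y = np.array([0] * 200 + [1] * 10)

# Kernelization: use the RBF Gram matrix as the feature representation.
K = rbf_kernel(X, X, gamma=0.5)

# Weighting: class_weight="balanced" upweights the rare class by its inverse
# frequency; C controls the strength of the L2 regularization.
clf = LogisticRegression(class_weight="balanced", C=1.0, max_iter=1000)
clf.fit(K, y)

# Recall on the rare class (here measured on the training data for brevity).
event_recall = (clf.predict(K)[y == 1] == 1).mean()
```

Without the `class_weight` term, a plain logistic fit on such a 20:1 dataset tends to favor the majority class; the weighting is what recovers recall on the events.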

    GENETIC GENERATION OF FUZZY CLASSIFIERS FOR IMBALANCED DATABASES

    Rare events, unusual patterns, and abnormal behaviors are difficult to detect and often require timely responses (HAIXIANG et al., 2017). Rare events are events that occur with a much lower frequency than common events (MAALOUF; TRAFALIS, 2011). Examples of rare events include software defect detection (RODRIGUEZ et al., 2014), natural disasters (HAIXIANG et al., 2017), and fraud detection in financial transactions (PANIGRAHI, S. et al., 2009), among others.

    Video Genre Classification Using Weighted Kernel Logistic Regression

    Due to the widening semantic gap of videos, computational tools that classify videos into different genres are needed to narrow it. Classifying videos accurately demands a good representation of the video data and an efficient, effective model to carry out the classification task. Kernel Logistic Regression (KLR), the kernel version of logistic regression (LR), has proven efficient as a classifier that naturally provides probabilities and extends to multiclass classification problems. In this paper, the Weighted Kernel Logistic Regression (WKLR) algorithm is applied to video genre classification, where it achieves significant accuracy with fast, reliable results.
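The property the abstract highlights — that KLR naturally yields class probabilities and handles multiclass problems — can be sketched with a kernelized multinomial logistic fit. The three Gaussian clusters standing in for "genres" and all parameters are invented for illustration; this is not the paper's WKLR implementation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(1)
# Three toy "genres" as well-separated Gaussian clusters in feature space.
centers = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 3.0]])
X = np.vstack([rng.normal(c, 0.6, (30, 2)) for c in centers])
y = np.repeat([0, 1, 2], 30)

K = rbf_kernel(X, X, gamma=0.5)                    # Gram matrix as features
clf = LogisticRegression(max_iter=1000).fit(K, y)  # multinomial by default

# Probabilities for one clip-level feature vector, kernelized against the
# training set: a full distribution over genres, not just a hard label.
proba = clf.predict_proba(rbf_kernel(X[:1], X, gamma=0.5))
```

The probability row sums to one across the three genres, which is what lets KLR-style models rank or threshold predictions rather than only assign labels.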

    Resampling Logistic Regression for Handling Class Imbalance in Software Defect Prediction

    High-quality software is software that supports a company's business processes effectively and efficiently, with no defects found during testing, inspection, and implementation. Fixing software after delivery and implementation costs far more than fixing it during development, and software testing alone consumes more than 50% of the development budget. A software defect prediction model is therefore needed to reduce these costs. To date, no software defect prediction model has proven universally applicable. Logistic Regression is among the most effective and efficient models for software defect prediction, but it is prone to underfitting on datasets with imbalanced classes, which lowers its accuracy. The NASA MDP dataset is a common benchmark for software defect prediction and, like most defect prediction datasets, it exhibits class imbalance. To handle the class imbalance in software defect datasets, this study proposes a resampling method. Experiments were conducted to compare the performance of Logistic Regression before and after applying resampling, and to compare the proposed method against other classifiers such as Naïve Bayes, Linear Discriminant Analysis, C4.5, Random Forest, Neural Network, and k-Nearest Neighbor. The results show that Logistic Regression with resampling achieves higher accuracy than Logistic Regression without resampling, and also outperforms the other classifiers.
    From these experiments it can be concluded that resampling is effective in addressing class imbalance in software defect prediction with the Logistic Regression algorithm.
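The resampling idea the abstract proposes can be sketched in its simplest form: randomly oversampling the defective (minority) class until the two classes are balanced, then fitting logistic regression on the balanced data. The toy data below stands in for a defect dataset and is invented; the abstract does not specify which resampling variant was used.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.utils import resample

rng = np.random.default_rng(2)
# Toy defect data: 180 clean modules (0) vs 20 defective modules (1).
X = np.vstack([rng.normal(0.0, 1.0, (180, 3)), rng.normal(1.5, 1.0, (20, 3))])
y = np.array([0] * 180 + [1] * 20)

# Random oversampling: replicate minority rows (with replacement)
# until the defective class matches the majority class in size.
X_min, y_min = X[y == 1], y[y == 1]
X_up, y_up = resample(X_min, y_min, replace=True, n_samples=180, random_state=0)

X_bal = np.vstack([X[y == 0], X_up])
y_bal = np.concatenate([y[y == 0], y_up])

# Logistic regression now trains on a 50/50 class distribution.
clf = LogisticRegression(max_iter=1000).fit(X_bal, y_bal)
```

Oversampling is applied only to the training split in practice; evaluating on resampled data would inflate the reported accuracy.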

    Predicting class-imbalanced business risk using resampling, regularization, and model ensembling algorithms

    We aim at developing and improving imbalanced business risk modeling by jointly using proper evaluation criteria, resampling, cross-validation, classifier regularization, and ensembling techniques. The Area Under the Receiver Operating Characteristic Curve (AUC of ROC) is used for model comparison, based on 10-fold cross-validation. Two undersampling strategies, random undersampling (RUS) and cluster centroid undersampling (CCUS), as well as two oversampling methods, random oversampling (ROS) and the Synthetic Minority Oversampling Technique (SMOTE), are applied. Three highly interpretable classifiers are implemented: logistic regression without regularization (LR), L1-regularized LR (L1LR), and decision tree (DT). Two ensembling techniques, Bagging and Boosting, are applied to the DT classifier for further model improvement. The results show that Boosting on DT using oversampled data containing 50% positives via SMOTE is the optimal model, achieving AUC, recall, and F1 scores of 0.8633, 0.9260, and 0.8907, respectively.
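The winning combination the abstract reports — SMOTE-style oversampling to 50% positives followed by boosting on decision trees — can be sketched as follows. The SMOTE step is implemented here as plain interpolation between random minority pairs (the core of the technique) rather than the full k-nearest-neighbor algorithm, and all data and parameters are invented.

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(3)
# Toy risk data: 150 negatives vs 15 positives (the risky class).
X = np.vstack([rng.normal(0.0, 1.0, (150, 2)), rng.normal(2.0, 0.8, (15, 2))])
y = np.array([0] * 150 + [1] * 15)

# SMOTE-style oversampling: synthesize 135 new positives by interpolating
# between random pairs of existing positives, reaching a 50/50 split.
minority = X[y == 1]
i = rng.integers(0, len(minority), 135)
j = rng.integers(0, len(minority), 135)
lam = rng.random((135, 1))
synthetic = minority[i] + lam * (minority[j] - minority[i])

X_bal = np.vstack([X, synthetic])
y_bal = np.concatenate([y, np.ones(135, dtype=int)])

# Boosting on decision trees (AdaBoost's default base learner is a tree stump).
boost = AdaBoostClassifier(n_estimators=50, random_state=0).fit(X_bal, y_bal)
auc = roc_auc_score(y, boost.predict_proba(X)[:, 1])
```

As in the paper, AUC is the appropriate comparison metric here, since plain accuracy on a 10:1 class ratio rewards trivial majority-class predictions.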

    Comparison of classification algorithms to predict outcomes of feedlot cattle identified and treated for Bovine Respiratory Disease

    Bovine respiratory disease (BRD) continues to be the primary cause of morbidity and mortality in feedyard cattle. Accurate identification of those animals that will not finish the production cycle normally following initial treatment for BRD would provide feedyard managers with opportunities to manage those animals more effectively. Our objectives were to assess the ability of different classification algorithms to accurately predict an individual calf’s outcome based on data available at first identification of and treatment for BRD, and to identify characteristics of calves for which predictive models performed well as gauged by accuracy. Data from 23 feedyards in multiple geographic locations within the U.S. from 2000 to 2009, representing over one million animals, were analyzed to identify animals clinically diagnosed with BRD and treated with an antimicrobial. These data were analyzed both as a single dataset and as multiple datasets based on individual feedyards, and were partitioned into training, testing, and validation datasets. Classifiers were trained and optimized to identify calves that did not finish the production cycle with their cohort. Following classifier training, accuracy was evaluated using validation data. Analysis was also done to identify sub-groups of calves within populations where classifiers performed better compared to other sub-groups. Accuracy of individual classifiers varied by dataset. The accuracy of the best performing classifier by dataset ranged from a low of 63% in one dataset up to 95% in a different dataset. Sub-groups of calves were identified within some datasets where classifier accuracy was greater than 98%; however, these accuracies must be interpreted in relation to the prevalence of the class of interest within those populations. We found that by pairing the correct classifier with the data available, accurate predictions could be made that would provide feedlot managers with valuable information.
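The caveat in that abstract — that a high accuracy must be read against class prevalence — comes down to a one-line calculation. The numbers below are hypothetical, not taken from the study:

```python
# If only 2% of calves belong to the class of interest, a model that always
# predicts the majority class is already 98% "accurate" without learning anything.
prevalence = 0.02
majority_baseline = max(prevalence, 1 - prevalence)   # accuracy of always-majority

reported_accuracy = 0.98
lift_over_baseline = reported_accuracy - majority_baseline  # zero lift here
```

A reported 98% accuracy is therefore only meaningful when it clearly exceeds this baseline, which is why prevalence-insensitive metrics (sensitivity, AUC) are often preferred for rare outcomes.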

    RiskLogitboost Regression for Rare Events in Binary Response: An Econometric Approach

    A boosting-based machine learning algorithm is presented to model a binary response with large imbalance, i.e., a rare event. The new method (i) reduces the prediction error on the rare class, and (ii) approximates an econometric model that allows interpretability. RiskLogitboost regression includes a weighting mechanism that oversamples or undersamples observations according to their misclassification likelihood, and a generalized least squares bias correction strategy to reduce the prediction error. An illustration using a real French third-party liability motor insurance data set is presented. The results show that RiskLogitboost regression improves the rate of detection of rare events compared to some boosting-based and tree-based algorithms and some existing methods designed to treat imbalanced responses.
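The weighting mechanism described above builds on the standard LogitBoost machinery, where each round computes per-observation working weights and responses from the current predicted probabilities. The sketch below shows only that classical step (Friedman-style), with invented probabilities; RiskLogitboost's own misclassification-likelihood weighting and GLS bias correction are not reproduced here.

```python
import numpy as np

# Toy state of one boosting round: true labels and current predicted P(y=1).
y = np.array([0, 0, 0, 1, 1])             # the last two are the rare class
p = np.array([0.1, 0.2, 0.6, 0.3, 0.8])   # current model probabilities

# Classical LogitBoost working weights and working response: the next weak
# learner is fitted to z by weighted least squares with weights w. Weights
# peak for observations the model is most uncertain about (p near 0.5).
w = p * (1 - p)
z = (y - p) / w
```

Observations the model handles badly (e.g. a rare event with low predicted probability) get large working responses, so the next round's fit is pulled toward correcting them.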

    Using data mining to predict automobile insurance fraud

    This thesis presents a study on the issue of automobile insurance fraud. The purpose of this study is to increase knowledge concerning fraudulent claims in the Portuguese market, while raising awareness of the use of Data Mining techniques for this and other similar problems. We conduct an application of data mining techniques to the problem of predicting automobile insurance fraud, shown to be of interest to insurance companies around the world. We present fraud definitions and conduct an overview of existing literature on the subject. Live policy and claim data from the Portuguese insurance market in 2005 is used to train a Logit Regression Model and a CHAID Classification and Regression Tree. The use of Data Mining tools and techniques enabled the identification of underlying fraud patterns, specific to the raw data used to build the models. The list of potential fraud indicators includes variables such as the policy’s tenure, the number of policy holders, not admitting fault in the accident, or fractioning premium payments semiannually. Other variables such as the number of days between the accident and the filing of the claim, the client’s age, and the geographical location of the accident were also found to be relevant in specific sub-populations of the dataset. Model variables and coefficients are interpreted comparatively and key performance results are presented, including PCC, sensitivity, specificity, and AUROC. Both the Logit Model and the CHAID C&R Tree achieve fair results in predicting automobile insurance fraud on the dataset used.
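The performance measures named in that abstract (PCC, sensitivity, specificity, AUROC) all derive from a score threshold and the resulting confusion matrix. A minimal sketch with hypothetical fraud scores, not the thesis's data:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

# Hypothetical labels (1 = fraudulent claim) and model scores.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 0])
scores = np.array([0.1, 0.2, 0.15, 0.3, 0.05, 0.4, 0.7, 0.6, 0.35, 0.55])
y_pred = (scores >= 0.5).astype(int)      # threshold the scores at 0.5

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
pcc = (tp + tn) / len(y_true)             # percentage correctly classified
sensitivity = tp / (tp + fn)              # fraction of fraud cases caught
specificity = tn / (tn + fp)              # fraction of legitimate claims cleared
auroc = roc_auc_score(y_true, scores)     # threshold-free ranking quality
```

Because fraud is rare, the thesis reports all four rather than PCC alone: sensitivity and AUROC expose whether the minority class is actually being detected.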