
    An advanced extended binomial GLMBoost ensemble method with synthetic minority over-sampling technique for handling imbalanced datasets

    Classification is an important activity in a variety of domains, and the class imbalance problem reduces the performance of traditional classification approaches. An imbalance problem arises when the class distributions among the instances of a classification dataset are mismatched. This study proposes an advanced extended binomial GLMBoost (EBGLMBoost) ensemble coupled with the synthetic minority over-sampling technique (SMOTE) to manage imbalance issues. SMOTE balances the distribution of the target variable, while the GLMBoost ensemble is built to handle the imbalanced datasets themselves. Twenty different datasets are used across the experiments, and support vector machine (SVM), Nu-SVM, bagging, and AdaBoost classifiers are compared against the suggested method. The model's sensitivity, specificity, geometric mean (G-mean), precision, recall, and F-measure are 99.37%, 66.95%, 80.81%, 99.21%, 99.37%, and 99.29% on the training datasets and 98.61%, 54.78%, 69.88%, 98.77%, 96.61%, and 98.68% on the testing datasets, respectively. A Wilcoxon test confirms that the proposed technique performs well on unbalanced data. Finally, the proposed solution is capable of efficiently dealing with the class imbalance problem.
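
    A minimal sketch of the SMOTE-plus-boosting pattern this abstract describes, assuming scikit-learn and imbalanced-learn. EBGLMBoost itself is not publicly available, so GradientBoostingClassifier stands in for the boosted GLM ensemble, and synthetic data stands in for the twenty benchmark datasets.

```python
# Sketch only: GradientBoostingClassifier is a stand-in for EBGLMBoost.
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data standing in for the paper's datasets.
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

# SMOTE synthesizes minority-class samples so the booster trains on a
# balanced target distribution; the pipeline applies it to training data only.
model = Pipeline([
    ("smote", SMOTE(random_state=42)),
    ("boost", GradientBoostingClassifier(n_estimators=200)),
])
model.fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))
```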

    XGBOD: Improving Supervised Outlier Detection with Unsupervised Representation Learning

    A new semi-supervised ensemble algorithm called XGBOD (Extreme Gradient Boosting Outlier Detection) is proposed, described, and demonstrated for the enhanced detection of outliers from normal observations in various practical datasets. The proposed framework combines the strengths of supervised and unsupervised machine learning methods in a hybrid approach that exploits each of their individual capabilities in outlier detection. XGBOD uses multiple unsupervised outlier mining algorithms to extract useful representations from the underlying data, augmenting the predictive capabilities of an embedded supervised classifier on an improved feature space. The approach is shown to provide superior performance in comparison to competing individual detectors, the full ensemble, and two existing representation-learning-based algorithms across seven outlier datasets. Comment: Proceedings of the 2018 International Joint Conference on Neural Networks (IJCNN).
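
    The core XGBOD idea can be sketched as follows: score the data with unsupervised outlier detectors, append those scores as extra feature columns, and train a gradient-boosted classifier on the augmented space. The detector pair below (IsolationForest, LOF) is an illustrative assumption rather than the paper's exact detector set; the PyOD library also ships a full XGBOD implementation.

```python
# Sketch of the XGBOD pattern with an assumed, simplified detector set.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import IsolationForest
from sklearn.metrics import roc_auc_score
from sklearn.neighbors import LocalOutlierFactor
from xgboost import XGBClassifier

X, y = make_classification(n_samples=1500, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = X[:1000], X[1000:], y[:1000], y[1000:]

# Fit the unsupervised detectors on training data only.
iso = IsolationForest(random_state=0).fit(X_train)
lof = LocalOutlierFactor(novelty=True).fit(X_train)

def augment(X):
    # Each detector contributes one outlier-score column (a learned
    # "representation") appended to the original feature space.
    return np.hstack([X, np.column_stack([iso.score_samples(X),
                                          lof.score_samples(X)])])

clf = XGBClassifier(n_estimators=200, eval_metric="logloss")
clf.fit(augment(X_train), y_train)
print("AUC:", roc_auc_score(y_test, clf.predict_proba(augment(X_test))[:, 1]))
```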

    Predicting Patient Satisfaction With Ensemble Methods

    Health plans are constantly seeking ways to assess and improve the quality of patient experience in various ambulatory and institutional settings. Standardized surveys are a common tool used to gather data about patient experience, and a useful measurement taken from these surveys is known as the Net Promoter Score (NPS). This score represents the extent to which a patient would, or would not, recommend his or her physician on a scale from 0 to 10, where 0 corresponds to "Extremely unlikely" and 10 to "Extremely likely". A large national health plan utilized automated calls to distribute such a survey to its members and was interested in understanding what factors contributed to a patient's satisfaction. Additionally, they were interested in whether or not NPS could be predicted using responses from other questions on the survey, along with demographic data. When the distributions of various predictors were compared between the less satisfied and highly satisfied members, there was significant overlap, indicating that not even the Bayes classifier could successfully differentiate between these members. Moreover, the highly imbalanced proportion of NPS responses resulted in poor initial prediction accuracy. Thus, due to the non-linear structure of the data and the high number of categorical predictors, we leveraged flexible methods, such as decision trees, bagging, and random forests, for modeling and prediction. We further altered the prediction step in the random forest algorithm in order to account for the imbalanced structure of the data.
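
    One common way to alter the random-forest prediction step for imbalanced data, in the spirit of the modification mentioned above, is to keep the fitted forest but lower the minority-class probability threshold below the default 0.5. The threshold value in this sketch is an assumption to be tuned on validation data, not the paper's actual adjustment.

```python
# Sketch of an imbalance-aware prediction step for a random forest.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=3000, weights=[0.9, 0.1], random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=1)

forest = RandomForestClassifier(n_estimators=500, class_weight="balanced",
                                random_state=1).fit(X_train, y_train)

threshold = 0.3  # assumed value, not from the paper; tune on validation data
proba = forest.predict_proba(X_test)[:, 1]
y_pred = (proba >= threshold).astype(int)  # flags more of the rare class
```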

    Early hospital mortality prediction using vital signals

    Early hospital mortality prediction is critical as intensivists strive to make efficient medical decisions about severely ill patients staying in intensive care units. As a result, various methods have been developed to address this problem based on clinical records. However, some laboratory test results are time-consuming to obtain and process. In this paper, we propose a novel method to predict mortality using features extracted from the heart signals of patients within the first hour of ICU admission. In order to predict the risk, quantitative features are computed from the heart rate signals of ICU patients, with each signal described in terms of 12 statistical and signal-based features. The extracted features are fed into eight classifiers: decision tree, linear discriminant, logistic regression, support vector machine (SVM), random forest, boosted trees, Gaussian SVM, and K-nearest neighbors (K-NN). To derive insight into the performance of the proposed method, several experiments were conducted using the well-known clinical dataset Medical Information Mart for Intensive Care III (MIMIC-III). The experimental results demonstrate the capability of the proposed method in terms of precision, recall, F1-score, and area under the receiver operating characteristic curve (AUC). The decision tree classifier satisfies both accuracy and interpretability better than the other classifiers, producing an F1-score and AUC equal to 0.91 and 0.93, respectively. This indicates that heart rate signals can be used to predict mortality in ICU patients, achieving performance comparable to existing predictors that rely on high-dimensional features from clinical records, which need to be processed and may contain missing information. Comment: 11 pages, 5 figures; preprint of an accepted paper in IEEE/ACM CHASE 2018, published in the Smart Health journal.
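
    A rough sketch of the feature-extraction step, assuming NumPy, SciPy, and scikit-learn: summarize each heart rate series with simple statistical descriptors and fit a shallow, interpretable decision tree. The eight descriptors below are generic examples rather than the paper's exact twelve features, and the synthetic signals merely stand in for MIMIC-III recordings.

```python
# Sketch: statistical features from a heart-rate series, then a shallow tree.
import numpy as np
from scipy import stats
from sklearn.tree import DecisionTreeClassifier

def hr_features(signal):
    # Generic statistical descriptors of one heart-rate series.
    s = np.asarray(signal, dtype=float)
    return [s.mean(), s.std(), s.min(), s.max(),
            stats.skew(s), stats.kurtosis(s),
            np.percentile(s, 25), np.percentile(s, 75)]

rng = np.random.default_rng(0)
hr_signals = [rng.normal(80, 10, size=3600) for _ in range(200)]  # synthetic stand-ins
labels = rng.integers(0, 2, size=200)                             # synthetic outcomes

X = np.array([hr_features(sig) for sig in hr_signals])
tree = DecisionTreeClassifier(max_depth=4).fit(X, labels)  # shallow tree stays interpretable
```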

    Efficient Fraud Detection in Ethereum Blockchain through Machine Learning and Deep Learning Approaches

    Background: This paper tackles the critical challenge of detecting fraudulent transactions within the Ethereum blockchain using machine learning techniques. With the burgeoning importance of blockchain, ensuring its security against fraudulent activities is crucial to prevent significant monetary losses. We utilized a public dataset comprising 9,841 Ethereum transactions, characterized by attributes such as gas price, transaction fee, and timestamp.
    Methods: Our approach is bifurcated into two core phases: data preprocessing and predictive modeling. In the data preprocessing phase, we meticulously process the dataset and extract pivotal features from transactions, setting the stage for efficient predictive modeling.
    Findings: For predictive modeling, we employed several machine learning algorithms to discern between fraudulent and legitimate transactions. Our evaluation encompassed algorithms like decision trees, logistic regression, gradient boosting, XGBoost, and an innovative hybrid model that melds random forests with deep neural networks (DNN).
    Novelty: Our findings underscore that the proposed model boasts a precision rate of 97.16%, marking a substantial leap in fraudulent transaction detection on the Ethereum blockchain in comparison to prevailing methodologies. This paper augments the current efforts aimed at bolstering the security of blockchain transactions using sophisticated analytical strategies.
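
    A hedged sketch of the two-phase pipeline, assuming pandas and XGBoost: light preprocessing of transaction features followed by a gradient-boosted classifier. The file name and column names below are hypothetical placeholders; the public dataset's actual schema may differ, and the hybrid RF+DNN model is not reproduced here.

```python
# Sketch only: file name and columns are hypothetical placeholders.
import pandas as pd
from sklearn.metrics import precision_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

df = pd.read_csv("ethereum_transactions.csv")   # hypothetical file name
features = ["gas_price", "transaction_fee"]     # assumed feature columns
X = df[features].fillna(df[features].median())  # simple median imputation
y = df["is_fraud"]                              # assumed label column

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y,
                                          test_size=0.2, random_state=0)
model = XGBClassifier(n_estimators=300, eval_metric="logloss").fit(X_tr, y_tr)
print("precision:", precision_score(y_te, model.predict(X_te)))
```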

    AdaCC: cumulative cost-sensitive boosting for imbalanced classification

    Class imbalance poses a major challenge for machine learning, as most supervised learning models exhibit bias towards the majority class and under-perform on the minority class. Cost-sensitive learning tackles this problem by treating the classes differently, typically via a user-defined fixed misclassification cost matrix provided as input to the learner. Such parameter tuning is a challenging task that requires domain knowledge; moreover, wrong adjustments might deteriorate overall predictive performance. In this work, we propose a novel cost-sensitive boosting approach for imbalanced data that dynamically adjusts the misclassification costs over the boosting rounds in response to the model's performance, instead of using a fixed misclassification cost matrix. Our method, called AdaCC, is parameter-free, as it relies on the cumulative behavior of the boosting model to adjust the misclassification costs for the next boosting round, and it comes with theoretical guarantees regarding the training error. Experiments on 27 real-world datasets from different domains with high class imbalance demonstrate the superiority of our method over 12 state-of-the-art cost-sensitive boosting approaches, showing consistent improvements across measures: in the range of 0.3–28.56% for AUC, 3.4–21.4% for balanced accuracy, 4.8–45% for G-mean, and 7.4–85.5% for recall.
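
    A heavily simplified sketch of the dynamic-cost idea, assuming scikit-learn: an AdaBoost-style loop whose per-class costs are recomputed every round from the cumulative ensemble's class-wise error, rather than fixed up front. The update rule below is illustrative only and is not AdaCC's exact formulation or its theoretical analysis.

```python
# Sketch: boosting loop with per-round, cumulatively-derived class costs.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

def cumulative_cost_boost(X, y, rounds=50):
    n = len(y)
    w = np.full(n, 1.0 / n)  # sample weights
    F = np.zeros(n)          # cumulative ensemble score
    learners, alphas = [], []
    for _ in range(rounds):
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        err = np.clip(np.sum(w[pred != y]), 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1.0 - err) / err)
        learners.append(stump)
        alphas.append(alpha)
        F += alpha * (2 * pred - 1)
        # Dynamic costs: the class the cumulative ensemble currently gets
        # wrong more often receives a larger cost for the next round.
        ens_pred = (F > 0).astype(int)
        cost = np.ones(n)
        for c in np.unique(y):
            mask = y == c
            cost[mask] += np.mean(ens_pred[mask] != c)
        w *= cost * np.exp(-alpha * (2 * y - 1) * (2 * pred - 1))
        w /= w.sum()
    return learners, alphas

# Usage on a synthetic imbalanced problem.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
learners, alphas = cumulative_cost_boost(X, y)
```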