19 research outputs found

    Inspecting Credit Card Fraud Identification Via Data Mining Classification Methods And Machine Learning Algorithms

    The rapid adoption of online transactional activity has increased fraud cases worldwide and caused significant losses to both the financial sector and individuals. Credit card fraud is among the most prevalent and concerning crimes in the financial industry, and the one online shoppers worry about most. Data mining techniques were primarily used to investigate the patterns and traits of suspicious and non-suspicious transactions using normalised and anomaly data, while machine learning (ML) techniques used classifiers to automatically determine which transactions were fraudulent and which were not. By learning the patterns in the data, the combination of data mining and machine learning algorithms was thus able to distinguish genuine transactions from fraudulent ones.

    Ensemble of Example-Dependent Cost-Sensitive Decision Trees

    Several real-world classification problems are example-dependent cost-sensitive in nature, where the costs due to misclassification vary between examples and not only within classes. However, standard classification methods do not take these costs into account and assume a constant cost of misclassification errors. In previous works, some methods that incorporate the financial costs into the training of different algorithms have been proposed, with the example-dependent cost-sensitive decision tree algorithm being the one that yields the highest savings. In this paper we propose a new framework of ensembles of example-dependent cost-sensitive decision trees. The framework consists of creating different example-dependent cost-sensitive decision trees on random subsamples of the training set and then combining them using three different combination approaches. Moreover, we propose two new cost-sensitive combination approaches: cost-sensitive weighted voting and cost-sensitive stacking, the latter being based on the cost-sensitive logistic regression method. Finally, using five different databases from four real-world applications (credit card fraud detection, churn modeling, credit scoring and direct marketing), we evaluate the proposed method against state-of-the-art example-dependent cost-sensitive techniques, namely cost-proportionate sampling, Bayes minimum risk and cost-sensitive decision trees. The results show that the proposed algorithms yield better results for all databases, in the sense of higher savings. (Comment: 13 pages, 6 figures, submitted for possible publication)
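
    The sketch below illustrates the general idea of bagging decision trees on random subsamples and combining them with cost-sensitive weighted voting; it is not the authors' implementation. The cost model is a hypothetical example-dependent one (a missed fraud costs the transaction amount, a false alarm costs a fixed fee C_FP), and all function names are illustrative.

    ```python
    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    C_FP = 5.0  # assumed fixed cost of investigating a flagged transaction

    def example_cost(y_true, y_pred, amount):
        """Example-dependent cost: amount lost on missed frauds, fee on false alarms."""
        fn = (y_true == 1) & (y_pred == 0)
        fp = (y_true == 0) & (y_pred == 1)
        return np.sum(amount[fn]) + C_FP * np.sum(fp)

    def savings(y_true, y_pred, amount):
        """Savings relative to the trivial 'accept everything' policy."""
        base = example_cost(y_true, np.zeros_like(y_true), amount)
        return 0.0 if base == 0 else 1.0 - example_cost(y_true, y_pred, amount) / base

    def fit_ensemble(X, y, amount, n_trees=10, sample_frac=0.5, seed=0):
        """Train decision trees on random subsamples; weight each tree by its savings."""
        rng = np.random.default_rng(seed)
        trees, weights = [], []
        for _ in range(n_trees):
            idx = rng.choice(len(X), size=int(sample_frac * len(X)), replace=True)
            tree = DecisionTreeClassifier(max_depth=5).fit(X[idx], y[idx])
            trees.append(tree)
            weights.append(max(savings(y, tree.predict(X), amount), 1e-6))
        return trees, np.array(weights)

    def predict_ensemble(trees, weights, X):
        """Cost-sensitive weighted vote over the ensemble."""
        votes = np.array([t.predict(X) for t in trees], dtype=float)
        return (weights @ votes / weights.sum() >= 0.5).astype(int)
    ```

    Here X, y and amount are assumed to be NumPy arrays of features, 0/1 fraud labels and transaction amounts; the cost-sensitive stacking variant mentioned in the abstract would replace the weighted vote with a second-level cost-sensitive model.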

    Association Parameters for Determining the Correlation Between Major and Cumulative Grade Point Average (Parameter Asosiasi untuk Menentukan Korelasi Jurusan dan Indeks Prestasi Kumulatif)

    One of the problems in higher education is prospective students choosing the wrong major. This happens when the fit between the major taken at the original school and the major chosen in higher education is not considered, which affects not only the learning process and its outcomes (for example, a low GPA) but also social life, for instance through increased unemployment. Selecting the right major is therefore very important, and helping prospective students choose requires an online system, accessible to everyone, that matches the original school major against majors in higher education. This system uses association rules and the support and confidence parameters from data mining. The purpose of this research is to determine the correlation between the major in the original school, the major in higher education and the GPA achieved, using the support and confidence parameters to process a knowledge base in the form of an alumni database within the online system created. Training and testing were conducted on 10,254 records in the database and produced new information and knowledge: the original school major, the chosen major in higher education and the GPA are strongly correlated, with confidence values reaching 100%.
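
    For reference, a minimal sketch of the support and confidence parameters named above, computed over a toy alumni table; the column names and values are illustrative only and do not come from the paper's database.

    ```python
    import pandas as pd

    alumni = pd.DataFrame({
        "school_major": ["Science", "Science", "Social", "Science", "Social"],
        "degree_major": ["Informatics", "Informatics", "Economics", "Informatics", "Law"],
        "gpa_class":    ["High", "High", "Medium", "High", "Medium"],
    })

    def support(df, antecedent, consequent):
        """Fraction of all records matching both the antecedent and the consequent."""
        both = df.loc[(df[list(antecedent)] == pd.Series(antecedent)).all(axis=1) &
                      (df[list(consequent)] == pd.Series(consequent)).all(axis=1)]
        return len(both) / len(df)

    def confidence(df, antecedent, consequent):
        """Of the records matching the antecedent, the fraction also matching the consequent."""
        ante = df.loc[(df[list(antecedent)] == pd.Series(antecedent)).all(axis=1)]
        if len(ante) == 0:
            return 0.0
        cons = ante.loc[(ante[list(consequent)] == pd.Series(consequent)).all(axis=1)]
        return len(cons) / len(ante)

    rule_if  = {"school_major": "Science", "degree_major": "Informatics"}
    rule_then = {"gpa_class": "High"}
    print(support(alumni, rule_if, rule_then))     # 0.6
    print(confidence(alumni, rule_if, rule_then))  # 1.0, i.e. confidence of 100%
    ```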

    Machine Learning in the Detection of E-commerce Fraud Applied to Banking Services (Machine Learning en la detección de fraudes de comercio electrónico aplicado a los servicios bancarios)

    One of the main risks to which financial institutions are exposed is electronic fraud attacks. Billions of dollars in losses are absorbed each year by financial institutions due to fraudulent transactions. This article presents a model that addresses the main challenges in designing a fraud detection system: a) highly unbalanced classes, b) the stationary distribution of the data, and c) the online incorporation of feedback from fraud investigators on transactions labelled as suspicious. Applying the model to a test dataset made it possible to correctly predict the majority of fraudulent transactions with a minimal percentage of false negatives.
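
    As a rough illustration (not the paper's model), the sketch below shows two of the listed challenges handled with scikit-learn stand-ins: class weighting for the highly unbalanced classes, and an incremental update that folds investigator feedback back into the model. All data and field names are synthetic.

    ```python
    import numpy as np
    from sklearn.linear_model import SGDClassifier
    from sklearn.preprocessing import StandardScaler
    from sklearn.utils.class_weight import compute_sample_weight

    rng = np.random.default_rng(0)
    X = rng.normal(size=(5000, 10))
    y = (rng.random(5000) < 0.02).astype(int)   # ~2% fraud: highly unbalanced classes

    scaler = StandardScaler().fit(X)
    clf = SGDClassifier(loss="log_loss", random_state=0)

    # Weight the rare fraud class more heavily to counter the imbalance.
    w = compute_sample_weight("balanced", y)
    clf.partial_fit(scaler.transform(X), y, classes=np.array([0, 1]), sample_weight=w)

    # Later: investigators confirm or reject a batch of suspicious transactions;
    # the feedback is folded into the model without retraining from scratch.
    X_fb = rng.normal(size=(50, 10))
    y_fb = rng.integers(0, 2, size=50)
    clf.partial_fit(scaler.transform(X_fb), y_fb, sample_weight=np.ones(len(y_fb)))

    print(clf.decision_function(scaler.transform(X[:5])))  # higher score = more suspicious
    ```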

    Credit Card Fraud Detection Using Machine Learning As Data Mining Technique

    The rapid participation in online transactional activities has raised fraud cases all over the world and caused tremendous losses to individuals and the financial industry. Although many criminal activities occur in the financial industry, credit card fraud is among the most prevalent and the one online customers worry about most. Countering fraud through data mining and machine learning is thus one of the prominent approaches introduced by scholars to prevent the losses caused by these illegal acts. Primarily, data mining techniques were employed to study the patterns and characteristics of suspicious and non-suspicious transactions based on normalized and anomalous data. Machine learning (ML) techniques, on the other hand, were employed to predict suspicious and non-suspicious transactions automatically by using classifiers. The combination of machine learning and data mining techniques was therefore able to identify genuine and non-genuine transactions by learning the patterns of the data. This paper discusses supervised classification using Bayesian network classifiers, namely K2, Tree Augmented Naïve Bayes (TAN) and Naïve Bayes, together with logistic and J48 classifiers. After preprocessing the dataset using normalization and Principal Component Analysis, all the classifiers achieved more than 95.0% accuracy, compared to the results attained before preprocessing the dataset.
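
    A minimal sketch of the preprocessing plus supervised classification pipeline described above, using scikit-learn stand-ins: min-max normalisation, PCA, then a Gaussian Naïve Bayes classifier. The paper's K2, TAN and J48 classifiers are Weka algorithms and are not reproduced here; the data below is synthetic.

    ```python
    import numpy as np
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import MinMaxScaler
    from sklearn.decomposition import PCA
    from sklearn.naive_bayes import GaussianNB
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score

    rng = np.random.default_rng(0)
    X = rng.normal(size=(2000, 20))               # stand-in transaction features
    y = (rng.random(2000) < 0.05).astype(int)     # stand-in fraud labels

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0, stratify=y)

    model = Pipeline([
        ("normalise", MinMaxScaler()),    # normalisation step
        ("pca", PCA(n_components=10)),    # Principal Component Analysis
        ("nb", GaussianNB()),             # Naïve Bayes classifier
    ])
    model.fit(X_tr, y_tr)
    print("accuracy:", accuracy_score(y_te, model.predict(X_te)))
    ```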

    Comparative Analysis of Different Distributions Dataset by Using Data Mining Techniques on Credit Card Fraud Detection

    Banks suffer multimillion-dollar losses each year for several reasons, the most important of which is credit card fraud. The issue is how to cope with the challenges this kind of fraud poses, the most significant being the skewed class imbalance. In this study we therefore explore four data mining techniques, namely Naïve Bayes (NB), Support Vector Machine (SVM), K-Nearest Neighbor (KNN) and Random Forest (RF), on actual credit card transactions from European cardholders. This paper offers four major contributions. First, we used under-sampling to balance the dataset because of its high class imbalance, which implies a skewed distribution. Second, we applied NB, SVM, KNN and RF to the under-sampled data to classify transactions as fraudulent or genuine, then measured and compared their performance using a confusion matrix. Third, we adopted 10-fold cross-validation (CV) to test the accuracy of the four models, with standard deviations, and compared the results across models. Next, we examined these models against the entire (skewed) dataset using the confusion matrix and the AUC (Area Under the ROC Curve) ranking measure to determine which model is best suited to this type of fraud. The best accuracies for the NB, SVM, KNN and RF classifiers are 97.80%, 97.46%, 98.16% and 98.23%, respectively. Comparative results obtained with four train/test splits (75:25, 90:10, 66:34 and 80:20) show that RF outperforms NB, SVM and KNN, and applying the proposed models to the entire (skewed) dataset achieved better outcomes than on the under-sampled dataset.
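
    The sketch below outlines the evaluation protocol described above: random under-sampling of the majority class, the four classifiers, 10-fold cross-validation on the balanced data, and confusion-matrix / AUC scoring on the full skewed data. It is an illustrative scikit-learn approximation on synthetic data, not the authors' code.

    ```python
    import numpy as np
    from sklearn.naive_bayes import GaussianNB
    from sklearn.svm import SVC
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score
    from sklearn.metrics import confusion_matrix, roc_auc_score

    rng = np.random.default_rng(0)
    X = rng.normal(size=(10000, 15))
    y = (rng.random(10000) < 0.01).astype(int)       # heavily skewed classes

    # Random under-sampling: keep all frauds and an equal number of genuine rows.
    fraud = np.flatnonzero(y == 1)
    genuine = rng.choice(np.flatnonzero(y == 0), size=len(fraud), replace=False)
    idx = np.concatenate([fraud, genuine])
    X_bal, y_bal = X[idx], y[idx]

    models = {
        "NB": GaussianNB(),
        "SVM": SVC(probability=True),
        "KNN": KNeighborsClassifier(),
        "RF": RandomForestClassifier(n_estimators=100, random_state=0),
    }
    for name, model in models.items():
        acc = cross_val_score(model, X_bal, y_bal, cv=10)        # 10-fold CV accuracy
        model.fit(X_bal, y_bal)
        auc = roc_auc_score(y, model.predict_proba(X)[:, 1])     # AUC on the full skewed set
        print(name, acc.mean(), acc.std(), auc)
        print(confusion_matrix(y, model.predict(X)))
    ```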

    Maximizing gain in high-throughput screening using conformal prediction

    Iterative screening has emerged as a promising approach to increase the efficiency of screening campaigns compared to traditional high-throughput approaches. By learning from a subset of the compound library, predictive models can infer which compounds to screen next, resulting in more efficient screening. One way to evaluate screening is to consider the cost of screening compared to the gain associated with finding an active compound. In this work, we introduce a conformal predictor coupled with a gain-cost function with the aim of maximising gain in iterative screening. Using this setup we were able to show that, by evaluating the predictions on the training data, very accurate predictions can be made about which settings will produce the highest gain on the test data. We evaluate the approach on 12 bioactivity datasets from PubChem, training the models using 20% of the data. Depending on the settings of the gain-cost function, the settings generating the maximum gain were accurately identified in 8–10 out of the 12 datasets. Broadly, our approach can predict which strategy generates the highest gain based on the results of the cost-gain evaluation: screen the compounds predicted to be active, screen all the remaining data, or do not screen any additional compounds. When the algorithm indicates that the predicted active compounds should be screened, our approach also indicates what confidence level to apply in order to maximise gain. Hence, our approach facilitates decision-making and allocates resources where they deliver the most value by indicating in advance the likely outcome of a screening campaign. The research at Swetox (UN) was supported by the Knut and Alice Wallenberg Foundation and the Swedish Research Council FORMAS. AMA was supported by AstraZeneca.
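
    A minimal sketch, not the authors' code, of an inductive (Mondrian-style) conformal predictor combined with a simple gain-cost rule: screen the compounds whose prediction set at significance eps contains the "active" class. The gain and cost figures, the 20%/80% split and the underlying model are illustrative assumptions.

    ```python
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    GAIN_PER_ACTIVE, COST_PER_COMPOUND = 100.0, 1.0   # hypothetical gain-cost parameters

    rng = np.random.default_rng(0)
    X = rng.normal(size=(5000, 30))
    y = (X[:, 0] + rng.normal(scale=1.5, size=5000) > 1.5).astype(int)   # 1 = active

    # Train on 20% of the library, as in the evaluation above; screen the rest.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.8, random_state=0)
    X_prop, X_cal, y_prop, y_cal = train_test_split(X_train, y_train, test_size=0.3, random_state=0)

    clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_prop, y_prop)

    # Mondrian nonconformity: 1 - probability assigned to the example's own class.
    cal_p = clf.predict_proba(X_cal)
    alpha_cal = {c: 1.0 - cal_p[y_cal == c, c] for c in (0, 1)}

    def p_value(prob_row, c):
        alpha = 1.0 - prob_row[c]
        return (np.sum(alpha_cal[c] >= alpha) + 1) / (len(alpha_cal[c]) + 1)

    test_p = clf.predict_proba(X_test)
    eps = 0.2                                          # significance level (1 - confidence)
    screen = np.array([p_value(row, 1) > eps for row in test_p])

    # Gain-cost evaluation of the "screen predicted actives" strategy.
    gain = GAIN_PER_ACTIVE * np.sum(y_test[screen]) - COST_PER_COMPOUND * np.sum(screen)
    print(f"screen {screen.sum()} compounds, gain = {gain:.1f}")
    ```

    Sweeping eps over a grid and picking the value with the highest estimated gain mirrors, in spirit, the abstract's point that the approach indicates what confidence level to apply in order to maximise gain.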