1,178 research outputs found

    Credit scoring using genetic programming

    Get PDF
    Internship Report presented as the partial requirement for obtaining a Master's degree in Data Science and Advanced AnalyticsGrowing numbers in e-commerce orders lead to an increase in risk management to prevent default in payment. Default in payment is the failure of a customer to settle a bill within 90 days upon receipt. Frequently, credit scoring is employed to identify customers’ default probability. Credit scoring has been widely studied and many different methods in different fields of research have been proposed. The primary aim of this work is to develop a credit scoring model as a replacement for the pre risk check of the e-commerce risk management system risk solution services (rss). The pre risk check uses data of the order process and includes exclusion rules and a generic credit scoring model. The new model is supposed to work as a replacement for the whole pre risk check and has to be able to work in solitary and in unison with the rss main risk check. An application of Genetic Programming to credit scoring is presented. The model is developed on a real world data set provided by Arvato Financial Solutions. The data set contains order requests processed by rss. Results show that Genetic Programming outperforms the generic credit scoring model of the pre risk check in both classification accuracy and profit. Compared with Logistic Regression, Support Vector Machines and Boosted Trees, Genetic Programming achieved a similar classificatory accuracy. Furthermore, the Genetic Programming model can be used in combination with the rss main risk check in order to create a model with higher discriminatory power than its individual models

    Adaptive rule-based malware detection employing learning classifier systems

    Get PDF
    Efficient and accurate malware detection is increasingly becoming a necessity for society to operate. Existing malware detection systems have excellent performance in identifying known malware for which signatures are available, but poor performance in anomaly detection for zero day exploits for which signatures have not yet been made available or targeted attacks against a specific entity. The primary goal of this thesis is to provide evidence for the potential of learning classier systems to improve the accuracy of malware detection. A customized system based on a state-of-the-art learning classier system is presented for adaptive rule-based malware detection, which combines a rule-based expert system with evolutionary algorithm based reinforcement learning, thus creating a self-training adaptive malware detection system which dynamically evolves detection rules. This system is analyzed on a benchmark of malicious and non-malicious files. Experimental results show that the system can outperform C4.5, a well-known non-adaptive machine learning algorithm, under certain conditions. The results demonstrate the system\u27s ability to learn effective rules from repeated presentations of a tagged training set and show the degree of generalization achieved on an independent test set. This thesis is an extension and expansion of the work published in the Security, Trust, and Privacy for Software Applications workshop in COMPSAC 2011 - the 35th Annual IEEE Signature Conference on Computer Software and Applications --Abstract, page iii

    Detection of Thin Boundaries between Different Types of Anomalies in Outlier Detection using Enhanced Neural Networks

    Full text link
    Outlier detection has received special attention in various fields, mainly for those dealing with machine learning and artificial intelligence. As strong outliers, anomalies are divided into the point, contextual and collective outliers. The most important challenges in outlier detection include the thin boundary between the remote points and natural area, the tendency of new data and noise to mimic the real data, unlabelled datasets and different definitions for outliers in different applications. Considering the stated challenges, we defined new types of anomalies called Collective Normal Anomaly and Collective Point Anomaly in order to improve a much better detection of the thin boundary between different types of anomalies. Basic domain-independent methods are introduced to detect these defined anomalies in both unsupervised and supervised datasets. The Multi-Layer Perceptron Neural Network is enhanced using the Genetic Algorithm to detect newly defined anomalies with higher precision so as to ensure a test error less than that calculated for the conventional Multi-Layer Perceptron Neural Network. Experimental results on benchmark datasets indicated reduced error of anomaly detection process in comparison to baselines

    Investigation of artificial immune systems and variable selection techniques for credit scoring

    Get PDF
    Most lending institutions are aware of the importance of having a well-performing credit scoring model or scorecard and know that, in order to remain competitive in the credit industry, it is necessary to continuously improve their scorecards. This is because better scorecards result in substantial monetary savings that can be stated in terms of millions of dollars. Thus, there has been increasing interest in the application of new classifiers in credit scoring from both practitioners and researchers in the last few decades. Most of the recent work in this field has focused on the use of new and innovative techniques to classify applicants as either 'credit-worthy' or 'non-credit-worthy', with the aim of improving scorecard performance. In this thesis, we investigate the suitability of intelligent systems techniques for credit scoring. In particular, intelligent systems that use immunological metaphors are examined and used to build a learning and evolutionary classification algorithm. Our model, named Simple Artificial Immune System (SAIS), is based on the concepts of the natural immune system. The model uses applicants' credit details to classify them as either 'credit-worthy' or 'non-credit-worthy'. As part of the model development, we also investigate several techniques for selecting variables from the applicants' credit details. Variable selection is important as choosing the best set of variables can have a significant effect on the performance of scorecards. Interestingly, our results demonstrate that the traditional stepwise regression variable selection technique seems to perform better than many of the more recent techniques. A further contribution offered by this thesis is a detailed description of the scorecard development process. A detailed explanation of this process is not readily available in the literature and our description of the process is based on our own experiences and discussions with industry credit risk practitioners. We evaluate our model using both publicly available datasets as well as a very large set of real-world consumer credit scoring data obtained from a leading Australian bank. The evaluation results reveal that SAIS is a competitive classifier and is appropriate for developing scorecards which require a class decision as an outcome. Another conclusion reached is one confirmed by the existing literature, that even though more sophisticated scorecard development techniques, including SAIS, perform well compared to the traditional statistical methods, their performances are not statistically significantly different from the statistical methods. As with other intelligent systems techniques, SAIS is not explicitly designed to develop practical scorecards which require the generation of a score that represents the degree of confidence that an applicant will belong to a particular group. However, it is comparable to other intelligent systems techniques which are outperformed by statistical techniques for generating p ractical scorecards. Our final remark on this research is that even though SAIS does not seem to be quite suitable for developing practical scorecards, we still believe that there is room for improvement and that the natural immune system of the body has a number of avenues yet to be explored which could assist with the development of practical scorecards

    Genetic Programming for Classification with Unbalanced Data

    No full text
    In classification,machine learning algorithms can suffer a performance bias when data sets are unbalanced. Binary data sets are unbalanced when one class is represented by only a small number of training examples (called the minority class), while the other class makes up the rest (majority class). In this scenario, the induced classifiers typically have high accuracy on the majority class but poor accuracy on the minority class. As the minority class typically represents the main class-of-interest in many real-world problems, accurately classifying examples from this class can be at least as important as, and in some cases more important than, accurately classifying examples from the majority class. Genetic Programming (GP) is a promising machine learning technique based on the principles of Darwinian evolution to automatically evolve computer programs to solve problems. While GP has shown much success in evolving reliable and accurate classifiers for typical classification tasks with balanced data, GP, like many other learning algorithms, can evolve biased classifiers when data is unbalanced. This is because traditional training criteria such as the overall success rate in the fitness function in GP, can be influenced by the larger number of examples from the majority class. This thesis proposes a GP approach to classification with unbalanced data. The goal is to develop new internal cost-adjustment techniques in GP to improve classification performances on both the minority class and the majority class. By focusing on internal cost-adjustment within GP rather than the traditional databalancing techniques, the unbalanced data can be used directly or "as is" in the learning process. This removes any dependence on a sampling algorithm to first artificially re-balance the input data prior to the learning process. This thesis shows that by developing a number of new methods in GP, genetic program classifiers with good classification ability on the minority and the majority classes can be evolved. This thesis evaluates these methods on a range of binary benchmark classification tasks with unbalanced data. This thesis demonstrates that unlike tasks with multiple balanced classes where some dynamic (non-static) classification strategies perform significantly better than the simple static classification strategy, either a static or dynamic strategy shows no significant difference in the performance of evolved GP classifiers on these binary tasks. For this reason, the rest of the thesis uses this static classification strategy. This thesis proposes several new fitness functions in GP to perform cost adjustment between the minority and the majority classes, allowing the unbalanced data sets to be used directly in the learning process without sampling. Using the Area under the Receiver Operating Characteristics (ROC) curve (also known as the AUC) to measure how well a classifier performs on the minority and majority classes, these new fitness functions find genetic program classifiers with high AUC on the tasks on both classes, and with fast GP training times. These GP methods outperform two popular learning algorithms, namely, Naive Bayes and Support Vector Machines on the tasks, particularly when the level of class imbalance is large, where both algorithms show biased classification performances. This thesis also proposes a multi-objective GP (MOGP) approach which treats the accuracies of the minority and majority classes separately in the learning process. The MOGP approach evolves a good set of trade-off solutions (a Pareto front) in a single run that perform as well as, and in some cases better than, multiple runs of canonical single-objective GP (SGP). In SGP, individual genetic program solutions capture the performance trade-off between the two objectives (minority and majority class accuracy) using an ROC curve; whereas in MOGP, this requirement is delegated to multiple genetic program solutions along the Pareto front. This thesis also shows how multiple Pareto front classifiers can be combined into an ensemble where individual members vote on the class label. Two ensemble diversity measures are developed in the fitness functions which treat the diversity on both the minority and the majority classes as equally important; otherwise, these measures risk being biased toward the majority class. The evolved ensembles outperform their individual members on the tasks due to good cooperation between members. This thesis further improves the ensemble performances by developing a GP approach to ensemble selection, to quickly find small groups of individuals that cooperate very well together in the ensemble. The pruned ensembles use much fewer individuals to achieve performances that are as good as larger (unpruned) ensembles, particularly on tasks with high levels of class imbalance, thereby reducing the total time to evaluate the ensemble

    An advance extended binomial GLMBoost ensemble method with synthetic minority over-sampling technique for handling imbalanced datasets

    Get PDF
    Classification is an important activity in a variety of domains. Class imbalance problem have reduced the performance of the traditional classification approaches. An imbalance problem arises when mismatched class distributions are discovered among the instances of class of classification datasets. An advance extended binomial GLMBoost (EBGLMBoost) coupled with synthetic minority over-sampling technique (SMOTE) technique is the proposed model in the study to manage imbalance issues. The SMOTE is used to solve the proposed model, ensuring that the target variable's distribution is balanced, whereas the GLMBoost ensemble techniques are built to deal with imbalanced datasets. For the entire experiment, twenty different datasets are used, and support vector machine (SVM), Nu-SVM, bagging, and AdaBoost classification algorithms are used to compare with the suggested method. The model's sensitivity, specificity, geometric mean (G-mean), precision, recall, and F-measure resulted in percentages for training and testing datasets are 99.37, 66.95, 80.81, 99.21, 99.37, 99.29 and 98.61, 54.78, 69.88, 98.77, 96.61, 98.68, respectively. With the help of the Wilcoxon test, it is determined that the proposed technique performed well on unbalanced data. Finally, the proposed solutions are capable of efficiently dealing with the problem of class imbalance

    Android Malware Detection System using Genetic Programming

    Get PDF
    Nowadays, smartphones and other mobile devices are playing a significant role in the way people engage in entertainment, communicate, network, work, and bank and shop online. As the number of mobile phones sold has increased dramatically worldwide, so have the security risks faced by the users, to a degree most do not realise. One of the risks is the threat from mobile malware. In this research, we investigate how supervised learning with evolutionary computation can be used to synthesise a system to detect Android mobile phone attacks. The attacks include malware, ransomware and mobile botnets. The datasets used in this research are publicly downloadable, available for use with appropriate acknowledgement. The primary source is Drebin. We also used ransomware and mobile botnet datasets from other Android mobile phone researchers. The research in this thesis uses Genetic Programming (GP) to evolve programs to distinguish malicious and non-malicious applications in Android mobile datasets. It also demonstrates the use of GP and Multi-Objective Evolutionary Algorithms (MOEAs) together to explore functional (detection rate) and non-functional (execution time and power consumption) trade-offs. Our results show that malicious and non-malicious applications can be distinguished effectively using only the permissions held by applications recorded in the application's Android Package (APK). Such a minimalist source of features can serve as the basis for highly efficient Android malware detection. Non-functional tradeoffs are also highlight

    Information gain directed genetic algorithm wrapper feature selection for credit rating

    Get PDF
    Financial credit scoring is one of the most crucial processes in the finance industry sector to be able to assess the credit-worthiness of individuals and enterprises. Various statistics-based machine learning techniques have been employed for this task. “Curse of Dimensionality” is still a significant challenge in machine learning techniques. Some research has been carried out on Feature Selection (FS) using genetic algorithm as wrapper to improve the performance of credit scoring models. However, the challenge lies in finding an overall best method in credit scoring problems and improving the time-consuming process of feature selection. In this study, the credit scoring problem is investigated through feature selection to improve classification performance. This work proposes a novel approach to feature selection in credit scoring applications, called as Information Gain Directed Feature Selection algorithm (IGDFS), which performs the ranking of features based on information gain, propagates the top m features through the GA wrapper (GAW) algorithm using three classical machine learning algorithms of KNN, Naïve Bayes and Support Vector Machine (SVM) for credit scoring. The first stage of information gain guided feature selection can help reduce the computing complexity of GA wrapper, and the information gain of features selected with the IGDFS can indicate their importance to decision making
    corecore