465 research outputs found

    Machine learning-driven credit risk: a systemic review

    Credit risk assessment is at the core of modern economies. Traditionally, it is measured by statistical methods and manual auditing. Recent advances in financial artificial intelligence stemmed from a new wave of machine learning (ML)-driven credit risk models that gained tremendous attention from both industry and academia. In this paper, we systematically review a series of major research contributions (76 papers) from the past eight years that use statistical, machine learning and deep learning techniques to address the problems of credit risk. Specifically, we propose a novel classification methodology for ML-driven credit risk algorithms and rank their performance using public datasets. We further discuss the challenges, including data imbalance, dataset inconsistency, model transparency, and inadequate utilization of deep learning models. The results of our review show that: 1) most deep learning models outperform classic machine learning and statistical algorithms in credit risk estimation, and 2) ensemble methods provide higher accuracy compared with single models. Finally, we present summary tables of the datasets and proposed models.

    Cost-sensitive ensemble learning: a unifying framework

    Over the years, a plethora of cost-sensitive methods have been proposed for learning on data when different types of misclassification errors incur different costs. Our contribution is a unifying framework that provides a comprehensive and insightful overview of cost-sensitive ensemble methods, pinpointing their differences and similarities via a fine-grained categorization. Our framework contains natural extensions and generalisations of ideas across methods, be it AdaBoost, Bagging or Random Forest, and as a result not only yields all methods known to date but also some not previously considered.
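
    As a rough illustration of what "different costs for different errors" means for an ensemble, the sketch below averages the members' positive-class probabilities and then picks the label with the lower expected misclassification cost. This is a generic decision rule, not the framework proposed in the paper; the function name and the cost parameters are hypothetical.

```python
def expected_cost_decision(member_probs, cost_fp, cost_fn):
    """Combine ensemble members and pick the cheaper label.

    member_probs: positive-class probabilities, one per ensemble member.
    cost_fp / cost_fn: assumed costs of a false positive / false negative
    (correct predictions are assumed to cost nothing).
    """
    p_pos = sum(member_probs) / len(member_probs)
    # Predicting positive is wrong with probability (1 - p_pos).
    cost_if_positive = (1.0 - p_pos) * cost_fp
    # Predicting negative is wrong with probability p_pos.
    cost_if_negative = p_pos * cost_fn
    return 1 if cost_if_positive < cost_if_negative else 0


if __name__ == "__main__":
    probs = [0.2, 0.3, 0.25]
    print(expected_cost_decision(probs, cost_fp=1.0, cost_fn=1.0))
    print(expected_cost_decision(probs, cost_fp=1.0, cost_fn=5.0))
```

    With equal costs the averaged probability of 0.25 leads to a negative prediction; once a false negative is assumed to be five times as costly, the same probability leads to a positive prediction.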

    Explainable predictive monitoring of temporal measures of business processes

    Modern enterprise systems collect detailed data about the execution of the business processes they support. The widespread availability of such data in companies, coupled with advances in machine learning, has led to the emergence of data-driven and predictive approaches to monitor the performance of business processes. By using such predictive process monitoring approaches, potential performance issues can be anticipated and proactively mitigated. Various approaches have been proposed to address typical predictive process monitoring questions, such as what the most likely continuation of an ongoing process instance is, or when it will finish. However, most existing approaches prioritize accuracy over explainability. Yet in practice, explainability is a critical property of predictive methods. It is not enough to accurately predict that a running process instance will end up in an undesired outcome. It is also important for users to understand why this prediction is made and what can be done to prevent this undesired outcome. This thesis proposes two methods to build predictive models to monitor business processes in an explainable manner. This is achieved by decomposing a prediction into its elementary components. For example, to explain that the remaining execution time of a process execution is predicted to be 20 hours, we decompose this prediction into the predicted execution time of each activity that has not yet been executed. We evaluate the proposed methods against each other and various state-of-the-art baselines using a range of business processes from multiple domains. The evaluation reaffirms a fundamental trade-off between explainability and accuracy of predictions. The research contributions of the thesis have been consolidated into an open-source tool for predictive business process monitoring, namely Nirdizati. It can be used to train predictive models using the methods described in this thesis, as well as third-party methods. These models are then used to make predictions for ongoing process instances; thus, the tool can also support users at runtime.
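
    The decomposition idea in the 20-hour example can be sketched in a few lines: the remaining time is the sum of the predicted durations of the activities that have not yet run, and the per-activity terms serve as the explanation. The activity names, the durations, and the helper below are hypothetical and are not taken from the thesis or from Nirdizati.

```python
def explain_remaining_time(predicted_hours):
    """Return the total remaining time and its per-activity breakdown.

    predicted_hours: mapping from a not-yet-executed activity to its
    predicted duration in hours (illustrative values only).
    """
    total = sum(predicted_hours.values())
    # Largest contributors first, so the explanation leads with them.
    breakdown = sorted(predicted_hours.items(), key=lambda kv: kv[1], reverse=True)
    return total, breakdown


if __name__ == "__main__":
    remaining = {"Check documents": 2.0, "Assess claim": 12.0, "Approve payment": 6.0}
    total, breakdown = explain_remaining_time(remaining)
    print(f"Predicted remaining time: {total} hours")
    for activity, hours in breakdown:
        print(f"  {activity}: {hours} hours")
```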

    Data Mining

    Data mining is a branch of computer science that is used to automatically extract meaningful, useful knowledge and previously unknown, hidden, interesting patterns from a large amount of data to support the decision-making process. This book presents recent theoretical and practical advances in the field of data mining. It discusses a number of data mining methods, including classification, clustering, and association rule mining. This book brings together many different successful data mining studies in various areas such as health, banking, education, software engineering, animal science, and the environment

    Cost-sensitive deep neural network ensemble for class imbalance problem

    In data mining, classification is the task of building a model that classifies data into a given set of categories. Most classification algorithms assume the class distribution of the data to be roughly balanced. In real-life applications such as direct marketing, fraud detection and churn prediction, the class imbalance problem usually occurs. The class imbalance problem refers to the issue that the number of examples belonging to one class is significantly greater than that of the others. When a standard classifier is trained on class-imbalanced data, the classifier is usually biased toward the majority class. However, the minority class is the class of interest and is more significant than the majority class. In the literature, existing methods such as data-level, algorithmic-level and cost-sensitive learning have been proposed to address this problem. The experiments discussed in these studies were usually conducted on relatively small data sets or even on artificial data. The performance of the methods on modern real-life data sets, which are more complicated, is unclear. In this research, we study the background and some of the state-of-the-art approaches that handle the class imbalance problem. We also propose two cost-sensitive methods to address the class imbalance problem, namely Cost-Sensitive Deep Neural Network (CSDNN) and Cost-Sensitive Deep Neural Network Ensemble (CSDE). CSDNN is a deep neural network based on Stacked Denoising Autoencoders (SDAE). We propose CSDNN by incorporating cost information of the majority and minority classes into the cost function of SDAE to make it cost-sensitive. The other proposed method, CSDE, is an ensemble learning version of CSDNN that is proposed to improve the generalization performance on the class imbalance problem. In the first step, a deep neural network based on SDAE is created for layer-wise feature extraction. Next, we perform Bagging's resampling procedure with undersampling to split the training data into a number of bootstrap samples. In the third step, we apply a layer-wise feature extraction method to extract new feature samples from each of the hidden layer(s) of the SDAE. Lastly, ensemble learning is performed by using each of the new feature samples to train a CSDNN classifier with a random cost vector. Experiments are conducted to compare the proposed methods with the existing methods. We examine their performance on real-life data sets in business domains. The results show that the proposed methods obtain promising results in handling the class imbalance problem and also outperform all the other compared methods. There are three major contributions of this work. First, we propose the CSDNN method, in which misclassification costs are considered in the training process. Second, we incorporate random undersampling with layer-wise feature extraction to perform ensemble learning. Third, this is the first work that conducts experiments on the class imbalance problem using large real-life data sets in different business domains, ranging from direct marketing, churn prediction, credit scoring and fraud detection to fake review detection.
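
    Two of the ingredients named above, bootstrap resampling with undersampling and a cost-weighted loss, can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not the authors' CSDNN or CSDE: the function names are hypothetical, the loss is an ordinary cost-weighted cross-entropy rather than the SDAE cost function, and the balanced 1:1 sampling ratio is an assumption.

```python
import math
import random


def undersampled_bootstrap(labels, minority_label, rng):
    """Return indices of one balanced bootstrap sample."""
    minority = [i for i, y in enumerate(labels) if y == minority_label]
    majority = [i for i, y in enumerate(labels) if y != minority_label]
    # Bootstrap the minority class (sampling with replacement).
    picked_minority = [rng.choice(minority) for _ in minority]
    # Undersample the majority class down to the minority-class size.
    picked_majority = rng.sample(majority, min(len(minority), len(majority)))
    return picked_minority + picked_majority


def cost_weighted_log_loss(y_true, p_pos, cost_minority, cost_majority):
    """Mean cross-entropy where errors on each class carry their own cost.

    Label 1 is assumed to be the minority class; p_pos holds predicted
    probabilities of label 1 and must lie strictly between 0 and 1.
    """
    total = 0.0
    for y, p in zip(y_true, p_pos):
        if y == 1:
            total -= cost_minority * math.log(p)
        else:
            total -= cost_majority * math.log(1.0 - p)
    return total / len(y_true)


if __name__ == "__main__":
    rng = random.Random(0)
    labels = [0] * 90 + [1] * 10
    sample = undersampled_bootstrap(labels, minority_label=1, rng=rng)
    print(len(sample), sum(labels[i] for i in sample))
```

    Each ensemble member would be trained on a different sample of this kind; drawing the cost pair at random per member corresponds to the "random cost vector" mentioned in the abstract.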

    Incremental Learning Method for Data with Delayed Labels

    Most research on machine learning tasks relies on the availability of true labels immediately after making a prediction. However, in many cases, the ground truth labels become available only after a non-negligible delay. In general, delayed labels create two problems. First, labelled data is insufficient because the label for each data chunk will be obtained multiple times. Second, there remains a problem of concept drift due to the long time span of the data. In this work, we propose a novel incremental ensemble learning method for settings where labels are delayed. First, we build a sliding time window to preserve the historical data. Then we train an adaptive classifier on the labelled data in the sliding time window. It is worth noting that we improve TrAdaBoost to expand the data of the latest moment when building the adaptive classifier; the improved version can correctly distinguish the types of misclassified source-domain samples. Finally, we integrate the various classifiers to make predictions. We apply our algorithms to synthetic and real credit scoring datasets. The experimental results indicate that our algorithms are superior in the delayed-label setting.
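
    The sliding-window bookkeeping described above can be sketched as follows: samples wait in a pending buffer until their labels arrive, and only the most recent labelled samples are kept for training. The class and method names are hypothetical, and the sketch covers only the windowing step, not the improved TrAdaBoost or the ensemble integration.

```python
from collections import deque


class DelayedLabelWindow:
    """Keep the most recent labelled samples when labels arrive late."""

    def __init__(self, max_size):
        # sample_id -> features, for samples still waiting for a label.
        self.pending = {}
        # Bounded window: the oldest labelled sample is dropped when full.
        self.window = deque(maxlen=max_size)

    def observe(self, sample_id, features):
        """Record a sample at prediction time, before its label is known."""
        self.pending[sample_id] = features

    def label_arrived(self, sample_id, label):
        """Move a sample into the training window once its label arrives."""
        features = self.pending.pop(sample_id, None)
        if features is not None:
            self.window.append((features, label))

    def training_data(self):
        """Return the labelled samples currently inside the window."""
        return list(self.window)


if __name__ == "__main__":
    w = DelayedLabelWindow(max_size=2)
    for i in range(3):
        w.observe(i, [float(i)])
    for i in range(3):
        w.label_arrived(i, i % 2)
    print(w.training_data())
```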

    Essentials of Business Analytics


    Improving decision tree and neural network learning for evolving data-streams

    High-throughput real-time Big Data stream processing requires fast incremental algorithms that keep models consistent with the most recent data. In this scenario, Hoeffding Trees are considered the state-of-the-art single classifier for processing data streams, and they are widely used in ensemble combinations. This thesis is devoted to improving the performance of machine learning/artificial intelligence algorithms on evolving data streams. In particular, we focus on improving the Hoeffding Tree classifier and its ensemble combinations in order to reduce resource consumption and response-time latency, achieving better throughput when processing evolving data streams. First, this thesis presents a study on using Neural Networks (NN) as an alternative method for processing data streams. The use of random features for improving NN training speed is proposed, and important issues about the use of NNs in a data stream setup are highlighted. These issues motivated this thesis to go in the direction of improving the current state-of-the-art methods: Hoeffding Trees and their ensemble combinations. Second, this thesis proposes the Echo State Hoeffding Tree (ESHT) as an extension of the Hoeffding Tree to model the time dependencies typically present in data streams. The capabilities of the newly proposed architecture are evaluated on both regression and classification problems. Third, a new methodology to improve the Adaptive Random Forest (ARF) is developed. ARF has been introduced recently, and it is considered the state-of-the-art classifier in the MOA framework (a popular framework for processing evolving data streams). This thesis proposes the Elastic Swap Random Forest, an extension to ARF that reduces the number of base learners in the ensemble to one third on average, while providing accuracy similar to that of the standard ARF with 100 trees. Finally, the last contribution is a multi-threaded, high-performance, scalable ensemble design that is highly adaptable to a variety of hardware platforms, ranging from server-class to edge computing. The proposed design achieves throughput improvements of 85x (Intel i7), 143x (Intel Xeon parsing from memory), 10x (Jetson TX1, ARM) and 23x (X-Gene2, ARM) compared to single-threaded MOA on i7. In addition, the proposal achieves 75% parallel efficiency when using 24 cores on the Intel Xeon.
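
    For readers unfamiliar with Hoeffding Trees, the split test that gives them their name can be sketched as below. The abstract does not spell it out; this is the standard Hoeffding bound, under which a leaf is split once the gap between the two best split candidates exceeds epsilon. The function names are hypothetical, and tie-breaking and grace-period details used by real implementations such as MOA are omitted.

```python
import math


def hoeffding_bound(value_range, delta, n):
    """Epsilon such that the observed mean of n samples is within epsilon
    of the true mean with probability at least 1 - delta."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def should_split(best_gain, second_best_gain, value_range, delta, n):
    """Split when the best attribute beats the runner-up by more than epsilon."""
    return (best_gain - second_best_gain) > hoeffding_bound(value_range, delta, n)


if __name__ == "__main__":
    # For information gain on a two-class problem the range is log2(2) = 1.
    for n in (100, 1000, 10000):
        print(n, should_split(0.30, 0.25, value_range=1.0, delta=1e-7, n=n))
```

    Because epsilon shrinks as more examples reach a leaf, the same gain gap that is inconclusive early on eventually becomes sufficient to split, which is what lets the tree grow incrementally from a stream.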