
    Will it fail and why? A large case study of company default prediction with highly interpretable machine learning models

    Get PDF
    Finding a model to predict the default of a firm is a well-known topic in the financial and data science communities. The default prediction problem has been studied for over fifty years, but it remains a very hard task even today. Since it retains remarkable practical relevance, we direct our efforts toward obtaining the best possible prediction results, also in comparison with the reference literature. In our work we use three large and important datasets in combination in order to investigate both bankruptcy and bank default, a state of difficulty for companies that often anticipates actual bankruptcy. We combine one dataset from the Italian Central Credit Register of the Bank of Italy, one containing balance sheet information on Italian firms, and information from the AnaCredit dataset, a novel source of credit data from the European Central Bank. We try to go beyond the academic study and show how our model, based on some promising machine learning algorithms, outperforms the current default predictions made by credit institutions. At the same time, we try to provide insight into the reasons that lead to a particular outcome. In fact, many modern approaches try to find well-performing models to forecast the default of a company; those models often act like a black box and do not give financial institutions the fundamental explanations they need for their decisions. This project aims to find a robust predictive model using a tree-based machine learning algorithm which, flanked by a game-theoretic approach, can provide sound explanations of the model's output. Finally, we dedicate a special effort to the analysis of predictions in highly unbalanced contexts. Imbalanced classes are a common problem in machine learning classification and are typically addressed by removing the imbalance in the training set. We conjecture that this is not always the best choice and propose the use of a slightly unbalanced training set, showing that this approach helps maximize performance.
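    A minimal sketch of how such a pipeline could look, assuming XGBoost as the tree-based learner and SHAP as the game-theoretic explainer (the abstract does not name its exact libraries), with synthetic data and an illustrative 30% minority share standing in for the "slightly unbalanced" training set:

    ```python
    import numpy as np
    import shap
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from xgboost import XGBClassifier

    # Synthetic stand-in for the credit data: roughly 3% of firms default.
    X, y = make_classification(n_samples=20_000, n_features=20,
                               weights=[0.97, 0.03], random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

    # "Slightly unbalanced" training set: keep every default, undersample the
    # majority class so defaults make up ~30% (an illustrative ratio, not 50/50).
    rng = np.random.default_rng(0)
    pos = np.flatnonzero(y_tr == 1)
    neg = rng.choice(np.flatnonzero(y_tr == 0), size=len(pos) * 7 // 3, replace=False)
    idx = np.concatenate([pos, neg])

    model = XGBClassifier(n_estimators=300, max_depth=4, eval_metric="auc")
    model.fit(X_tr[idx], y_tr[idx])

    # Game-theoretic explanations: Shapley values attribute each prediction
    # to the input features.
    shap_values = shap.TreeExplainer(model).shap_values(X_te[:100])
    print("mean |SHAP| per feature:", np.abs(shap_values).mean(axis=0).round(3))
    ```

    Keeping some residual imbalance, rather than rebalancing to 50/50, preserves more of the true class prior, which is the conjecture the abstract puts forward.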

    Predicting Credit Ratings using Deep Learning Models – An Analysis of the Indian IT Industry

    Get PDF
    Due to the complexity of transactions and the availability of Big Data, many banks and financial institutions are reviewing their business models. Determining creditworthiness involves various tasks, such as working with spreadsheets and manually gathering data from customers and corporations. In this research paper, we aim to automate and analyze the credit ratings of the Information Technology industry in India. Various Deep Learning models are incorporated to predict the credit rankings from highest to lowest separately for each company, in order to find the best-fitting model. Factors such as Share Capital, Depreciation & Amortisation, Intangible Assets, Operating Margin, and inventory valuation are the parameters that contribute to the credit rating predictions. The data collected for the study spans FY-2015 to FY-2020. Among the Deep Learning models whose efficiency was tested and compared, the MLP achieved the highest efficiency for this prediction task. This research contributes to identifying how the ratings of several IT companies in India can be predicted from their Financial risk, Business risk, Industrial risk, and Macroeconomic environment using various neural network models for better accuracy. It also helps us understand the significance of Artificial Neural Networks in credit rating prediction using unstructured, real-time financial data reflecting the influence of COVID-19 on the Indian IT industry.
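    For illustration only, a compact sketch of an MLP rating classifier on tabular financial indicators; the synthetic features, the five rating buckets and the scikit-learn MLPClassifier are assumptions, since the abstract does not publish the architecture:

    ```python
    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.neural_network import MLPClassifier
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    rng = np.random.default_rng(42)
    X = rng.normal(size=(600, 8))      # 8 financial indicators per firm-year
    y = rng.integers(0, 5, size=600)   # 5 ordinal rating buckets (toy labels)

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)
    clf = make_pipeline(
        StandardScaler(),              # feature scaling matters for MLP training
        MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500, random_state=42),
    )
    clf.fit(X_tr, y_tr)
    print("held-out accuracy:", clf.score(X_te, y_te))
    ```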

    Online hate speech detection using Machine Learning

    Get PDF
    Bachelor's thesis in Computer Engineering, Facultad de Informática UCM, Departamento de Ingeniería del Software e Inteligencia Artificial, academic year 2021/2022. Public project repository: https://github.com/NILGroup/TFG-2122HateSpeechDetection. Hate speech directed towards marginalized people is a very common problem online, especially in social media such as Twitter or Reddit. Automatically detecting hate speech in such spaces can help mend the Internet and transform it into a safer environment for everybody. Hate speech detection fits into text classification, a family of tasks where text is organized into categories. This project proposes using Machine Learning algorithms to detect hate speech in online text in four languages: English, Spanish, Italian and Portuguese. The data to train the models was obtained from online, publicly available datasets. Three different algorithms with varying parameters have been used in order to compare their performance. The experiments show that the best results reach an 82.51% accuracy and around an 83% F1-score, for Italian text. Results for each language vary depending on distinct factors.
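    The abstract does not name the three algorithms used, so as a hedged sketch of the general text-classification setup, the snippet below pairs TF-IDF features with a logistic-regression classifier on a toy corpus:

    ```python
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    # Toy corpus; the real models were trained per language on public datasets.
    train_texts = [
        "you are all wonderful people",
        "I hate them, they ruin everything",
        "what a lovely community this is",
        "they should all disappear",
    ]
    train_labels = [0, 1, 0, 1]  # 0 = non-hateful, 1 = hateful

    clf = make_pipeline(
        TfidfVectorizer(ngram_range=(1, 2)),  # unigram and bigram features
        LogisticRegression(max_iter=1000),
    )
    clf.fit(train_texts, train_labels)
    print(clf.predict(["they ruin everything"]))
    ```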

    Prediction of activity and selectivity profiles of human Carbonic Anhydrase inhibitors using machine learning classification models

    Get PDF
    The development of selective inhibitors of the clinically relevant human Carbonic Anhydrase (hCA) isoforms IX and XII has become a major topic in drug research, due to their deregulation in several types of cancer. Indeed, the selective inhibition of these two isoforms, especially with respect to the homeostatic isoform II, holds great promise for developing anticancer drugs with limited side effects. Therefore, the development of in silico models able to predict the activity and selectivity against the desired isoform(s) is of central interest. In this work, we have developed a series of machine learning classification models, trained on high-confidence data extracted from ChEMBL, able to predict the activity and selectivity profiles of ligands for human Carbonic Anhydrase isoforms II, IX and XII. The training datasets were built with a procedure that made use of flexible bioactivity thresholds to obtain well-balanced active and inactive classes. We used multiple algorithms and sampling sizes and finally selected activity models able to classify molecules as active or inactive with excellent performance. Remarkably, the results reported herein turned out to be better than those obtained by models built with the classic approach of selecting an a priori activity threshold. The sequential application of such validated models enables virtual screening to be performed in a faster and more reliable way to predict the activity and selectivity profiles against the investigated isoforms.
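    As a purely illustrative sketch (the descriptors and algorithms are not specified in the abstract), one common way to build such a ligand activity classifier is Morgan fingerprints fed into a random forest; the SMILES strings and labels below are toy placeholders for the curated ChEMBL data:

    ```python
    import numpy as np
    from rdkit import Chem, DataStructs
    from rdkit.Chem import AllChem
    from sklearn.ensemble import RandomForestClassifier

    # Toy sulfonamide-flavoured examples; real labels come from curated ChEMBL data.
    smiles = ["CC(=O)Nc1ccc(S(N)(=O)=O)cc1", "c1ccccc1", "NS(=O)(=O)c1ccccc1"]
    labels = [1, 0, 1]  # 1 = active on the target isoform, 0 = inactive

    def featurize(smi):
        """Morgan fingerprint (radius 2, 2048 bits) as a numpy vector."""
        mol = Chem.MolFromSmiles(smi)
        fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048)
        arr = np.zeros(2048)
        DataStructs.ConvertToNumpyArray(fp, arr)
        return arr

    X = np.array([featurize(s) for s in smiles])
    clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, labels)
    # Selectivity profiling would chain one such model per isoform (II, IX, XII).
    print(clf.predict(X))
    ```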

    New PCA-based Category Encoder for Cybersecurity and Processing Data in IoT Devices

    Full text link
    Increasing the cardinality of categorical variables might decrease the overall performance of machine learning (ML) algorithms. This paper presents a novel computational preprocessing method to convert categorical variables to numerical variables for ML algorithms. It uses a supervised binary classifier to extract additional context-related features from the categorical values. Up to two numerical variables per categorical variable are created, depending on the compression achieved by Principal Component Analysis (PCA). The method requires two hyperparameters: a threshold related to the distribution of categories in the variables, and the PCA representativeness. This paper applies the proposed approach to the well-known cybersecurity NSL-KDD dataset to select and convert three categorical features to numerical features. After choosing the threshold parameter, we use conditional probabilities to convert the three categorical variables into six new numerical variables. We then feed these numerical variables to the PCA algorithm and retain all or a subset of the Principal Components (PCs). Finally, by applying binary classification with ten different classifiers, we measure the performance of the new encoder and compare it with 17 other well-known category encoders. The new technique achieves the highest performance in terms of accuracy and Area Under the Curve (AUC) on high-cardinality categorical variables. We also define a harmonic-average metric to find the best trade-off between train and test performance and to prevent underfitting and overfitting. Ultimately, the number of newly created numerical variables is minimal. This data reduction improves computational processing time in Internet of Things (IoT) devices in future telecommunication networks. Comment: 6 pages, 4 figures, 5 tables
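    A rough sketch of the encoder's core idea, under one reading of the abstract: map each category to conditional probabilities of the binary target (two numeric columns per variable), then compress with PCA up to the chosen representativeness. The toy data and column names are hypothetical:

    ```python
    import numpy as np
    import pandas as pd
    from sklearn.decomposition import PCA

    # Toy flow-record table with a binary attack label, loosely NSL-KDD-like.
    df = pd.DataFrame({
        "service": ["http", "ftp", "http", "smtp", "ftp", "http"],
        "flag":    ["SF",   "S0",  "SF",   "REJ",  "SF",  "S0"],
        "attack":  [0, 1, 0, 1, 0, 1],
    })

    num_cols = []
    for col in ["service", "flag"]:
        # P(attack = 1 | category) and its complement as two numeric columns.
        p1 = df.groupby(col)["attack"].mean()
        df[f"{col}_p1"] = df[col].map(p1)
        df[f"{col}_p0"] = 1.0 - df[f"{col}_p1"]
        num_cols += [f"{col}_p1", f"{col}_p0"]

    # Retain only as many PCs as the chosen representativeness level requires;
    # the 95% threshold below is a tunable hyperparameter.
    pca = PCA(n_components=0.95)
    encoded = pca.fit_transform(df[num_cols])
    print(encoded.shape)
    ```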

    Efficient embedded sleep wake classification for open-source actigraphy

    Get PDF
    This study presents a thorough analysis of sleep/wake detection algorithms for efficient on-device sleep tracking using wearable accelerometric devices. It develops a novel end-to-end algorithm using a convolutional neural network applied to raw accelerometric signals recorded by an open-source wrist-worn actigraph. The aim of the study is to develop an automatic classifier that: (1) is highly generalizable to heterogeneous subjects, (2) does not require manual feature extraction, (3) is computationally lightweight and embeddable on a sleep tracking device, and (4) is suitable for a wide assortment of actigraphs. The authors then analyze sleep parameters, such as total sleep time, wake after sleep onset and sleep efficiency, by comparing the outcomes of the proposed algorithm to gold-standard concurrent polysomnographic recordings. The substantial agreement (median Cohen's kappa coefficient of 0.78 ± 0.07) and the low computational cost (2727 floating-point operations) make this solution suitable for an on-board sleep-detection approach.
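    A minimal sketch of what such a lightweight end-to-end classifier might look like; the 30-second, 50 Hz input window and the layer sizes are assumptions chosen to keep the operation count small, not the published architecture:

    ```python
    import torch
    import torch.nn as nn

    class SleepWakeCNN(nn.Module):
        """Tiny 1-D CNN over raw tri-axial accelerometer windows."""
        def __init__(self):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv1d(3, 8, kernel_size=7, stride=2), nn.ReLU(),   # 3 axes in
                nn.Conv1d(8, 16, kernel_size=5, stride=2), nn.ReLU(),
                nn.AdaptiveAvgPool1d(1), nn.Flatten(),
                nn.Linear(16, 2),                        # sleep/wake logits
            )

        def forward(self, x):  # x: (batch, 3, 1500), i.e. 30 s at 50 Hz
            return self.net(x)

    model = SleepWakeCNN()
    logits = model(torch.randn(4, 3, 1500))
    print(logits.shape)  # torch.Size([4, 2])
    ```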

    Explainable AI for Interpretable Credit Scoring

    Full text link
    With the ever-growing achievements in Artificial Intelligence (AI) and the recent boost in enthusiasm for Financial Technology (FinTech), applications such as credit scoring have gained substantial academic interest. Credit scoring helps financial experts make better decisions regarding whether or not to accept a loan application, such that loans with a high probability of default are not accepted. Apart from the noisy and highly imbalanced data challenges faced by such credit scoring models, recent regulations such as the `right to explanation' introduced by the General Data Protection Regulation (GDPR) and the Equal Credit Opportunity Act (ECOA) have added the need for model interpretability, to ensure that algorithmic decisions are understandable and coherent. An interesting concept that has been recently introduced is eXplainable AI (XAI), which focuses on making black-box models more interpretable. In this work, we present a credit scoring model that is both accurate and interpretable. For classification, state-of-the-art performance on the Home Equity Line of Credit (HELOC) and Lending Club (LC) datasets is achieved using the Extreme Gradient Boosting (XGBoost) model. The model is then further enhanced with a 360-degree explanation framework, which provides the different explanations (i.e. global, local feature-based and local instance-based) that are required by different people in different situations. Evaluation through functionally-grounded, application-grounded and human-grounded analyses shows that the explanations provided are simple and consistent, and satisfy the six predetermined hypotheses testing for correctness, effectiveness, easy understanding, detail sufficiency and trustworthiness. Comment: 19 pages, David C. Wyld et al. (Eds): ACITY, DPPR, VLSI, WeST, DSA, CNDC, IoTE, AIAA, NLPTA - 202
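    To make the global-versus-local distinction concrete, here is a hedged sketch on synthetic data: XGBoost supplies a global feature-importance view, while per-applicant Shapley values (one plausible realization of the local, feature-based explanations) indicate why a single application was scored as it was:

    ```python
    import numpy as np
    import shap
    from sklearn.datasets import make_classification
    from xgboost import XGBClassifier

    # Synthetic stand-in for HELOC / Lending Club applicant features.
    X, y = make_classification(n_samples=5_000, n_features=10, random_state=1)
    model = XGBClassifier(n_estimators=200, max_depth=3).fit(X, y)

    # Global explanation: which features drive the model overall?
    print("global importances:", model.feature_importances_.round(3))

    # Local explanation: why was applicant 0 scored this way?
    local = shap.TreeExplainer(model).shap_values(X[:1])
    print("local attributions for applicant 0:", np.round(local, 3))
    ```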