Will it fail and why? A large case study of company default prediction with highly interpretable machine learning models
Finding a model to predict the default of a firm is a well-known topic in the financial and data science communities. The default prediction problem has been studied for over fifty years but remains a very hard task even today. Since it retains remarkable practical relevance, we direct our efforts toward obtaining the best possible prediction results, also in comparison with the reference literature. In our work we combine three large and important datasets in order to investigate both bankruptcy and bank default: a state of difficulty for companies that often anticipates actual bankruptcy. We combine one dataset from the Italian Central Credit Register of the Bank of Italy, one containing balance sheet information on Italian firms, and information from the AnaCredit dataset, a novel source of credit data from the European Central Bank. We go beyond a purely academic study and show how our model, based on some promising machine learning algorithms, outperforms the current default predictions made by credit institutions. At the same time, we provide insights into the reasons that lead to a particular outcome. Many modern approaches seek well-performing models to forecast the default of a company, but those models often act like black boxes and do not give financial institutions the fundamental explanations they need for their decisions. This project aims to build a robust predictive model using a tree-based machine learning algorithm which, flanked by a game-theoretic approach, can provide sound explanations of the model's output. Finally, we dedicate special effort to the analysis of predictions in highly unbalanced contexts. Imbalanced classes are a common problem in machine learning classification that is typically addressed by removing the imbalance from the training set. We conjecture that this is not always the best choice and propose the use of a slightly unbalanced training set, showing that this approach helps maximize performance.
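The proposed slightly unbalanced training set amounts to a resampling step that stops short of a perfect 50/50 split. A minimal sketch, where the `minority_share` of 0.4 and all names are illustrative assumptions rather than the paper's actual settings:

```python
# Sketch: downsample the majority class to a *slightly* unbalanced ratio
# instead of a fully balanced one. minority_share=0.5 would be the usual
# balanced split; 0.4 keeps a mild majority of non-defaulting firms.
import random

def slightly_unbalanced_sample(majority, minority, minority_share=0.4, seed=0):
    """Return (sample, label) pairs where the minority class makes up
    `minority_share` of the training set."""
    rng = random.Random(seed)
    # Majority samples needed so the minority reaches the requested share
    n_major = round(len(minority) * (1 - minority_share) / minority_share)
    n_major = min(n_major, len(majority))
    sampled = rng.sample(majority, n_major)
    data = [(x, 0) for x in sampled] + [(x, 1) for x in minority]
    rng.shuffle(data)
    return data

# Usage: 1000 non-defaulting firms, 50 defaulting firms
train = slightly_unbalanced_sample(list(range(1000)), list(range(50)))
# the minority class now makes up ~40% of the training set instead of 50%
```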
Predicting Credit Ratings using Deep Learning Models – An Analysis of the Indian IT Industry
Due to the complexity of transactions and the availability of Big Data, many banks and financial institutions are reviewing their business models. Determining creditworthiness involves various tasks, such as working with spreadsheets and manually gathering data from customers and corporations. In this research paper, we aim to automate and analyze the credit ratings of the information technology industry in India. Various deep learning models are used to predict credit rankings, from highest to lowest, separately for each company in order to find the best-fitting model. Factors such as share capital, depreciation and amortisation, intangible assets, operating margin, and inventory valuation are the parameters that contribute to the credit rating predictions. The data collected for the study spans the years FY2015 to FY2020. After testing and comparing the efficiencies of the different deep learning models, the MLP achieved the highest efficiency. This research contributes to identifying how the ratings of several IT companies in India can be predicted based on their financial risk, business risk, industrial risk, and macroeconomic environment using various neural network models for better accuracy. It also helps us understand the significance of artificial neural networks in credit rating prediction using unstructured, real-time financial data that reflects the influence of COVID-19 on the Indian IT industry.
Online hate speech detection using Machine Learning
Bachelor's thesis (Trabajo de Fin de Grado) in Computer Engineering, Facultad de Informática UCM, Departamento de Ingeniería del Software e Inteligencia Artificial, academic year 2021/2022. Public project repository: https://github.com/NILGroup/TFG-2122HateSpeechDetection
Hate speech directed towards marginalized people is a very common problem online, especially on social media such as Twitter or Reddit. Automatically detecting hate speech in such spaces can help mend the Internet and transform it into a safer environment for everybody. Hate speech detection fits into text classification, a family of tasks in which text is organized into categories. This project proposes using machine learning algorithms to detect hate speech in online text in four languages: English, Spanish, Italian and Portuguese. The data used to train the models was obtained from publicly available online datasets. Three different algorithms with varying parameters were used in order to compare their performance. The experiments show that the best results reach 82.51% accuracy and around an 83% F1-score, for Italian text. Results for each language vary depending on distinct factors.
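Before any of the compared algorithms can run, each message must be turned into a feature vector. A minimal bag-of-words vectorizer sketches this preprocessing step; the function name and toy texts are illustrative, not taken from the project:

```python
# Sketch: a minimal bag-of-words vectorizer for text classification.
# Each text becomes a vector of word counts over a shared vocabulary.
from collections import Counter

def bag_of_words(texts):
    """Return (vocabulary, count-vectors) for a list of texts."""
    vocab = sorted({w for t in texts for w in t.lower().split()})
    index = {w: i for i, w in enumerate(vocab)}
    rows = []
    for t in texts:
        vec = [0] * len(vocab)
        for w, c in Counter(t.lower().split()).items():
            vec[index[w]] = c
        rows.append(vec)
    return vocab, rows

vocab, X = bag_of_words(["you are awful", "have a nice day"])
# X can then be fed to any classifier being compared
```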
Prediction of activity and selectivity profiles of human Carbonic Anhydrase inhibitors using machine learning classification models
The development of selective inhibitors of the clinically relevant human Carbonic Anhydrase (hCA) isoforms IX and XII has become a major topic in drug research, due to their deregulation in several types of cancer. Indeed, the selective inhibition of these two isoforms, especially with respect to the homeostatic isoform II, holds great promise for developing anticancer drugs with limited side effects. Therefore, the development of in silico models able to predict activity and selectivity against the desired isoform(s) is of central interest. In this work, we have developed a series of machine learning classification models, trained on high-confidence data extracted from ChEMBL, able to predict the activity and selectivity profiles of ligands for human Carbonic Anhydrase isoforms II, IX and XII. The training datasets were built with a procedure that used flexible bioactivity thresholds to obtain well-balanced active and inactive classes. We used multiple algorithms and sampling sizes to finally select activity models able to classify active or inactive molecules with excellent performance. Remarkably, the results reported herein turned out to be better than those obtained by models built with the classic approach of selecting an a priori activity threshold. The sequential application of these validated models enables fast and more reliable virtual screening to predict activity and selectivity profiles against the investigated isoforms.
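The flexible-threshold idea can be sketched in a few lines: instead of a fixed a-priori cutoff, the activity threshold is chosen per target so that the active/inactive classes come out balanced. Using the median bioactivity value as the cutoff is an illustrative assumption, not necessarily the paper's exact procedure:

```python
# Sketch: pick a per-target activity threshold that balances the classes,
# rather than a fixed a-priori cutoff. Here the threshold is simply the
# median bioactivity (e.g. pChEMBL) value - an illustrative choice.
import statistics

def flexible_threshold_split(bioactivities):
    """Label each molecule active (1) or inactive (0) around the median."""
    threshold = statistics.median(bioactivities)
    labels = [1 if v >= threshold else 0 for v in bioactivities]
    return threshold, labels

# Usage with toy pChEMBL values for one isoform
threshold, labels = flexible_threshold_split([5.1, 6.3, 7.8, 8.2, 6.9, 7.1])
# roughly half the molecules end up in each class by construction
```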
Predictive Modelling for Loan Defaults
In this paper we explore how predictive modelling can be applied to loan default prediction. Predicting whether a loan will be fully paid or defaulted is a binary classification problem. We explore the use of different machine learning models and their performance, namely logistic regression, random forest, neural network, extreme gradient boosting, and an ensemble. Additionally, as is the case with much industry data, class imbalance is an issue: the data cannot be used as-is, or the model will suffer from bias. To solve this issue, we explore the use of sampling techniques, such as SMOTE and ADASYN, and cost-sensitive learning techniques, such as class weights. Finally, using precision, recall, G-mean, and F-measure, as well as the precision-recall curve AUC, to examine the results of each model, we found that no balancing method is consistently superior. While all models performed well after applying a balancing method, XGBoost with class weights performed best. With a robust model, there are potential opportunities to leverage it to optimize profits and produce a greater return on investment. Using the best model, return on investment was improved by 83%.
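The cost-sensitive alternative to resampling mentioned above can be sketched with the common "balanced" class-weight heuristic, where each class's weight is inversely proportional to its frequency. A minimal version (names illustrative):

```python
# Sketch: cost-sensitive class weights, inversely proportional to class
# frequency - the "balanced" heuristic used by many ML libraries.
from collections import Counter

def balanced_class_weights(labels):
    """weight[c] = n_samples / (n_classes * count[c])."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * cnt) for c, cnt in counts.items()}

# Usage: 90 fully-paid loans (0) vs 10 defaults (1)
weights = balanced_class_weights([0] * 90 + [1] * 10)
# misclassifying a default is weighted 9x more than a fully-paid loan
```

These weights are then passed to the learner (e.g. as per-sample weights during training) so the loss penalizes errors on the rare class more heavily, without discarding or duplicating any data.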
New PCA-based Category Encoder for Cybersecurity and Processing Data in IoT Devices
Increasing the cardinality of categorical variables might decrease the overall performance of machine learning (ML) algorithms. This paper presents a novel computational preprocessing method to convert categorical variables to numerical variables for ML algorithms. It uses a supervised binary classifier to extract additional context-related features from the categorical values. Up to two numerical variables per categorical variable are created, depending on the compression achieved by Principal Component Analysis (PCA). The method requires two hyperparameters: a threshold related to the distribution of categories in the variables, and the PCA representativeness. This paper applies the proposed approach to the well-known cybersecurity NSL-KDD dataset to select and convert three categorical features to numerical features. After choosing the threshold parameter, we use conditional probabilities to convert the three categorical variables into six new numerical variables. We then feed these numerical variables to the PCA algorithm and select all or some of the Principal Components (PCs). Finally, by applying binary classification with ten different classifiers, we measure the performance of the new encoder and compare it with 17 other well-known category encoders. The new technique achieves the highest accuracy and Area Under the Curve (AUC) on high-cardinality categorical variables. We also define harmonic average metrics to find the best trade-off between train and test performance and to prevent underfitting and overfitting. Ultimately, the number of newly created numerical variables is minimal. This data reduction improves computational processing time on Internet of Things (IoT) devices in future telecommunication networks.
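The two-step pipeline described above can be sketched as follows: each category is first replaced by class-conditional probabilities estimated against a binary target, and the resulting columns are then compressed with PCA, keeping only enough components to reach the representativeness threshold. All names and the toy data are illustrative assumptions, not the paper's exact implementation:

```python
# Sketch: (1) encode each category as (P(y=0|c), P(y=1|c)) estimated from
# the training labels, (2) compress the columns with PCA, keeping the
# fewest components that reach the requested representativeness.
import numpy as np

def conditional_prob_encode(cats, target):
    """Map each category c to two numeric columns (P(y=0|c), P(y=1|c))."""
    cats, target = np.asarray(cats), np.asarray(target)
    table = {}
    for c in np.unique(cats):
        p1 = target[cats == c].mean()
        table[c] = (1.0 - p1, p1)
    return np.array([table[c] for c in cats])

def pca_compress(X, representativeness=0.95):
    """Keep the fewest principal components whose cumulative explained
    variance reaches `representativeness` (the paper's hyperparameter)."""
    Xc = X - X.mean(axis=0)
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained = s**2 / (s**2).sum()
    k = int(np.searchsorted(np.cumsum(explained), representativeness) + 1)
    return Xc @ Vt[:k].T  # shape: (n_samples, k)

# Usage: encode a toy protocol column against a binary attack label
cats = ["tcp", "udp", "tcp", "icmp", "udp", "tcp"]
y = [1, 0, 1, 0, 0, 1]
Z = pca_compress(conditional_prob_encode(cats, y))
# here the two probability columns sum to 1, so PCA keeps one component
```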
Efficient embedded sleep wake classification for open-source actigraphy
This study presents a thorough analysis of sleep/wake detection algorithms for efficient on-device sleep tracking using wearable accelerometric devices. It develops a novel end-to-end algorithm that applies a convolutional neural network to raw accelerometric signals recorded by an open-source wrist-worn actigraph. The aim of the study is to develop an automatic classifier that: (1) is highly generalizable to heterogeneous subjects, (2) does not require manual feature extraction, (3) is computationally lightweight enough to be embedded on a sleep-tracking device, and (4) is suitable for a wide assortment of actigraphs. The authors analyze sleep parameters, such as total sleep time, wake after sleep onset, and sleep efficiency, by comparing the outcomes of the proposed algorithm to gold-standard concurrent polysomnographic recordings. The substantial agreement (median Cohen's kappa coefficient of 0.78 ± 0.07) and the low computational cost (2,727 floating-point operations) make this solution suitable for an on-board sleep-detection approach.
Explainable AI for Interpretable Credit Scoring
With the ever-growing achievements in Artificial Intelligence (AI) and the recent boost in enthusiasm for Financial Technology (FinTech), applications such as credit scoring have gained substantial academic interest. Credit scoring helps financial experts make better decisions about whether or not to accept a loan application, so that loans with a high probability of default are not accepted. Apart from the noisy and highly imbalanced data challenges faced by such credit scoring models, recent regulations such as the `right to explanation' introduced by the General Data Protection Regulation (GDPR) and the Equal Credit Opportunity Act (ECOA) have added the need for model interpretability, to ensure that algorithmic decisions are understandable and coherent. An interesting concept that has recently been introduced is eXplainable AI (XAI), which focuses on making black-box models more interpretable. In this work, we present a credit scoring model that is both accurate and interpretable. For classification, state-of-the-art performance on the Home Equity Line of Credit (HELOC) and Lending Club (LC) datasets is achieved using the Extreme Gradient Boosting (XGBoost) model. The model is then further enhanced with a 360-degree explanation framework, which provides the different explanations (i.e. global, local feature-based, and local instance-based) required by different people in different situations. Evaluation through functionally-grounded, application-grounded, and human-grounded analysis shows that the explanations provided are simple and consistent, and satisfy the six predetermined hypotheses testing for correctness, effectiveness, easy understanding, detail sufficiency, and trustworthiness.
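One model-agnostic way to obtain the kind of global explanation such a framework provides is permutation feature importance: shuffle one feature's values and measure how much accuracy drops. The sketch below illustrates that idea only; it is not the specific explanation method used in the paper, and the toy scorer and feature names are assumptions:

```python
# Sketch: permutation feature importance as a model-agnostic *global*
# explanation - a larger accuracy drop after shuffling a feature means
# the model relies on it more.
import random

def permutation_importance(predict, X, y, n_features, seed=0):
    rng = random.Random(seed)
    def accuracy(rows):
        return sum(predict(r) == t for r, t in zip(rows, y)) / len(y)
    base = accuracy(X)
    drops = []
    for j in range(n_features):
        col = [row[j] for row in X]
        rng.shuffle(col)  # break the feature/target association
        permuted = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, col)]
        drops.append(base - accuracy(permuted))
    return drops

# Usage with a toy rule-based scorer on [utilization, income] rows
predict = lambda r: 1 if r[0] > 0.8 else 0  # default iff high utilization
X = [[0.9, 40], [0.2, 55], [0.85, 30], [0.1, 70]]
y = [1, 0, 1, 0]
imps = permutation_importance(predict, X, y, n_features=2)
# feature 0 (utilization) dominates; feature 1 is ignored by this scorer
```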