
    Visualizing Deep Networks by Optimizing with Integrated Gradients

    Understanding and interpreting the decisions made by deep learning models is valuable in many domains. In computer vision, computing heatmaps from a deep network is a popular approach for visualizing and understanding deep networks. However, heatmaps that do not correlate with the network may mislead humans, so how faithfully a heatmap explains the underlying deep network is crucial. In this paper, we propose I-GOS, which optimizes for a heatmap so that the classification score on the masked image is maximally decreased. The main novelty of the approach is to compute descent directions based on integrated gradients instead of the ordinary gradient, which avoids local optima and speeds up convergence. Compared with previous approaches, our method can flexibly compute heatmaps at any resolution for different user needs. Extensive experiments on several benchmark datasets show that the heatmaps produced by our approach are more correlated with the decision of the underlying deep network than those of other state-of-the-art approaches.
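
    As a concrete illustration of the descent direction described above, the PyTorch sketch below averages gradients of the class score with respect to the mask along the path from an all-zero mask to the current one. The function name, the assumption that the mask has the same shape as the image, and the omission of the paper's regularizers and line search are simplifications for illustration, not the authors' exact implementation.

import torch

def integrated_gradient_direction(model, image, baseline, mask, target_class, steps=20):
    """Average the gradient of the class score w.r.t. the mask along the straight
    path from an all-zero mask to the current mask (an integrated-gradients-style
    descent direction, rather than a single gradient at the current mask)."""
    total_grad = torch.zeros_like(mask)
    for alpha in torch.linspace(1.0 / steps, 1.0, steps):
        m = (alpha * mask).detach().requires_grad_(True)
        # Masked input: keep the image where m is near 1, blend to the baseline elsewhere.
        masked = m * image + (1.0 - m) * baseline
        score = model(masked.unsqueeze(0))[0, target_class]
        score.backward()
        total_grad += m.grad
    return total_grad / steps

    A full I-GOS update would combine this direction with smoothness and area penalties on the mask and a line search over the step size.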

    Boosting insights in insurance tariff plans with tree-based machine learning methods

    Pricing actuaries typically operate within the framework of generalized linear models (GLMs). With the upswing of data analytics, our study focuses on machine learning methods to develop full tariff plans built from both the frequency and the severity of claims. We adapt the loss functions used in the algorithms so that the specific characteristics of insurance data are carefully incorporated: highly unbalanced count data with excess zeros and varying exposure on the frequency side, combined with scarce but potentially long-tailed data on the severity side. A key requirement is the need for transparent and interpretable pricing models that are easily explainable to all stakeholders. We therefore focus on machine learning with decision trees: starting from simple regression trees, we work towards more advanced ensembles such as random forests and boosted trees. We show how to choose the optimal tuning parameters for these models in an elaborate cross-validation scheme, we present visualization tools to obtain insights from the resulting models, and we evaluate the economic value of these new modeling approaches. Boosted trees outperform the classical GLMs, allowing the insurer to form profitable portfolios and to guard against potential adverse risk selection.
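
    As a rough illustration of the loss-function adaptation described above, the sketch below fits a claim-frequency model with a Poisson splitting criterion and exposure weights using scikit-learn (a recent version supporting criterion="poisson" is assumed). The synthetic data and hyperparameters are placeholders, and the paper's own study also covers severity modeling, boosted trees and careful cross-validation, none of which is reproduced here.

import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic frequency data: claim counts with excess zeros and varying exposure (in years).
rng = np.random.default_rng(0)
X = rng.uniform(size=(1000, 4))              # rating factors (age, region, ...)
exposure = rng.uniform(0.1, 1.0, size=1000)  # fraction of the year the policy was in force
counts = rng.poisson(0.2 * exposure)         # mostly zeros, as in real claim data

# Model the claim rate (counts / exposure) with a Poisson splitting criterion and
# exposure weights, so the count nature of the data and the varying exposure are respected.
freq_model = RandomForestRegressor(criterion="poisson", n_estimators=200, min_samples_leaf=50)
freq_model.fit(X, counts / exposure, sample_weight=exposure)

expected_counts = freq_model.predict(X) * exposure  # expected number of claims per policy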

    Comment: Boosting Algorithms: Regularization, Prediction and Model Fitting

    The authors are doing the readers of Statistical Science a true service with a well-written and up-to-date overview of boosting that originated with the seminal algorithms of Freund and Schapire. Equally, we are grateful for high-level software that will permit a larger readership to experiment with, or simply apply, boosting-inspired model fitting. The authors show us a world of methodology that illustrates how a fundamental innovation can penetrate every nook and cranny of statistical thinking and practice. They introduce the reader to one particular interpretation of boosting and then give a display of its potential, with extensions from classification (where it all started) to least squares, exponential family models and survival analysis, to base-learners other than trees such as smoothing splines, to degrees of freedom and regularization, and to fascinating recent work in model selection. The uninitiated reader will find that the authors did a nice job of presenting a certain coherent and useful interpretation of boosting. The other reader, though, who has watched the business of boosting for a while, may have quibbles with the authors over details of the historical record and, more importantly, over their optimism about the current state of theoretical knowledge. In fact, as much as the statistical view has proven fruitful, it has also resulted in some ideas about why boosting works that may be misconceived, and in some recommendations that may be misguided.

    "Influence Sketching": Finding Influential Samples In Large-Scale Regressions

    There is an especially strong need in modern large-scale data analysis to prioritize samples for manual inspection. For example, the inspection could target important mislabeled samples or key vulnerabilities exploitable by an adversarial attack. In order to solve the "needle in the haystack" problem of which samples to inspect, we develop a new scalable version of Cook's distance, a classical statistical technique for identifying samples that have an unusually strong impact on the fit of a regression model (and its downstream predictions). In order to scale this technique up to very large and high-dimensional datasets, we introduce a new algorithm which we call "influence sketching." Influence sketching embeds random projections within the influence computation; in particular, the influence score is calculated using the randomly projected pseudo-dataset from the post-convergence Generalized Linear Model (GLM). We validate that influence sketching can reliably and successfully discover influential samples by applying the technique to a malware detection dataset of over 2 million executable files, each represented with almost 100,000 features. For example, we find that randomly deleting approximately 10% of training samples reduces predictive accuracy only slightly, from 99.47% to 99.45%, whereas deleting the same number of samples with high influence sketch scores reduces predictive accuracy all the way down to 90.24%. Moreover, we find that influential samples are especially likely to be mislabeled. In the case study, we manually inspect the most influential samples and find that influence sketching pointed us to new, previously unidentified pieces of malware.
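
    The sketch below illustrates the general idea in NumPy: approximate leverages are computed from a randomly projected weighted design matrix and combined with residuals into Cook's-distance-style scores. The projection dimension, the Gaussian sketch and the exact scaling are assumptions for illustration and do not reproduce the paper's algorithm.

import numpy as np

def influence_sketch_scores(X, residuals, weights, k=512, seed=0):
    """Cook's-distance-style influence scores computed from a randomly projected
    weighted design matrix instead of the full one, so that approximate leverages
    remain tractable for very wide datasets."""
    n, p = X.shape
    rng = np.random.default_rng(seed)
    Xw = np.sqrt(weights)[:, None] * X          # pseudo-design from the converged GLM
    R = rng.normal(size=(p, k)) / np.sqrt(k)    # Gaussian random projection to k dimensions
    Z = Xw @ R
    G_inv = np.linalg.pinv(Z.T @ Z)
    h = np.einsum("ij,jk,ik->i", Z, G_inv, Z)   # approximate leverages h_i
    h = np.clip(h, 0.0, 1.0 - 1e-8)
    # Large residual combined with large leverage => influential sample.
    return (residuals ** 2) * h / (1.0 - h) ** 2

    Sorting samples by these scores gives a shortlist of candidates for manual inspection.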

    Developing a prediction model for assessing the bankruptcy risk of Finnish SMEs

    Bankruptcy prediction is a subject of significant interest to both academics and practitioners because of its vast economic and societal impact. Academic research in the field is extensive and diverse; no consensus has formed regarding the superiority of different prediction methods or predictor variables. Most studies focus on large companies; small and medium-sized enterprises (SMEs) have received less attention, mainly due to data unavailability. Despite recent academic advances, simple statistical models are still favored in practical use, largely due to their understandability and interpretability. This study aims to construct a high-performing but user-friendly and interpretable bankruptcy prediction model for Finnish SMEs using financial statement data from 2008–2010. A literature review is conducted to explore the key aspects of bankruptcy prediction, and its findings are used to design an empirical study. Five prediction models are trained on different predictor subsets and training samples, and two models are chosen for detailed examination based on the findings. A prediction model using the random forest method, utilizing all available predictors and the unadjusted training data with its imbalance of bankrupt and non-bankrupt firms, is found to perform best. It outperforms a benchmark model on both key metrics and is deemed easy to use and interpretable; it is therefore recommended for practical application. Equity ratio and the ratio of financial expenses to total assets consistently rank as the two best predictors across models; otherwise the findings on predictor importance are mixed, but mainly in line with the prevalent views in the related literature. This study shows that constructing an accurate but practical bankruptcy prediction model is feasible, and it serves as a guideline for future scholars and practitioners seeking to do the same. Some further research avenues are recognized based on the empirical findings and the extant literature. In particular, this study raises an important question regarding the appropriateness of the most commonly used performance metrics in bankruptcy prediction, given the rarity of bankruptcy cases. The area under the precision-recall curve (PR AUC), which is widely used in other fields of study, is deemed a suitable alternative and is recommended for measuring model performance in future bankruptcy prediction studies. Keywords: bankruptcy prediction, credit risk, machine learning
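
    As a small illustration of the metric recommendation above, the following scikit-learn sketch contrasts ROC AUC and PR AUC (average precision) on a synthetic, heavily imbalanced classification task; the data and model settings are placeholders, not the thesis's Finnish SME data.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, heavily imbalanced data standing in for non-bankrupt (0) vs. bankrupt (1) firms.
X, y = make_classification(n_samples=5000, n_features=20, weights=[0.97, 0.03], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

model = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_tr, y_tr)
scores = model.predict_proba(X_te)[:, 1]

# ROC AUC can look flattering under heavy class imbalance; PR AUC (average precision)
# measures how well the rare bankrupt class is actually retrieved.
print("ROC AUC:", roc_auc_score(y_te, scores))
print("PR AUC :", average_precision_score(y_te, scores))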

    Strategic Financial Management and Nigeria’s Non-Oil Export Boost: Unit Root – Causality Treat

    Intermittent sub-optimal financial economies of Nigeria's non-oil sector are traceable to incoherent schematic booster commitments (SBCs). Although the oil sector remains dominant in the political economy of Nigeria, it has helped development matters necessarily but not sufficiently, particularly with respect to new enterprise creation and employment generation. To re-enact the complementary potency of other viable sectors, this study harnesses secondary data on Nigeria's non-oil exports, exchange rate, foreign exchange earnings, and gross domestic product from publications of the Central Bank of Nigeria (CBN) over a period of 25 years. The relevant time series are subjected to unit root, regression, and causality analyses. The results establish a significant relationship between gross domestic product and non-oil exports, complemented by the other predictor variables. Recent macroeconomic performance in this regard is impressive and ipso facto justifies every resolve to accord greater SBCs to non-oil commercial and industrial activities in the Nigerian economy. However, in line with the ideals of strategic financial management, efficient coordination of focal institutional initiatives and incentives, especially those of the Nigerian Export Promotion Council (NEPC) and the Nigerian Export-Import Bank (NEXIM), is critical for the boost to make a boast in the global economy. Keywords: export incentives, non-oil sector, strategic financial management
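
    For readers unfamiliar with the procedures mentioned, the sketch below runs an augmented Dickey-Fuller unit root test and a Granger causality test with statsmodels on synthetic annual series; the data, lag order and variable names are placeholders and do not reproduce the study's CBN data or specification.

import numpy as np
import pandas as pd
from statsmodels.tsa.stattools import adfuller, grangercausalitytests

# Synthetic annual series standing in for GDP and non-oil exports over 25 years.
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "gdp": np.cumsum(rng.normal(2.0, 1.0, 25)),
    "non_oil_exports": np.cumsum(rng.normal(1.0, 1.0, 25)),
})

# Unit root (ADF) test on each series in levels; non-stationary series are differenced.
for col in df:
    stat, pvalue, *_ = adfuller(df[col])
    print(f"ADF {col}: statistic={stat:.2f}, p-value={pvalue:.3f}")

# Granger causality on the differenced series: do lagged non-oil exports help predict GDP?
grangercausalitytests(df[["gdp", "non_oil_exports"]].diff().dropna(), maxlag=2)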

    Forecasting with Big Data: A Review

    Big Data is a revolutionary phenomenon, one of the most frequently discussed topics of the modern age, and it is expected to remain so for the foreseeable future. In this paper we present a comprehensive review of the use of Big Data for forecasting, identifying and reviewing the problems, potential, challenges and, most importantly, the related applications. Skills, hardware and software, algorithm architecture, statistical significance, the signal-to-noise ratio, and the nature of Big Data itself are identified as the major challenges hindering the process of obtaining meaningful forecasts from Big Data. The review finds that, at present, the fields of Economics, Energy and Population Dynamics have been the major exploiters of Big Data forecasting, while factor models, Bayesian models and neural networks are the most common tools adopted for forecasting with Big Data.

    Supervised learning with random forests

    Many real-life problems can be modeled as classification problems, such as the early detection of diseases or the decision to grant credit to a given individual. Supervised classification deals with this kind of problem: it learns from a sample with the ultimate goal of making inferences about future observations. A wide range of supervised classification techniques is available today. In this work we focus on random forests. Random Forests is a classification technique that builds a collection of individual decision trees into which randomness is injected in a particular way. The technique is known to perform well, even on the large-scale problems common today. However, there is a small gap between the theory surrounding this technique and the empirical experience with it. Random Forests is also useful in other areas of machine learning: it provides variable-importance measures, which can be used for feature selection, and a proximity matrix between observations, which allows the analyst to detect outliers, impute missing values, search for prototypes and obtain an interpretable visualization of the data. These properties make Random Forests an even more attractive technique. This work first gives a brief description of supervised classification, including the main validation techniques and the most relevant performance criteria. Second, the construction of a classification tree is explained in detail. Random Forests is then presented and its main properties are reviewed. Finally, experimental results in R are shown. Universidad de Sevilla. Máster Universitario en Matemáticas.
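
    A minimal scikit-learn sketch (in Python rather than the R used in the thesis) of the by-products highlighted above: out-of-bag validation, variable importances usable for feature selection, and pairwise proximities derived from shared leaf membership. The data set and hyperparameters are placeholders.

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

# Illustrative classification task; the thesis runs its experiments in R.
X, y = load_breast_cancer(return_X_y=True)
rf = RandomForestClassifier(n_estimators=500, oob_score=True, random_state=0).fit(X, y)

print("Out-of-bag accuracy:", rf.oob_score_)                 # built-in validation estimate
print("Most important variables:", np.argsort(rf.feature_importances_)[::-1][:5])

# Proximity of two observations: fraction of trees in which they end up in the same leaf.
# Proximities support outlier detection, missing-value imputation, prototype search and
# low-dimensional visualization of the data.
leaves = rf.apply(X[:100])                                   # (100, n_trees) leaf indices
prox = (leaves[:, None, :] == leaves[None, :, :]).mean(axis=2)
print("Proximity of observations 0 and 1:", prox[0, 1])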