5,421 research outputs found

    Ants constructing rule-based classifiers.

    Classifiers; Data; Data mining; Studies

    GOexpress: an R/Bioconductor package for the identification and visualisation of robust gene ontology signatures through supervised learning of gene expression data

    Background: Identification of gene expression profiles that differentiate experimental groups is critical for discovery and analysis of key molecular pathways and also for selection of robust diagnostic or prognostic biomarkers. While integration of differential expression statistics has been used to refine gene set enrichment analyses, such approaches are typically limited to single gene lists resulting from simple two-group comparisons or time-series analyses. In contrast, functional class scoring and machine learning approaches provide powerful alternative methods to leverage molecular measurements for pathway analyses, and to compare continuous and multi-level categorical factors. Results: We introduce GOexpress, a software package for scoring and summarising the capacity of gene ontology features to simultaneously classify samples from multiple experimental groups. GOexpress integrates normalised gene expression data (e.g., from microarray and RNA-seq experiments) and phenotypic information of individual samples with gene ontology annotations to derive a ranking of genes and gene ontology terms using a supervised learning approach. The default random forest algorithm allows interactions between all experimental factors, and competitive scoring of expressed genes to evaluate their relative importance in classifying predefined groups of samples. Conclusions: GOexpress enables rapid identification and visualisation of ontology-related gene panels that robustly classify groups of samples and supports both categorical (e.g., infection status, treatment) and continuous (e.g., time-series, drug concentrations) experimental factors. The use of standard Bioconductor extension packages and publicly available gene ontology annotations facilitates straightforward integration of GOexpress within existing computational biology pipelines. Funding: Department of Agriculture, Food and the Marine; European Commission - Seventh Framework Programme (FP7); Science Foundation Ireland; University College Dublin.
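    The supervised ranking idea described here can be illustrated without the GOexpress API itself, which the abstract does not quote. The sketch below, assuming a hypothetical samples-by-genes expression table `expr`, group labels `groups`, and a GO-term-to-genes mapping `go_map`, ranks genes by random forest importance and scores each GO term by the mean importance of its member genes, in the spirit of the approach described.

```python
# Illustrative sketch of supervised GO ranking (not the GOexpress implementation):
# rank genes by random forest importance, then score GO terms by member-gene importance.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

def rank_genes_and_go_terms(expr: pd.DataFrame, groups: pd.Series, go_map: dict):
    """expr: samples x genes matrix of normalised expression values.
    groups: experimental group label for each sample.
    go_map: hypothetical mapping of GO term -> list of member gene identifiers."""
    rf = RandomForestClassifier(n_estimators=1000, random_state=0)
    rf.fit(expr.values, groups.values)
    gene_score = pd.Series(rf.feature_importances_, index=expr.columns)
    # Score each GO term by the mean importance of its expressed member genes.
    go_score = pd.Series({
        term: gene_score.reindex(genes).dropna().mean()
        for term, genes in go_map.items()
    })
    return gene_score.sort_values(ascending=False), go_score.sort_values(ascending=False)
```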

    Recent methods from statistics and machine learning for credit scoring

    Credit scoring models are a core tool for financial institutions such as retail and consumer credit banks. Their purpose is to evaluate the likelihood that credit applicants will default, in order to decide whether to grant them credit. The area under the receiver operating characteristic (ROC) curve (AUC) is one of the most commonly used measures of predictive performance in credit scoring. The aim of this thesis is to benchmark different methods for building scoring models with respect to maximizing the AUC. While this measure is used to evaluate the predictive accuracy of the presented algorithms, the AUC is also introduced as a direct optimization criterion. The logistic regression model is the most widely used method for creating credit scorecards and classifying applicants into risk classes. Since this development process, based on the logit model, is standard practice in retail banking, its predictive accuracy serves as the benchmark throughout this thesis. The AUC approach is a central contribution of this work. Instead of maximum likelihood estimation, the AUC itself is used as the objective function and optimized directly. The coefficients are estimated by computing the AUC with the Wilcoxon-Mann-Whitney statistic and using the Nelder-Mead algorithm for the optimization. AUC optimization is a distribution-free approach, which is analyzed in a simulation study to examine the theoretical considerations. It can be shown that the approach still works even if the underlying distribution is not logistic. In addition to the AUC approach and well-known classical methods such as generalized additive models, new methods from statistics and machine learning are evaluated for the credit scoring case. Conditional inference trees, model-based recursive partitioning methods and random forests are presented as recursive partitioning algorithms. Boosting algorithms are also explored, additionally using the AUC as a loss function. The empirical evaluation is based on data from a German bank; 26 attributes from the application scoring are included in the analysis. Besides the AUC, different performance measures are used to evaluate the predictive performance of scoring models. While classification trees cannot improve predictive accuracy in the present credit scoring case, the AUC approach and certain boosting methods outperform the robust classical scoring models in terms of the AUC measure.
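    The direct AUC optimization described above can be sketched as follows: the score is a linear combination of the attributes, the AUC is computed from the Wilcoxon-Mann-Whitney rank-sum statistic, and the coefficients are found with the Nelder-Mead algorithm. Variable names and settings are illustrative, not taken from the thesis.

```python
# Sketch of direct AUC optimization: linear score, AUC via the
# Wilcoxon-Mann-Whitney rank-sum statistic, Nelder-Mead for the coefficients.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import rankdata

def auc_wmw(score, y):
    # AUC = (rank sum of positives - n_pos*(n_pos+1)/2) / (n_pos * n_neg)
    ranks = rankdata(score)
    n_pos, n_neg = int((y == 1).sum()), int((y == 0).sum())
    return (ranks[y == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def fit_auc_model(X, y, beta0=None):
    """Estimate coefficients by maximizing the AUC of the linear score X @ beta."""
    beta0 = np.zeros(X.shape[1]) if beta0 is None else beta0
    res = minimize(lambda b: -auc_wmw(X @ b, y), beta0, method="Nelder-Mead",
                   options={"maxiter": 10000, "xatol": 1e-6, "fatol": 1e-8})
    return res.x  # identified only up to a positive scaling of the score
```

    In practice, coefficients from a logistic regression fit are a natural starting value, since the AUC objective is piecewise constant and a flat start such as the zero vector can slow the simplex search.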

    Ranking users, papers and authors in online scientific communities

    The ever-increasing quantity and complexity of scientific production have made it difficult for researchers to keep track of advances in their own fields. This, together with the growing popularity of online scientific communities, calls for the development of effective information-filtering tools. We propose here a method to simultaneously compute the reputation of users and the quality of scientific artifacts in an online scientific community. Evaluation on artificially generated data and real data from the Econophysics Forum is used to determine the method's best-performing variants. We show that when the method is extended by considering author credit, its performance improves on multiple levels. In particular, top papers have a higher citation count and top authors have a higher h-index than the top papers and top authors chosen by other algorithms.
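    The abstract does not give the update rules, but the general shape of such an iterative reputation-quality scheme can be sketched as follows: paper quality is a reputation-weighted mean of user ratings, and user reputation falls as the user's ratings drift from the consensus quality. The ratings matrix and constants below are hypothetical, not the paper's algorithm.

```python
# Generic sketch of an iterative reputation-quality scheme: paper quality is a
# reputation-weighted mean of ratings; reputation falls with distance from consensus.
import numpy as np

def corank(ratings: np.ndarray, n_iter: int = 200, eps: float = 1e-8):
    """ratings: users x papers matrix, np.nan where a user did not rate a paper.
    Assumes every user has rated at least one paper."""
    rated = ~np.isnan(ratings)
    R = np.nan_to_num(ratings)
    reputation = np.ones(ratings.shape[0])
    quality = np.zeros(ratings.shape[1])
    for _ in range(n_iter):
        w = reputation[:, None] * rated                       # per-rating weights
        quality = (w * R).sum(axis=0) / (w.sum(axis=0) + eps)
        err = np.where(rated, np.abs(R - quality[None, :]), 0.0)
        new_rep = 1.0 / (err.sum(axis=1) / (rated.sum(axis=1) + eps) + eps)
        new_rep /= new_rep.max()
        if np.allclose(new_rep, reputation, atol=1e-12):      # stop at a fixed point
            break
        reputation = new_rep
    return reputation, quality
```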

    A Hybrid Technological Innovation Text Mining, Ensemble Learning and Risk Scorecard Approach for Enterprise Credit Risk Assessment

    Enterprise credit risk assessment models typically use financial information as predictor variables, relying on backward-looking historical information rather than forward-looking information for risk assessment. We propose a novel hybrid assessment of credit risk that uses technological innovation information as a predictor variable. Text mining techniques are used to extract this information for each enterprise. A combination of random forest and extreme gradient boosting is used for indicator screening, and finally a risk scorecard based on logistic regression is used for credit risk scoring. Our results show that technological innovation indicators obtained through text mining provide valuable information for credit risk assessment, and that the combination of random forest and extreme gradient boosting ensembles with logistic regression models outperforms other traditional methods. The best result achieved an area under the receiver operating characteristic curve of 0.9129. In addition, our approach provides meaningful scoring rules for credit risk assessment of technology innovation enterprises.
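    A hedged sketch of the screening-plus-scorecard pipeline described above, with scikit-learn's GradientBoostingClassifier standing in for extreme gradient boosting to keep the example self-contained; the text-derived innovation indicators are assumed to already sit in the feature matrix X, and the function and parameter names are illustrative.

```python
# Sketch of the hybrid pipeline: importance-based indicator screening with two
# ensembles, then a logistic-regression scorecard evaluated by AUC.
import numpy as np
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def screen_and_score(X, y, feature_names, top_k=20):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                              stratify=y, random_state=0)
    rf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X_tr, y_tr)
    gb = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)
    # Screen indicators by the averaged importance of the two ensembles.
    importance = (rf.feature_importances_ + gb.feature_importances_) / 2
    keep = np.argsort(importance)[::-1][:top_k]
    # Fit the scorecard model on the screened indicators and report test AUC.
    scorecard = LogisticRegression(max_iter=1000).fit(X_tr[:, keep], y_tr)
    auc = roc_auc_score(y_te, scorecard.predict_proba(X_te[:, keep])[:, 1])
    return [feature_names[i] for i in keep], scorecard, auc
```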

    Machine learning-driven credit risk: a systemic review

    Credit risk assessment is at the core of modern economies. Traditionally, it is measured by statistical methods and manual auditing. Recent advances in financial artificial intelligence stem from a new wave of machine learning (ML)-driven credit risk models that have gained tremendous attention from both industry and academia. In this paper, we systematically review a series of major research contributions (76 papers) from the past eight years that use statistical, machine learning and deep learning techniques to address the problems of credit risk. Specifically, we propose a novel classification methodology for ML-driven credit risk algorithms and rank their performance using public datasets. We further discuss challenges including data imbalance, dataset inconsistency, model transparency, and inadequate utilization of deep learning models. The results of our review show that: 1) most deep learning models outperform classic machine learning and statistical algorithms in credit risk estimation, and 2) ensemble methods provide higher accuracy than single models. Finally, we present summary tables of the datasets and proposed models.

    Slave to the Algorithm? Why a ‘Right to an Explanation’ Is Probably Not the Remedy You Are Looking For

    Algorithms, particularly machine learning (ML) algorithms, are increasingly important to individuals’ lives, but have caused a range of concerns revolving mainly around unfairness, discrimination and opacity. Transparency in the form of a “right to an explanation” has emerged as a compellingly attractive remedy since it intuitively promises to open the algorithmic “black box” to promote challenge, redress, and hopefully heightened accountability. Amidst the general furore over algorithmic bias we describe, any remedy in a storm has looked attractive. However, we argue that a right to an explanation in the EU General Data Protection Regulation (GDPR) is unlikely to present a complete remedy to algorithmic harms, particularly in some of the core “algorithmic war stories” that have shaped recent attitudes in this domain. Firstly, the law is restrictive, unclear, or even paradoxical concerning when any explanation-related right can be triggered. Secondly, even navigating this, the legal conception of explanations as “meaningful information about the logic of processing” may not be provided by the kind of ML “explanations” computer scientists have developed, partially in response. ML explanations are restricted both by the type of explanation sought, the dimensionality of the domain and the type of user seeking an explanation. However, “subject-centric explanations” (SCEs), focussing on particular regions of a model around a query, show promise for interactive exploration, as do explanation systems based on learning a model from outside rather than taking it apart (pedagogical versus decompositional explanations), in dodging developers’ worries of intellectual property or trade secrets disclosure. Based on our analysis, we fear that the search for a “right to an explanation” in the GDPR may be at best distracting, and at worst nurture a new kind of “transparency fallacy.” But all is not lost. We argue that other parts of the GDPR related (i) to the right to erasure (“right to be forgotten”) and the right to data portability; and (ii) to privacy by design, Data Protection Impact Assessments, and certification and privacy seals, may have the seeds we can use to make algorithms more responsible, explicable, and human-centered.