Search CORE

16 research outputs found

On calibration of nested dichotomies

Author: A Beygelzimer
A Kumar
AH Murphy
CC Chang
F Pedregosa
J Fox
J Platt
K Dembczyński
L Dong
O Russakovsky
P Mahé
R Rifkin
T Hastie
T Leathart
TG Dietterich
V Melnikov
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2019
Field of study

Nested dichotomies (NDs) are used as a method of transforming a multiclass classification problem into a series of binary problems. A tree structure is induced that recursively splits the set of classes into subsets, and a binary classification model learns to discriminate between the two subsets of classes at each node. In this paper, we demonstrate that these NDs typically exhibit poor probability calibration, even when the binary base models are well-calibrated. We also show that this problem is exacerbated when the binary models are poorly calibrated. We discuss the effectiveness of different calibration strategies and show that accuracy and log-loss can be significantly improved by calibrating both the internal base models and the full ND structure, especially when the number of classes is high

Crossref

Research Commons@Waikato

Data Mining and Analysis on Multiple Time Series Object Data

Author: Jiang Chunyu
Publication venue: CORE Scholar
Publication date: 01/01/2007
Field of study

Huge amount of data is available in our society and the need for turning such data into useful information and knowledge is urgent. Data mining is an important field addressing that need and significant progress has been achieved in the last decade. In several important application areas, data arises in the format of Multiple Time Series Object (MTSO) data, where each data object is an array of time series over a large set of features and each has an associated class or state. Very little research has been conducted towards this kind of data. Examples include computational toxicology, where each data object consists of a set of time series over thousands of genes, and operational stress management, where each data object consists of a set of time series over different measuring points on the human body. The purpose of this dissertation is to conduct a systematic data mining study over microarray time series data, with applications on computational toxicology. More specifically, we aim to consider several issues: feature selection algorithms for different classification cases, gene markers or feature set selection for toxic chemical exposure detection, toxic chemical exposure time prediction, wildness concept development and applications, and organizing diversified and parsimonious committee. We will formalize and analyze these research problems, design algorithms to address these problems, and perform experimental evaluations of the proposed algorithms. All these studies are based on microarray time series data set provided by Dr. McDougal

OhioLINK Electronic Thesis and Dissertation Center

CORE

Conformal and Venn Predictors for Multi-probabilistic Predictions and Their Applications

Author: Zhou Chenzhe
Publication venue
Publication date: 01/01/2015
Field of study

Royal Holloway - Pure

Efficient Kernel Methods for Statistical Detection

Author: Su Wanhua
Publication venue: 'University of Waterloo'
Publication date: 20/03/2008
Field of study

This research is motivated by a drug discovery problem -- the AIDS anti-viral database from the National Cancer Institute. The objective of the study is to develop effective statistical methods to model the relationship between the chemical structure of a compound and its activity against the HIV-1 virus. And as a result, the structure-activity model can be used to predict the activity of new compounds and thus helps identify those active chemical compounds that can be used as drug candidates. Since active compounds are generally rare in a compound library, we recognize the drug discovery problem as an application of the so-called statistical detection problem. In a typical statistical detection problem, we have data {Xi,Yi}, where Xi is the predictor vector of the ith observation and Yi={0,1} is its class label. The objective of a statistical detection problem is to identify class-1 observations, which are extremely rare. Besides drug discovery problem, other applications of statistical detection include direct marketing and fraud detection. We propose a computationally efficient detection method called LAGO, which stands for "locally adjusted GO estimator". The original idea is inspired by an ancient game known today as "GO". The construction of LAGO consists of two steps. In the first step, we estimate the density of class 1 with an adaptive bandwidth kernel density estimator. The kernel functions are located at and only at the class-1 observations. The bandwidth of the kernel function centered at a certain class-1 observation is calculated as the average distance between this class-1 observation and its K-nearest class-0 neighbors. In the second step, we adjust the density estimated in the first step locally according to the density of class 0. It can be shown that the amount of adjustment in the second step is approximately inversely proportional to the bandwidth calculated in the first step. Application to the NCI data demonstrates that LAGO is superior to methods such as K nearest neighbors and support vector machines. One drawback of the existing LAGO is that it only provides a point estimate of a test point's possibility of being class 1, ignoring the uncertainty of the model. In the second part of this thesis, we present a Bayesian framework for LAGO, referred to as BLAGO. This Bayesian approach enables quantification of uncertainty. Non-informative priors are adopted. The posterior distribution is calculated over a grid of (K, alpha) pairs by integrating out beta0 and beta1 using the Laplace approximation, where K and alpha are two parameters to construct the LAGO score. The parameters beta0, beta1 are the coefficients of the logistic transformation that converts the LAGO score to the probability scale. BLAGO provides proper probabilistic predictions that have support on (0,1) and captures uncertainty of the predictions as well. By avoiding Markov chain Monte Carlo algorithms and using the Laplace approximation, BLAGO is computationally very efficient. Without the need of cross-validation, BLAGO is even more computationally efficient than LAGO

University of Waterloo's Institutional Repository

Extensions and Applications of Ensemble-of-trees Methods in Machine Learning

Author: Bleich Justin
Publication venue: ScholarlyCommons
Publication date: 01/01/2015
Field of study

Ensemble-of-trees algorithms have emerged to the forefront of machine learning due to their ability to generate high forecasting accuracy for a wide array of regression and classification problems. Classic ensemble methodologies such as random forests (RF) and stochastic gradient boosting (SGB) rely on algorithmic procedures to generate fits to data. In contrast, more recent ensemble techniques such as Bayesian Additive Regression Trees (BART) and Dynamic Trees (DT) focus on an underlying Bayesian probability model to generate the fits. These new probability model-based approaches show much promise versus their algorithmic counterparts, but also offer substantial room for improvement. The first part of this thesis focuses on methodological advances for ensemble-of-trees techniques with an emphasis on the more recent Bayesian approaches. In particular, we focus on extensions of BART in four distinct ways. First, we develop a more robust implementation of BART for both research and application. We then develop a principled approach to variable selection for BART as well as the ability to naturally incorporate prior information on important covariates into the algorithm. Next, we propose a method for handling missing data that relies on the recursive structure of decision trees and does not require imputation. Last, we relax the assumption of homoskedasticity in the BART model to allow for parametric modeling of heteroskedasticity. The second part of this thesis returns to the classic algorithmic approaches in the context of classification problems with asymmetric costs of forecasting errors. First we consider the performance of RF and SGB more broadly and demonstrate its superiority to logistic regression for applications in criminology with asymmetric costs. Next, we use RF to forecast unplanned hospital readmissions upon patient discharge with asymmetric costs taken into account. Finally, we explore the construction of stable decision trees for forecasts of violence during probation hearings in court systems

CiteSeerX

ScholarlyCommons@Penn

Calibrating Margin-Based Classifier Scores into Polychotomous Probabilities

Author: EL ALLWEIN
JAK SUYKENS
M GEBEL
MH DEGROOT
NL JOHNSON
T HASTIE
T ZHANG
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2008
Field of study

Crossref

Simulating Land Use Land Cover Change Using Data Mining and Machine Learning Algorithms

Author: Tayyebi Amin
Publication venue: 'Purdue University (bepress)'
Publication date: 01/01/2013
Field of study

The objectives of this dissertation are to: (1) review the breadth and depth of land use land cover (LUCC) issues that are being addressed by the land change science community by discussing how an existing model, Purdue\u27s Land Transformation Model (LTM), has been used to better understand these very important issues; (2) summarize the current state-of-the-art in LUCC modeling in an attempt to provide a context for the advances in LUCC modeling presented here; (3) use a variety of statistical, data mining and machine learning algorithms to model single LUCC transitions in diverse regions of the world (e.g. United States and Africa) in order to determine which tools are most effective in modeling common LUCC patterns that are nonlinear; (4) develop new techniques for modeling multiple class (MC) transitions at the same time using existing LUCC models as these models are rare and in great demand; (5) reconfigure the existing LTM for urban growth boundary (UGB) simulation because UGB modeling has been ignored by the LUCC modeling community, and (6) compare two rule based models for urban growth boundary simulation for use in UGB land use planning. The review of LTM applications during the last decade indicates that a model like the LTM has addressed a majority of land change science issues although it has not explicitly been used to study terrestrial biodiversity issues. The review of the existing LUCC models indicates that there is no unique typology to differentiate between LUCC model structures and no models exist for UGB. Simulations designed to compare multiple models show that ANN-based LTM results are similar to Multivariate Adaptive Regression Spline (MARS)-based models and both ANN and MARS-based models outperform Classification and Regression Tree (CART)-based models for modeling single LULC transition; however, for modeling MC, an ANN-based LTM-MC is similar in goodness of fit to CART and both models outperform MARS in different regions of the world. In simulations across three regions (two in United States and one in Africa), the LTM had better goodness of fit measures while the outcome of CART and MARS were more interpretable and understandable than the ANN-based LTM. Modeling MC LUCC require the examination of several class separation rules and is thus more complicated than single LULC transition modeling; more research is clearly needed in this area. One of the greatest challenges identified with MC modeling is evaluating error distributions and map accuracies for multiple classes. A modified ANN-based LTM and a simple rule based UGBM outperformed a null model in all cardinal directions. For UGBM model to be useful for planning, other factors need to be considered including a separate routine that would determine urban quantity over time

Purdue E-Pubs

Recommended from our members

An analysis of industrial company failure in the UK and Russia for the 1990s

Author: Isachenkova Natalia
Publication venue: School of Social Sciences Theses
Publication date: 01/01/2004
Field of study

This thesis was submitted for the degree of Doctor of Philosophy and awarded by Brunel University.This thesis provides an examination of the key determinants of industrial company failure in the UK and Russia, for the 1990s. For the UK, some new empirical evidence, presented for the 1990s recession period, is based on binary logit analyses of a cross-section and unbalanced panel of large quoted companies, using accounting-based indicators. Conventional for cross sectional studies empirical design of modelling the failure determinants separately for various risk-horizons, prior to the event of insolvency, is extended here by allowing for unanticipated changes in the nominal interest rate and in the real exchange rate, and also by controlling for the firm's age effect. We find that cross sectional models, conditioned on changes in overall economic conditions, dominate simpler models, utilising financial inputs alone, for comparisons of ex ante, out-of sample classificatory accuracy. Thus, the UK data suggest that for the years before and during the 1990s recession, shifts in the real exchange rate and rises in the nominal interest rate magnified dramatically the risk of failure of highly geared firms. The estimates from the fixed effects models indicate substantial unobserved heterogeneity across members of the panel and reveal that failing UK companies were less liquid, lacked profitability, and had declining net worth. For Russia, the evidence from binary logit is bootstrap-based and controlled by comparison with a similar random sample drawn for the UK over the recession years 1990-91. The Russian data uncover that, unlike in the UK, gearing and liquidity did not appear to explain enterprise liquidation in the mid-1990s, while lower profitability and smaller size were the key determinants of failure risk.Financial support from the ACE Tacis Programme, Contract T95-5127-S in 1996-98, and from the Department of Economics and Finance of Brunel University in 1999-2000 were used in this work

Brunel University Research Archive