16 research outputs found

    On calibration of nested dichotomies

    Nested dichotomies (NDs) are used as a method of transforming a multiclass classification problem into a series of binary problems. A tree structure is induced that recursively splits the set of classes into subsets, and a binary classification model learns to discriminate between the two subsets of classes at each node. In this paper, we demonstrate that these NDs typically exhibit poor probability calibration, even when the binary base models are well-calibrated. We also show that this problem is exacerbated when the binary models are poorly calibrated. We discuss the effectiveness of different calibration strategies and show that accuracy and log-loss can be significantly improved by calibrating both the internal base models and the full ND structure, especially when the number of classes is high.
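
As an illustration of the tree structure described above, the following minimal sketch shows how a nested dichotomy turns per-node binary probabilities into multiclass probabilities by multiplying along each root-to-leaf path. The tree layout, the node names (`root`, `ab`, `cd`), and the probability values are hypothetical and are not taken from the paper; calibration itself is not shown.

```python
from typing import Dict

# A tree is either a class label (a leaf, given as a string) or a tuple
# (node_name, left_subtree, right_subtree). Node names are hypothetical.

def class_probabilities(tree, p_left: Dict[str, float], prob: float = 1.0) -> Dict[str, float]:
    """Multiply binary probabilities along each root-to-leaf path.

    p_left[node_name] is the binary model's estimate that the true class
    lies in the left subset of classes at that node.
    """
    if isinstance(tree, str):
        return {tree: prob}
    name, left, right = tree
    p = p_left[name]
    out = class_probabilities(left, p_left, prob * p)
    out.update(class_probabilities(right, p_left, prob * (1.0 - p)))
    return out

# Four classes split as {A, B} vs {C, D}, then A vs B and C vs D.
tree = ("root", ("ab", "A", "B"), ("cd", "C", "D"))
p_left = {"root": 0.6, "ab": 0.5, "cd": 0.25}  # illustrative values only

probs = class_probabilities(tree, p_left)
print(probs)
```

Because each class probability is a product of several binary estimates, errors at internal nodes can compound along a path, which is one intuition for why the full structure may need calibrating in addition to the base models.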

    Data Mining and Analysis on Multiple Time Series Object Data

    A huge amount of data is available in our society, and the need to turn such data into useful information and knowledge is urgent. Data mining is an important field addressing that need, and significant progress has been achieved in the last decade. In several important application areas, data arises in the format of Multiple Time Series Object (MTSO) data, where each data object is an array of time series over a large set of features and each has an associated class or state. Very little research has been conducted on this kind of data. Examples include computational toxicology, where each data object consists of a set of time series over thousands of genes, and operational stress management, where each data object consists of a set of time series over different measuring points on the human body. The purpose of this dissertation is to conduct a systematic data mining study over microarray time series data, with applications in computational toxicology. More specifically, we aim to consider several issues: feature selection algorithms for different classification cases, gene markers or feature set selection for toxic chemical exposure detection, toxic chemical exposure time prediction, wildness concept development and applications, and organizing a diversified and parsimonious committee. We will formalize and analyze these research problems, design algorithms to address these problems, and perform experimental evaluations of the proposed algorithms. All these studies are based on a microarray time series data set provided by Dr. McDougal.

    Efficient Kernel Methods for Statistical Detection

    This research is motivated by a drug discovery problem: the AIDS anti-viral database from the National Cancer Institute. The objective of the study is to develop effective statistical methods to model the relationship between the chemical structure of a compound and its activity against the HIV-1 virus. As a result, the structure-activity model can be used to predict the activity of new compounds and thus help identify those active chemical compounds that can be used as drug candidates. Since active compounds are generally rare in a compound library, we recognize the drug discovery problem as an application of the so-called statistical detection problem. In a typical statistical detection problem, we have data {Xi,Yi}, where Xi is the predictor vector of the ith observation and Yi ∈ {0,1} is its class label. The objective of a statistical detection problem is to identify class-1 observations, which are extremely rare. Besides drug discovery, other applications of statistical detection include direct marketing and fraud detection. We propose a computationally efficient detection method called LAGO, which stands for "locally adjusted GO estimator". The original idea is inspired by an ancient game known today as "GO". The construction of LAGO consists of two steps. In the first step, we estimate the density of class 1 with an adaptive bandwidth kernel density estimator. The kernel functions are located at and only at the class-1 observations. The bandwidth of the kernel function centered at a certain class-1 observation is calculated as the average distance between this class-1 observation and its K-nearest class-0 neighbors. In the second step, we adjust the density estimated in the first step locally according to the density of class 0. It can be shown that the amount of adjustment in the second step is approximately inversely proportional to the bandwidth calculated in the first step. 
Application to the NCI data demonstrates that LAGO is superior to methods such as K nearest neighbors and support vector machines. One drawback of the existing LAGO is that it only provides a point estimate of a test point's possibility of being class 1, ignoring the uncertainty of the model. In the second part of this thesis, we present a Bayesian framework for LAGO, referred to as BLAGO. This Bayesian approach enables quantification of uncertainty. Non-informative priors are adopted. The posterior distribution is calculated over a grid of (K, alpha) pairs by integrating out beta0 and beta1 using the Laplace approximation, where K and alpha are the two parameters used to construct the LAGO score. The parameters beta0 and beta1 are the coefficients of the logistic transformation that converts the LAGO score to the probability scale. BLAGO provides proper probabilistic predictions that have support on (0,1) and captures the uncertainty of the predictions as well. By avoiding Markov chain Monte Carlo algorithms and using the Laplace approximation, BLAGO is computationally very efficient. Because it does not need cross-validation, BLAGO is even more computationally efficient than LAGO.
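
The first step of LAGO described in this abstract (an adaptive-bandwidth kernel density estimate centered on the class-1 observations) can be sketched roughly as follows. This is only an illustration under stated assumptions: the data are one-dimensional, the kernel is Gaussian, and the function names `adaptive_bandwidths` and `class1_density` are hypothetical. The second, local-adjustment step and the role of alpha are not specified in enough detail here and are not implemented.

```python
import math
from typing import List

def adaptive_bandwidths(class1: List[float], class0: List[float], k: int) -> List[float]:
    """For each class-1 point, average the distances to its k nearest class-0 points."""
    bandwidths = []
    for x1 in class1:
        dists = sorted(abs(x1 - x0) for x0 in class0)
        bandwidths.append(sum(dists[:k]) / k)
    return bandwidths

def class1_density(x: float, class1: List[float], bandwidths: List[float]) -> float:
    """Kernel density estimate with kernels placed only at the class-1 points.

    Assumption: a Gaussian kernel; the abstract does not name the kernel.
    """
    total = 0.0
    for center, h in zip(class1, bandwidths):
        z = (x - center) / h
        total += math.exp(-0.5 * z * z) / (h * math.sqrt(2.0 * math.pi))
    return total / len(class1)

# Toy data, made up for illustration: three rare class-1 points among class-0 points.
class1 = [0.0, 0.2, 5.0]
class0 = [1.0, 2.0, 3.0, 4.0, 6.0, 7.0]

bw = adaptive_bandwidths(class1, class0, k=2)
near = class1_density(0.1, class1, bw)  # close to two class-1 points
far = class1_density(3.0, class1, bw)   # in the middle of class-0 points
print(bw, near, far)
```

In this toy setup, a class-1 point surrounded closely by class-0 points gets a narrow kernel, while one far from class-0 points gets a wide kernel, which is the adaptive behavior the abstract describes.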

    Extensions and Applications of Ensemble-of-trees Methods in Machine Learning

    Ensemble-of-trees algorithms have moved to the forefront of machine learning due to their ability to generate high forecasting accuracy for a wide array of regression and classification problems. Classic ensemble methodologies such as random forests (RF) and stochastic gradient boosting (SGB) rely on algorithmic procedures to generate fits to data. In contrast, more recent ensemble techniques such as Bayesian Additive Regression Trees (BART) and Dynamic Trees (DT) focus on an underlying Bayesian probability model to generate the fits. These new probability model-based approaches show much promise versus their algorithmic counterparts, but also offer substantial room for improvement. The first part of this thesis focuses on methodological advances for ensemble-of-trees techniques with an emphasis on the more recent Bayesian approaches. In particular, we focus on extensions of BART in four distinct ways. First, we develop a more robust implementation of BART for both research and application. We then develop a principled approach to variable selection for BART as well as the ability to naturally incorporate prior information on important covariates into the algorithm. Next, we propose a method for handling missing data that relies on the recursive structure of decision trees and does not require imputation. Last, we relax the assumption of homoskedasticity in the BART model to allow for parametric modeling of heteroskedasticity. The second part of this thesis returns to the classic algorithmic approaches in the context of classification problems with asymmetric costs of forecasting errors. First, we consider the performance of RF and SGB more broadly and demonstrate their superiority to logistic regression for applications in criminology with asymmetric costs. Next, we use RF to forecast unplanned hospital readmissions upon patient discharge with asymmetric costs taken into account. 
Finally, we explore the construction of stable decision trees for forecasts of violence during probation hearings in court systems.

    Calibrating Margin-Based Classifier Scores into Polychotomous Probabilities

    No full text

    Simulating Land Use Land Cover Change Using Data Mining and Machine Learning Algorithms

    The objectives of this dissertation are to: (1) review the breadth and depth of land use land cover change (LUCC) issues that are being addressed by the land change science community by discussing how an existing model, Purdue's Land Transformation Model (LTM), has been used to better understand these very important issues; (2) summarize the current state-of-the-art in LUCC modeling in an attempt to provide a context for the advances in LUCC modeling presented here; (3) use a variety of statistical, data mining and machine learning algorithms to model single LUCC transitions in diverse regions of the world (e.g. United States and Africa) in order to determine which tools are most effective in modeling common LUCC patterns that are nonlinear; (4) develop new techniques for modeling multiple class (MC) transitions at the same time using existing LUCC models, as these models are rare and in great demand; (5) reconfigure the existing LTM for urban growth boundary (UGB) simulation because UGB modeling has been ignored by the LUCC modeling community; and (6) compare two rule based models for urban growth boundary simulation for use in UGB land use planning. The review of LTM applications during the last decade indicates that a model like the LTM has addressed a majority of land change science issues, although it has not explicitly been used to study terrestrial biodiversity issues. The review of the existing LUCC models indicates that there is no unique typology to differentiate between LUCC model structures and that no models exist for UGB. Simulations designed to compare multiple models show that ANN-based LTM results are similar to Multivariate Adaptive Regression Spline (MARS)-based models and both ANN and MARS-based models outperform Classification and Regression Tree (CART)-based models for modeling single LULC transition; however, for modeling MC, an ANN-based LTM-MC is similar in goodness of fit to CART and both models outperform MARS in different regions of the world. 
In simulations across three regions (two in the United States and one in Africa), the LTM had better goodness of fit measures, while the outcomes of CART and MARS were more interpretable and understandable than those of the ANN-based LTM. Modeling MC LUCC requires the examination of several class separation rules and is thus more complicated than single LULC transition modeling; more research is clearly needed in this area. One of the greatest challenges identified with MC modeling is evaluating error distributions and map accuracies for multiple classes. A modified ANN-based LTM and a simple rule based UGBM outperformed a null model in all cardinal directions. For the UGBM to be useful for planning, other factors need to be considered, including a separate routine that would determine urban quantity over time.