    The VGAM Package for Categorical Data Analysis

    Classical categorical regression models such as the multinomial logit and proportional odds models are shown to be readily handled by the vector generalized linear and additive model (VGLM/VGAM) framework. Additionally, there are natural extensions, such as reduced-rank VGLMs for dimension reduction, and allowing covariates that have values specific to each linear/additive predictor, e.g., for consumer choice modeling. This article describes some of the framework behind the VGAM R package, its usage and implementation details.
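
    As a quick, hedged illustration of the kind of usage the article describes, the R sketch below fits a multinomial logit and a proportional odds model with VGAM on simulated data; the data and formula are placeholders, not the article's own examples.

        ## Minimal VGAM sketch on simulated data (placeholder, not the article's examples)
        library(VGAM)

        set.seed(1)
        n <- 500
        x <- rnorm(n)
        y <- factor(sample(c("low", "medium", "high"), n, replace = TRUE,
                           prob = c(0.5, 0.3, 0.2)),
                    levels = c("low", "medium", "high"))
        dat <- data.frame(x = x, y = y)

        # Multinomial logit: one linear predictor per non-reference response level
        fit.mlogit <- vglm(y ~ x, family = multinomial, data = dat)

        # Proportional odds: parallel cumulative-logit predictors
        fit.po <- vglm(y ~ x, family = cumulative(parallel = TRUE), data = dat)

        summary(fit.po)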

    Categorical data analysis using a skewed Weibull regression model

    In this paper, we present a Weibull link (skewed) model for categorical response data arising from binomial as well as multinomial models. We show that, for such types of categorical data, the most commonly used models (logit, probit and complementary log-log) can be obtained as limiting cases. We further compare the proposed model with some other asymmetrical models. Bayesian as well as frequentist estimation procedures for binomial and multinomial responses are presented in detail. Two data sets are analyzed to demonstrate the efficiency of the proposed model.
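
    To make the "limiting cases" claim concrete, the sketch below uses a generic Weibull-CDF inverse link on the log scale; this parameterization is an assumption for illustration and is not necessarily the one used in the paper.

        ## Generic Weibull-CDF inverse link: p(eta) = 1 - exp(-(exp(eta))^gamma)
        ## (illustrative parameterization, not necessarily the paper's)
        weibull_inv_link <- function(eta, gamma = 1) 1 - exp(-exp(eta)^gamma)

        eta <- seq(-3, 3, by = 0.5)
        cloglog_inv <- 1 - exp(-exp(eta))   # complementary log-log inverse link

        # With gamma = 1 this Weibull-type link reduces exactly to cloglog
        all.equal(weibull_inv_link(eta, gamma = 1), cloglog_inv)

        # Other shape values skew the response curve, e.g. at eta = -1
        round(weibull_inv_link(-1, gamma = c(0.5, 1, 2)), 3)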

    BIOS 6531 - Categorical Data Analysis

    This course introduces statistical methods for analyzing both univariate and multivariate categorical and count data in medical research and other health-related fields. The course will introduce how to distinguish among the different measurement scales and will cover the commonly used probability distributions and inference methods for categorical and count data. Emphasis will be placed on the application of the methodology and computational aspects rather than on theory. Students will learn how to apply SAS procedures to data and interpret the results.

    BIOS 6531 - Categorical Data Analysis

    (taken from 2020-21 Course Catalog): This course introduces statistical methods for analyzing both univariate and multivariate categorical and count data in medical research and other health-related fields. The course will introduce how to distinguish among the different measurement scales and will cover the commonly used probability distributions and inference methods for categorical and count data. Emphasis will be placed on the application of the methodology and computational aspects rather than on theory. Students will learn how to apply SAS procedures to data and interpret the results.

    PUBH 6541 - Categorical Data Analysis

    (taken from 2012-13 Course Catalog): This course introduces statistical methods for analyzing both univariate and multivariate categorical and count data in medical research and other health-related fields. The course will introduce how to distinguish among the different measurement scales and will cover the commonly used probability distributions and inference methods for categorical and count data. Emphasis will be placed on the application of the methodology and computational aspects rather than on theory. Students will learn how to apply SAS procedures to data and interpret the results.

    BIOS 6531 – Categorical Data Analysis

    (taken from 2017-18 Course Catalog): This course introduces statistical methods for analyzing both univariate and multivariate categorical and count data in medical research and other health-related fields. The course will introduce how to distinguish among the different measurement scales and will cover the commonly used probability distributions and inference methods for categorical and count data. Emphasis will be placed on the application of the methodology and computational aspects rather than on theory. Students will learn how to apply SAS procedures to data and interpret the results.

    ProbCD: enrichment analysis accounting for categorization uncertainty

    As in many other areas of science, systems biology makes extensive use of statistical association and significance estimates in contingency tables, a type of categorical data analysis known in this field as enrichment (also over-representation or enhancement) analysis. In spite of efforts to create probabilistic annotations, especially in the Gene Ontology context, or to deal with uncertainty in high-throughput datasets, current enrichment methods largely ignore this probabilistic information, since they are mainly based on variants of Fisher's exact test. We developed ProbCD, an open-source R package for probabilistic categorical data analysis that does not require a static contingency table. The contingency table for the enrichment problem is built using the expectation of a Bernoulli scheme stochastic process given the categorization probabilities. An on-line interface was created to allow usage by non-programmers and is available at: http://xerad.systemsbiology.net/ProbCD/. We present an analysis framework and software tools to address the issue of uncertainty in categorical data analysis. In particular, for enrichment analysis, ProbCD can accommodate: (i) the stochastic nature of high-throughput experimental techniques and (ii) probabilistic gene annotation.
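
    As a rough sketch of the construction described in the abstract (not the ProbCD interface itself), an "expected" 2x2 enrichment table can be obtained by summing Bernoulli expectations over per-gene annotation probabilities; the inputs below are simulated placeholders.

        ## Expected 2x2 enrichment table from annotation probabilities
        ## (simulated inputs; this is not the ProbCD API)
        set.seed(1)
        n_genes <- 1000
        p_annot <- runif(n_genes)            # P(gene i carries the annotation)
        in_list <- rbinom(n_genes, 1, 0.3)   # 1 if gene i is in the study list

        E <- matrix(c(sum(p_annot * in_list),       sum((1 - p_annot) * in_list),
                      sum(p_annot * (1 - in_list)), sum((1 - p_annot) * (1 - in_list))),
                    nrow = 2, byrow = TRUE,
                    dimnames = list(list = c("in", "out"), annot = c("yes", "no")))
        E   # expected counts rather than hard 0/1 annotation assignments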

    RANDOMIZATION BASED PRIVACY PRESERVING CATEGORICAL DATA ANALYSIS

    The success of data mining relies on the availability of high-quality data. To ensure quality data mining, effective information sharing between organizations becomes a vital requirement in today's society. Since data mining often involves sensitive information of individuals, the public has expressed a deep concern about privacy. Privacy-preserving data mining is the study of eliminating privacy threats while, at the same time, preserving useful information in the released data for data mining. This dissertation investigates data utility and privacy of randomization-based models in privacy-preserving data mining for categorical data. For the analysis of data utility in the randomization model, we first investigate the accuracy analysis for association rule mining in market basket data. We then propose a general framework for theoretical analysis of how the randomization process affects the accuracy of various measures adopted in categorical data analysis. We also examine data utility when randomization mechanisms are not provided to data miners, in order to achieve better privacy. We investigate how various objective association measures between two variables may be affected by randomization, and then extend this to multiple variables by examining the feasibility of hierarchical loglinear modeling. Our results provide a reference for data miners about what they can and cannot do with certainty on randomized data directly, without knowledge of the original data distribution and the distortion information. Data privacy and data utility are commonly considered a pair of conflicting requirements in privacy-preserving data mining applications. In this dissertation, we investigate privacy issues in randomization models. In particular, we focus on attribute disclosure under linking attacks in data publishing. We propose efficient solutions for determining optimal distortion parameters such that utility preservation is maximized while privacy requirements are still satisfied. We compare our randomization approach with l-diversity and anatomy in terms of utility preservation (under the same privacy requirements) from three aspects: reconstructed distributions, accuracy of answering queries, and preservation of correlations. Our empirical results show that randomization incurs significantly smaller utility loss.
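
    To make the randomization-then-reconstruction idea concrete, here is a generic randomized-response sketch for a single binary attribute with a moment-based estimate of the true prevalence; the scheme and its parameters are illustrative assumptions, not the dissertation's specific mechanisms.

        ## Generic randomized-response sketch for one binary attribute
        ## (illustrative scheme, not the dissertation's specific mechanism)
        set.seed(42)
        n <- 5000
        true_attr <- rbinom(n, 1, 0.2)   # sensitive attribute, true prevalence 0.2
        p_keep <- 0.8                    # probability the true value is reported

        flip <- rbinom(n, 1, 1 - p_keep)
        reported <- ifelse(flip == 1, 1 - true_attr, true_attr)

        # Reconstruction: E[reported] = pi * p_keep + (1 - pi) * (1 - p_keep)
        lambda_hat <- mean(reported)
        pi_hat <- (lambda_hat - (1 - p_keep)) / (2 * p_keep - 1)
        pi_hat   # should be close to 0.2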