
    Analysis of Mixed Outcomes: Misclassified Binary Responses and Measurement Error in Covariates

    The focus of this paper is on regression models for mixed binary and continuous outcomes when the true predictor is measured with error and the binary responses are subject to classification errors. A latent variable is used to model the binary response, and the joint distribution is expressed as the product of the marginal distribution of the continuous response and the conditional distribution of the binary response given the continuous response. Models are proposed to incorporate the measurement error and/or the classification errors, and likelihood-based analysis is performed to estimate the regression parameters of interest. Theoretical results give the bias of the likelihood estimates of the model parameters, and an extensive simulation study investigates the effect of ignoring classification errors and/or measurement error on those estimates. The methodology is illustrated with a data set obtained from a small-scale survey.
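    When a binary response is misclassified with false-positive rate fp and false-negative rate fn, the probability of *observing* a positive response becomes (1 - fn)p(x) + fp(1 - p(x)). A minimal simulation sketch of the phenomenon the abstract describes (all rates, parameter values, and the logistic link are illustrative assumptions, not taken from the paper): ignoring the classification errors attenuates the slope estimate, while folding them into the likelihood recovers it.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit

rng = np.random.default_rng(0)
n = 20000
x = rng.normal(size=n)
b0_true, b1_true = -0.5, 1.0
y = rng.binomial(1, expit(b0_true + b1_true * x))      # true binary response

# assumed known classification error rates (illustrative values)
fp, fn = 0.05, 0.10
flip = np.where(y == 1, rng.random(n) < fn, rng.random(n) < fp)
y_obs = np.where(flip, 1 - y, y)                       # observed, misclassified response

def nll(b, adjust):
    p = expit(b[0] + b[1] * x)                         # P(true response = 1 | x)
    if adjust:                                         # P(observed response = 1 | x)
        p = (1 - fn) * p + fp * (1 - p)
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -np.sum(y_obs * np.log(p) + (1 - y_obs) * np.log(1 - p))

naive = minimize(nll, [0.0, 0.0], args=(False,)).x     # ignores misclassification
adjusted = minimize(nll, [0.0, 0.0], args=(True,)).x   # models it in the likelihood
print(naive[1], adjusted[1])                           # naive slope is pulled toward zero
```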

    Gibbs Max-margin Topic Models with Data Augmentation

    Max-margin learning is a powerful approach to building classifiers and structured output predictors. Recent work on max-margin supervised topic models has successfully integrated it with Bayesian topic models to discover discriminative latent semantic structures and make accurate predictions on unseen test data. However, the resulting learning problems are usually hard to solve because of the non-smoothness of the margin loss. Existing approaches to building max-margin supervised topic models rely on an iterative procedure that solves multiple latent SVM subproblems under additional mean-field assumptions on the desired posterior distributions. This paper presents an alternative approach by defining a new max-margin loss. Namely, we present Gibbs max-margin supervised topic models, a latent variable Gibbs classifier that discovers hidden topic representations for various tasks, including classification, regression, and multi-task learning. Gibbs max-margin supervised topic models minimize an expected margin loss, which is an upper bound of the existing margin loss derived from an expected prediction rule. By introducing augmented variables and integrating out the Dirichlet variables analytically by conjugacy, we develop simple Gibbs sampling algorithms with no restricting assumptions and no need to solve SVM subproblems. Furthermore, each step of the "augment-and-collapse" Gibbs sampling algorithms has an analytical conditional distribution from which samples can be easily drawn. Experimental results demonstrate significant improvements in time efficiency. The classification performance is also significantly improved over competitors on binary, multi-class, and multi-label classification tasks.
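    The augmentation referred to rests on writing the exponentiated hinge loss as a Gaussian scale mixture over an augmented variable λ (the Polson-Scott identity): exp(-2 max(u, 0)) = ∫₀^∞ (2πλ)^(-1/2) exp(-(λ + u)²/(2λ)) dλ. A quick numerical check of this identity (a verification sketch, not code from the paper):

```python
import numpy as np
from scipy.integrate import quad

def hinge_pseudo_likelihood(u):
    """exp(-2 max(u, 0)): the unnormalized margin-loss pseudo-likelihood."""
    return np.exp(-2.0 * max(u, 0.0))

def scale_mixture(u):
    """Integrate the Gaussian scale mixture over the augmented variable lam."""
    f = lambda lam: np.exp(-(lam + u) ** 2 / (2.0 * lam)) / np.sqrt(2.0 * np.pi * lam)
    val, _ = quad(f, 0.0, np.inf)
    return val

for u in (-1.0, 0.5, 2.0):
    print(u, hinge_pseudo_likelihood(u), scale_mixture(u))  # the two columns agree
```

Because the mixture is Gaussian in the model parameters for each fixed λ, sampling λ and the parameters in turn yields the closed-form conditionals the abstract mentions.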

    Variational Bayesian multinomial probit regression with Gaussian process priors

    It is well known in the statistics literature that augmenting binary and polychotomous response models with Gaussian latent variables enables exact Bayesian analysis via Gibbs sampling from the parameter posterior. By adopting such a data augmentation strategy, dispensing with priors over regression coefficients in favour of Gaussian Process (GP) priors over functions, and employing variational approximations to the full posterior, we obtain efficient computational methods for Gaussian Process classification in the multi-class setting. The model augmentation with additional latent variables ensures full a posteriori class coupling whilst retaining the simple a priori independent GP covariance structure, from which sparse approximations, such as multi-class Informative Vector Machines (IVM), emerge in a very natural and straightforward manner. This is the first fully Variational Bayesian treatment of multi-class GP classification that does not resort to additional explicit approximations of the non-Gaussian likelihood term. Empirical comparisons with exact analysis via MCMC and with Laplace approximations illustrate the utility of the variational approximation as a computationally economical alternative to full MCMC, and it is shown to be more accurate than the Laplace approximation.
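    In its parametric form, the augmentation strategy the abstract builds on is the Albert-Chib sampler: each binary response is backed by a Gaussian latent variable whose sign it records, so both the latents and the coefficients have closed-form conditionals. A minimal sketch for linear-predictor probit regression (the paper replaces the coefficient prior with a GP prior over functions; sample sizes, priors, and true coefficients here are illustrative):

```python
import numpy as np
from scipy.stats import truncnorm

rng = np.random.default_rng(1)
n, d = 500, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])   # intercept + one covariate
beta_true = np.array([0.3, 1.2])                        # illustrative coefficients
y = (X @ beta_true + rng.normal(size=n) > 0).astype(int)

tau2 = 100.0                                            # vague N(0, tau2 I) prior on beta
V = np.linalg.inv(X.T @ X + np.eye(d) / tau2)           # conditional posterior covariance

beta, draws = np.zeros(d), []
for it in range(600):
    mu = X @ beta
    # z_i | y_i, beta ~ N(mu_i, 1), truncated to the half-line that matches y_i
    lo = np.where(y == 1, -mu, -np.inf)
    hi = np.where(y == 1, np.inf, -mu)
    z = mu + truncnorm.rvs(lo, hi, random_state=rng)
    # beta | z ~ N(V X'z, V), by normal-normal conjugacy
    beta = rng.multivariate_normal(V @ (X.T @ z), V)
    if it >= 100:                                       # discard burn-in draws
        draws.append(beta)
post_mean = np.mean(draws, axis=0)
print(post_mean)
```

Every conditional is exact, which is what makes "exact Bayesian analysis via Gibbs sampling" possible here; the variational treatment in the paper trades this for speed in the multi-class GP setting.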

    randomLCA: An R Package for Latent Class with Random Effects Analysis

    Latent class analysis is a method for classifying subjects, originally based on binary outcome data but now extended to other data types. A major difficulty with latent class models is heterogeneity of the outcome probabilities within the true classes, which violates the assumption of conditional independence and requires a large number of classes to model the association in the data, making interpretation difficult. A solution is to include a normally distributed subject-level random effect in the model, so that the outcomes are conditionally independent given both the class and the random effect. A further extension incorporates an additional period-level random effect when subjects are observed over time. The use of the randomLCA R package is demonstrated on three latent class examples: classification of subjects based on myocardial infarction symptoms, a diagnostic testing approach to comparing dentists in the diagnosis of dental caries, and classification of infants based on respiratory and allergy symptoms over time.
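    randomLCA is an R package; for readers who want the underlying mechanics, the core latent class model for binary outcomes (without the random effects the package adds) can be fitted with a short EM loop. A minimal Python sketch on simulated data (class sizes and item probabilities are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
n, J = 2000, 5
cls = rng.binomial(1, 0.4, n)                          # latent class (40% in class 1)
probs = np.where(cls == 1, 0.8, 0.2)                   # same outcome prob for all J items
Y = rng.binomial(1, probs[:, None], size=(n, J))       # conditionally independent outcomes

pi = np.array([0.5, 0.5])                              # mixing weights
p = np.array([[0.3] * J, [0.7] * J])                   # per-class item probabilities
for _ in range(200):
    # E-step: responsibility of each class for each subject
    ll = Y @ np.log(p.T) + (1 - Y) @ np.log(1 - p.T)   # n x 2 conditional log-likelihoods
    r = pi * np.exp(ll)
    r /= r.sum(axis=1, keepdims=True)
    # M-step: update mixing weights and conditional outcome probabilities
    pi = r.mean(axis=0)
    p = np.clip((r.T @ Y) / r.sum(axis=0)[:, None], 1e-6, 1 - 1e-6)
```

The heterogeneity problem in the abstract arises exactly when the fitted p cannot be constant within a class; the package's subject-level random effect lets p vary around a class mean instead of adding more classes.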

    Levels of disability in the older population of England: Comparing binary and ordinal classifications.

    BACKGROUND: Recent studies suggest the importance of distinguishing severity levels of disability. Nevertheless, there is not yet a consensus with regard to an optimal classification. OBJECTIVE: Our study seeks to advance the existing binary definitions towards categorical/ordinal manifestations of disability. METHODS: We define disability according to the WHO's International Classification of Functioning, Disability and Health (ICF), using data collected at the baseline wave of the English Longitudinal Study of Ageing, a longitudinal study of the non-institutionalized population living in England. First, we identify cut-off points in the continuous disability score derived from the ICF to distinguish disabled from non-disabled participants. Then, we fit latent class models to the same data to find the optimal number of disability classes according to: (i) model fit indicators; (ii) estimated probabilities of each disability item; and (iii) association of the predicted disability classes with observed health and mortality. RESULTS: According to the binary classification criteria, about 32% of both men and women are classified as disabled. No optimal number of classes emerged from the latent class models according to the model fit indicators. However, the other two criteria suggest that the best-fitting model of disability severity has four classes. CONCLUSIONS: Our findings contribute to the debate on the usefulness and relevance of adopting a finer categorization of disability by showing that binary indicators of disability averaged the burden of disability, masked the very strong effect experienced by individuals with severe disability, and were not informative for low levels of disability.

    Modelling Instance-Level Annotator Reliability for Natural Language Labelling Tasks

    When constructing models that learn from noisy labels produced by multiple annotators, it is important to accurately estimate the reliability of annotators. Annotators may provide labels of inconsistent quality due to their varying expertise and reliability in a domain. Previous studies have mostly focused on estimating each annotator's overall reliability on the entire annotation task. However, in practice, the reliability of an annotator may depend on each specific instance. Only a limited number of studies have investigated modelling per-instance reliability, and these considered only binary labels. In this paper, we propose an unsupervised model which can handle both binary and multi-class labels. It can automatically estimate the per-instance reliability of each annotator and the correct label for each instance. We specify our model as a probabilistic model which incorporates neural networks to model the dependency between latent variables and instances. For evaluation, the proposed method is applied to both synthetic and real data, including two labelling tasks: text classification and textual entailment. Experimental results demonstrate that our method can not only accurately estimate the reliability of annotators across different instances, but also achieve superior performance in predicting the correct labels and detecting the least reliable annotators compared to state-of-the-art baselines. (NAACL 2019)
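    The classical starting point for unsupervised reliability estimation is the Dawid-Skene family of models, which learns a single reliability parameter per annotator; the model above makes that reliability a function of the instance. A minimal binary-label EM sketch of the classical per-annotator version (simulated annotators with assumed accuracies, not data from the paper):

```python
import numpy as np

rng = np.random.default_rng(3)
n_items, n_annot = 1000, 5
acc_true = np.array([0.95, 0.90, 0.85, 0.80, 0.55])   # assumed per-annotator accuracies
z = rng.binomial(1, 0.5, n_items)                     # latent true labels
agree = rng.random((n_items, n_annot)) < acc_true     # whether each label is correct
L = np.where(agree, z[:, None], 1 - z[:, None])       # observed noisy labels

q = L.mean(axis=1)                                    # P(z_i = 1), init from vote share
prior = 0.5
for _ in range(50):
    # M-step: each annotator's accuracy = expected agreement with current beliefs
    exp_agree = q[:, None] * (L == 1) + (1 - q)[:, None] * (L == 0)
    acc = np.clip(exp_agree.mean(axis=0), 1e-6, 1 - 1e-6)
    # E-step: posterior over each item's true label, annotators weighted by accuracy
    log1 = np.log(np.where(L == 1, acc, 1 - acc)).sum(axis=1) + np.log(prior)
    log0 = np.log(np.where(L == 0, acc, 1 - acc)).sum(axis=1) + np.log(1 - prior)
    q = 1.0 / (1.0 + np.exp(log0 - log1))
    prior = np.clip(q.mean(), 1e-6, 1 - 1e-6)

labels = (q > 0.5).astype(int)                        # inferred correct labels
```

The least reliable annotator is the one with the smallest estimated accuracy; the paper's contribution is to let acc vary per instance via a neural network rather than being a single number per annotator.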

    Latent Variable Generalized Linear Models

    Generalized Linear Models (GLMs) (McCullagh and Nelder, 1989) provide a unified framework for fixed effect models where response data arise from exponential family distributions. Much recent research has attempted to extend the framework to include random effects in the linear predictors. Different methodologies have been employed to solve different motivating problems, for example Generalized Linear Mixed Models (Clayton, 1994) and Multilevel Models (Goldstein, 1995). A thorough review and classification of this and related material is presented. In Item Response Theory (IRT), subjects are tested using banks of pre-calibrated test items. A useful model is based on the logistic function, with a binary response dependent on the unknown ability of the subject; item parameters contribute to the probability of a correct response. Within the framework of the GLM, a latent variable, the unknown ability, is introduced as a new component of the linear predictor. This approach affords the opportunity to structure intercept and slope parameters so that item characteristics are represented. A methodology for fitting such GLMs with latent variables, based on the EM algorithm (Dempster, Laird and Rubin, 1977) and using the standard Generalized Linear Model fitting software GLIM (Payne, 1987) to perform the expectation step, is developed and applied to a model for binary response data. Accurate numerical integration to evaluate the likelihood functions is a vital part of the computational process. A study of the comparative benefits of two different integration strategies is undertaken and leads to the adoption, unusually, of Gauss-Legendre rules. It is shown how the fitting algorithms are implemented with GLIM programs which incorporate FORTRAN subroutines. Examples from IRT are given. A simulation study is undertaken to investigate the sampling distributions of the estimators and the effect of certain numerical attributes of the computational process. Finally, a generalized latent variable model is developed for responses from any exponential family distribution.
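    The quadrature step can be illustrated for a single subject in a Rasch-type model: the marginal likelihood integrates the product of item response probabilities against the N(0, 1) ability density, here with a Gauss-Legendre rule rescaled to [-6, 6]. A sketch in plain Python rather than GLIM/FORTRAN (the item difficulties and response pattern are invented for the example):

```python
import numpy as np
from scipy.special import expit

def marginal_loglik(y, b, m=41):
    """Log marginal likelihood of one subject's binary responses y under a
    Rasch model with item difficulties b, with ability ~ N(0, 1) integrated
    out by an m-point Gauss-Legendre rule rescaled from [-1, 1] to [-6, 6]."""
    t, w = np.polynomial.legendre.leggauss(m)
    theta, w = 6.0 * t, 6.0 * w                            # rescale nodes and weights
    phi = np.exp(-0.5 * theta ** 2) / np.sqrt(2 * np.pi)   # standard normal density
    p = expit(theta[:, None] - b[None, :])                 # P(correct | theta, item)
    lik = np.prod(np.where(y == 1, p, 1 - p), axis=1)      # conditional likelihood at nodes
    return np.log(np.sum(w * phi * lik))

y = np.array([1, 1, 0, 1, 0])                              # illustrative response pattern
b = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])                  # illustrative item difficulties
print(marginal_loglik(y, b))
```

Because the integrand is smooth, the Gauss-Legendre approximation stabilizes quickly as the number of nodes grows, which is the property the integration study in the thesis exploits.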