
    Learning with many experts: model selection and sparsity

    Experts classifying data are often imprecise. Recently, several models have been proposed to train classifiers using the noisy labels generated by these experts. How should one choose between these models? In such situations the true labels are unavailable, so model selection cannot be performed with the standard versions of methods such as empirical risk minimization and cross-validation. To allow model selection, we present a surrogate loss and provide theoretical guarantees of its consistency. Next, we discuss how this loss can be used to tune a penalization that introduces sparsity in the parameters of a traditional class of models. Sparsity yields more parsimonious models and can avoid overfitting. Nevertheless, it has seldom been discussed in the context of noisy labels because model selection, and therefore the choice of tuning parameters, is difficult. We apply these techniques to several sets of simulated and real data.
    Comment: This is the pre-peer reviewed version.
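
    The surrogate loss itself is not given in the abstract. As a rough illustration of the workflow it describes, the sketch below tunes an L1 (sparsity-inducing) penalty when only noisy expert labels are available, scoring each candidate penalty against the experts' soft majority vote on a validation split. The simulated experts, the proxy criterion, and all names are placeholder assumptions, not the paper's surrogate loss.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Hypothetical setup: 5 imprecise experts labelling 2000 points with 30 features.
n, d, n_experts = 2000, 30, 5
X = rng.normal(size=(n, d))
beta_true = np.concatenate([rng.normal(size=5), np.zeros(d - 5)])   # sparse ground truth
y_true = rng.binomial(1, 1 / (1 + np.exp(-X @ beta_true)))          # never observed in practice
flip = rng.uniform(0.1, 0.3, size=n_experts)                        # per-expert error rates
Y_noisy = np.stack([np.where(rng.uniform(size=n) < f, 1 - y_true, y_true)
                    for f in flip], axis=1)

# Soft majority vote across experts; used here as a stand-in selection target,
# NOT the surrogate loss proposed in the paper.
q = Y_noisy.mean(axis=1)
X_tr, X_val, q_tr, q_val = train_test_split(X, q, random_state=1)

def proxy_loss(p, q):
    """Cross-entropy of predicted probabilities p against the soft vote q."""
    p = np.clip(p, 1e-6, 1 - 1e-6)
    return float(-np.mean(q * np.log(p) + (1 - q) * np.log(1 - p)))

best = None
for C in [0.01, 0.03, 0.1, 0.3, 1.0, 3.0]:       # smaller C = stronger L1 penalty = sparser fit
    clf = LogisticRegression(penalty="l1", C=C, solver="liblinear")
    clf.fit(X_tr, (q_tr > 0.5).astype(int))      # fit on the hard majority vote
    loss = proxy_loss(clf.predict_proba(X_val)[:, 1], q_val)
    nonzero = int(np.sum(clf.coef_ != 0))
    if best is None or loss < best[0]:
        best = (loss, C, nonzero)

print(f"selected C={best[1]} with {best[2]} nonzero coefficients (proxy loss {best[0]:.3f})")
```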

    Pharmacoepidemiol Drug Saf

    Purpose: To estimate the accuracy of two algorithms for identifying cholecystectomy procedures in administrative data using International Classification of Diseases, 9th Edition, Clinical Modification (ICD-9-CM) and Current Procedural Terminology (CPT-4) codes. Methods: Private insurer medical claims for 30,853 patients aged 18–64 years with an inpatient hospitalization between 2006 and 2010, as indicated by the providers'/facilities' place of service together with room and board charges, were cross-classified according to the presence of codes for cholecystectomy. The accuracy of the ICD-9-CM- and CPT-4-based algorithms was estimated using a Bayesian latent class model. Results: The sensitivity and specificity were 0.92 [probability interval (PI): 0.92, 0.92] and 0.99 (PI: 0.97, 0.99) for the ICD-9-CM-based algorithm, and 0.93 (PI: 0.92, 0.93) and 0.99 (PI: 0.97, 0.99) for the CPT-4-based algorithm, respectively. The parallel-joint scheme, in which positivity of either algorithm was considered a positive outcome, yielded a sensitivity and specificity of 0.99 (PI: 0.99, 0.99) and 0.97 (PI: 0.95, 0.99), respectively. Conclusions: Both ICD-9-CM- and CPT-4-based algorithms had high sensitivity for identifying cholecystectomy procedures in administrative data when used individually, and especially in a parallel-joint approach.
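
    As a sketch of how such a Bayesian latent class model can be set up, the PyMC code below treats true cholecystectomy status as a latent variable and models the 2x2 cross-classification of the two coding algorithms under conditional independence. The counts and the vague priors are placeholders, not the study's data or prior choices; with two tests in one population, informative priors are needed in practice to identify all five parameters.

```python
import arviz as az
import numpy as np
import pymc as pm

# Hypothetical 2x2 cross-classification of the two coding algorithms
# (order: both positive, ICD-9-CM only, CPT-4 only, both negative).
# These counts are made up for illustration; they are not the study's data.
counts = np.array([2600, 190, 220, 27843])

with pm.Model() as latent_class:
    prev = pm.Beta("prevalence", 1, 1)   # P(true cholecystectomy); vague placeholder prior
    se_icd = pm.Beta("sens_icd", 2, 1)
    sp_icd = pm.Beta("spec_icd", 2, 1)
    se_cpt = pm.Beta("sens_cpt", 2, 1)
    sp_cpt = pm.Beta("spec_cpt", 2, 1)

    # Cell probabilities, assuming the two algorithms are conditionally
    # independent given the latent true procedure status.
    p = pm.math.stack([
        prev * se_icd * se_cpt + (1 - prev) * (1 - sp_icd) * (1 - sp_cpt),
        prev * se_icd * (1 - se_cpt) + (1 - prev) * (1 - sp_icd) * sp_cpt,
        prev * (1 - se_icd) * se_cpt + (1 - prev) * sp_icd * (1 - sp_cpt),
        prev * (1 - se_icd) * (1 - se_cpt) + (1 - prev) * sp_icd * sp_cpt,
    ])
    pm.Multinomial("obs", n=counts.sum(), p=p, observed=counts)

    # With two tests in a single population the likelihood has fewer degrees of
    # freedom than parameters, so informative priors are needed in real analyses.
    idata = pm.sample(2000, tune=1000, chains=4, random_seed=0)

print(az.summary(idata, var_names=["prevalence", "sens_icd", "sens_cpt"]))
```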

    A survey of statistical network models

    Networks are ubiquitous in science and have become a focal point for discussion in everyday life. Formal statistical models for the analysis of network data have emerged as a major topic of interest in diverse areas of study, and most of these involve a form of graphical representation. Probability models on graphs date back to 1959. Along with empirical studies in social psychology and sociology from the 1960s, these early works generated an active network community and a substantial literature in the 1970s. This effort moved into the statistical literature in the late 1970s and 1980s, and the past decade has seen a burgeoning network literature in statistical physics and computer science. The growth of the World Wide Web, the emergence of online networking communities such as Facebook, MySpace, and LinkedIn, and a host of more specialized professional network communities have intensified interest in the study of networks and network data. Our goal in this review is to provide the reader with an entry point to this burgeoning literature. We begin with an overview of the historical development of statistical network modeling and then introduce a number of examples that have been studied in the network literature. Our subsequent discussion focuses on a number of prominent static and dynamic network models and their interconnections. We emphasize formal model descriptions and pay special attention to the interpretation of parameters and their estimation. We end with a description of some open problems and challenges for machine learning and statistics.
    Comment: 96 pages, 14 figures, 333 references
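
    To make the notion of parameter interpretation and estimation concrete, here is a minimal sketch for the simplest model in that lineage, the 1959-style random graph in which each edge occurs independently with probability p; the maximum likelihood estimate of p is just the observed edge density. The numbers are simulated for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simplest statistical network model: each of the C(n, 2) possible edges of an
# n-node undirected graph is present independently with probability p.
n, p_true = 200, 0.05
n_pairs = n * (n - 1) // 2
edges = rng.binomial(1, p_true, size=n_pairs)

# The maximum likelihood estimate of p is the observed edge density,
# and (n - 1) * p is the interpretation-friendly expected degree.
p_hat = edges.mean()
mean_degree = 2 * edges.sum() / n
print(f"p_hat = {p_hat:.4f}, mean degree = {mean_degree:.2f} "
      f"(expected {(n - 1) * p_true:.2f})")
```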

    Maximum likelihood estimation of a finite mixture of logistic regression models in a continuous data stream

    In marketing we are often confronted with a continuous stream of responses to marketing messages. Such streaming data provide invaluable information regarding message effectiveness and segmentation. However, streaming data are hard to analyze using conventional methods: their high volume and the fact that they are continuously augmented mean that it takes considerable time to analyze them. We propose a method for estimating a finite mixture of logistic regression models which can be used to cluster customers based on a continuous stream of responses. This method, which we call oFMLR, allows segments to be identified in data streams or extremely large static datasets. Contrary to black-box algorithms, oFMLR provides model estimates that are directly interpretable. We first introduce oFMLR, explaining in passing general topics such as online estimation and the EM algorithm, making this paper a high-level overview of possible methods for dealing with large data streams in marketing practice. Next, we discuss model convergence, identifiability, and relations to alternative Bayesian methods; we also identify more general issues that arise from dealing with continuously augmented datasets. Finally, we introduce the oFMLR [R] package and evaluate the method by numerical simulation and by analyzing a large customer clickstream dataset.
    Comment: 1 figure. Working paper including [R] package
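
    The oFMLR package itself is written in R and its exact update equations are not given in the abstract; the sketch below is a generic online EM for a two-component mixture of logistic regressions, assuming a per-observation E-step (responsibilities) followed by responsibility-weighted stochastic gradient updates and a running update of the mixing weights. The step-size schedule, the simulated stream, and all names are illustrative assumptions, not the package's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Simulated stream: two latent customer segments with different response models.
d, K, T = 3, 2, 50_000                                    # features (incl. intercept), segments, stream length
beta_true = np.array([[2.0, -1.0, 0.5], [-1.5, 1.0, -0.5]])
mix_true = np.array([0.6, 0.4])

# Online EM state: one coefficient vector per segment plus mixing weights.
beta = rng.normal(scale=0.1, size=(K, d))
pi = np.full(K, 1.0 / K)

for t in range(1, T + 1):
    x = np.append(rng.normal(size=d - 1), 1.0)            # covariates + intercept
    k = rng.choice(K, p=mix_true)
    y = rng.binomial(1, sigmoid(x @ beta_true[k]))         # observed response (e.g. a click)

    # E-step for this single observation: responsibility of each segment.
    p = sigmoid(beta @ x)
    lik = np.where(y == 1, p, 1 - p)
    r = pi * lik
    r /= r.sum()

    # M-step: responsibility-weighted stochastic gradient update of each segment's
    # coefficients and a running-average update of the mixing weights.
    lr = 1.0 / np.sqrt(t + 100.0)
    beta += lr * (r * (y - p))[:, None] * x[None, :]
    pi = (1 - lr) * pi + lr * r

print("estimated mixing weights:", np.round(pi, 3))
print("estimated coefficients:\n", np.round(beta, 2))
```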

    Statistical modelling of categorical data under ontic and epistemic imprecision


    Causal inference methods for combining randomized trials and observational studies: a review

    With increasing data availability, causal treatment effects can be evaluated across different datasets, both randomized controlled trials (RCTs) and observational studies. RCTs isolate the effect of the treatment from that of unwanted (confounding) co-occurring effects, but they may suffer from inclusion biases and thus lack external validity. On the other hand, large observational samples are often more representative of the target population but can conflate confounding effects with the treatment of interest. In this paper, we review the growing literature on methods for causal inference on combined RCTs and observational studies, striving for the best of both worlds. We first discuss identification and estimation methods that improve the generalizability of RCTs using the representativeness of observational data. Classical estimators include weighting, the difference between conditional outcome models, and doubly robust estimators. We then discuss methods that combine RCTs and observational data to improve (conditional) average treatment effect estimation, handling possible unmeasured confounding in the observational data. We also connect and contrast work developed in the potential outcomes framework and in the structural causal model framework. Finally, we compare the main methods using a simulation study and real-world data to analyze the effect of tranexamic acid on the mortality rate in major trauma patients. Code to implement many of the methods is provided.
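
    The paper provides its own code; as an independent illustration of the classical estimators mentioned above, the sketch below computes inverse probability weighting and augmented (doubly robust) estimates of an average treatment effect on simulated confounded data. The data-generating values and model choices are assumptions for the example only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(0)

# Simulated observational sample with confounding; every value is hypothetical.
n = 5000
X = rng.normal(size=(n, 3))
e_true = 1 / (1 + np.exp(-(0.8 * X[:, 0] - 0.5 * X[:, 1])))          # true propensity score
A = rng.binomial(1, e_true)                                           # treatment received
Y = X @ np.array([1.0, -1.0, 0.5]) + 1.0 * A + rng.normal(size=n)     # true effect = 1.0

# Nuisance models: estimated propensity score and conditional outcome models.
ps = LogisticRegression().fit(X, A).predict_proba(X)[:, 1]
mu1 = LinearRegression().fit(X[A == 1], Y[A == 1]).predict(X)
mu0 = LinearRegression().fit(X[A == 0], Y[A == 0]).predict(X)

# Doubly robust (augmented inverse probability weighting) estimator of the ATE:
# consistent if either the propensity model or the outcome models are correct.
aipw = np.mean(mu1 - mu0 + A * (Y - mu1) / ps - (1 - A) * (Y - mu0) / (1 - ps))
ipw = np.mean(A * Y / ps - (1 - A) * Y / (1 - ps))
print(f"IPW estimate: {ipw:.3f}, AIPW estimate: {aipw:.3f} (true effect 1.0)")
```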

    Inference for Constrained Estimation of Tumor Size Distributions

    In order to develop better treatment and screening programs for cancer prevention, it is important to understand the natural history of the disease and the factors that affect its progression. We focus on a particular framework first outlined by Kimmel and Flehinger (1991, Biometrics, 47, 987–1004) and, in particular, on one of their limiting scenarios for analysis. Using an equivalence with a binary regression model, we characterize the nonparametric maximum likelihood estimation procedure for the tumor size distribution function and give associated asymptotic results. Extensions to semiparametric models and missing data are also described. Application to data from two cancer studies is used to illustrate the finite-sample behavior of the procedure.
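
    The paper's exact construction is not reproduced in the abstract; assuming the binary-regression equivalence leads to estimating a probability that is monotone in tumor size, one standard way to compute such a constrained nonparametric estimate is isotonic regression via the pool-adjacent-violators algorithm, sketched below on simulated data. This is illustrative only, not the authors' procedure.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(0)

# Hypothetical data: tumor size at detection and a binary progression indicator
# whose probability is assumed to be non-decreasing in size.
size = rng.gamma(shape=2.0, scale=1.5, size=800)
p_true = 1 - np.exp(-0.4 * size)
progressed = rng.binomial(1, p_true)

# Monotone (isotonic) binary regression: the constrained fit computed by the
# pool-adjacent-violators algorithm, clipped to [0, 1].
iso = IsotonicRegression(y_min=0.0, y_max=1.0, increasing=True, out_of_bounds="clip")
iso.fit(size, progressed)

grid = np.linspace(size.min(), size.max(), 5)
print(np.round(iso.predict(grid), 3))   # estimated monotone probability curve
```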

    BayesBinMix: an R Package for Model Based Clustering of Multivariate Binary Data

    The BayesBinMix package offers a Bayesian framework for clustering binary data with or without missing values by fitting mixtures of multivariate Bernoulli distributions with an unknown number of components. It allows the joint estimation of the number of clusters and the model parameters using Markov chain Monte Carlo sampling. Heated chains are run in parallel and accelerate convergence to the target posterior distribution. Identifiability issues are addressed by implementing label-switching algorithms. The package is demonstrated and benchmarked against the Expectation-Maximization algorithm using a simulation study as well as a real dataset.
    Comment: Accepted to the R Journal. The package is available on CRAN: https://CRAN.R-project.org/package=BayesBinMix
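
    The package's own interface is in R; to show the model being fit, the sketch below runs a plain Expectation-Maximization algorithm (the baseline the package is benchmarked against) for a mixture of multivariate Bernoulli distributions with a fixed number of components on simulated binary data. Unlike BayesBinMix, it does not estimate the number of clusters or use MCMC; all values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate binary data from K = 2 clusters over d = 10 items (values are illustrative).
n, d, K = 1000, 10, 2
theta_true = np.vstack([rng.uniform(0.1, 0.4, d), rng.uniform(0.6, 0.9, d)])
z = rng.integers(K, size=n)
X = rng.binomial(1, theta_true[z])

# EM for a mixture of multivariate Bernoulli distributions with K fixed.
theta = rng.uniform(0.3, 0.7, size=(K, d))
pi = np.full(K, 1.0 / K)
for _ in range(200):
    # E-step: responsibilities, computed in log space for numerical stability.
    log_lik = X @ np.log(theta).T + (1 - X) @ np.log(1 - theta).T
    log_r = np.log(pi) + log_lik
    log_r -= log_r.max(axis=1, keepdims=True)
    r = np.exp(log_r)
    r /= r.sum(axis=1, keepdims=True)

    # M-step: mixing weights and per-cluster item probabilities.
    nk = r.sum(axis=0)
    pi = nk / n
    theta = np.clip((r.T @ X) / nk[:, None], 1e-4, 1 - 1e-4)

print("mixing weights:", np.round(pi, 3))
print("item probabilities:\n", np.round(theta, 2))
```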

    Estimating Sizes of Key Populations at the National Level: Considerations for Study Design and Analysis.

    BACKGROUND: National estimates of the sizes of key populations, including female sex workers, men who have sex with men, and transgender women, are critical to inform national and international responses to the HIV pandemic. However, epidemiologic studies typically provide size estimates for only a limited set of high-priority geographic areas. This article illustrates a two-stage approach to obtaining a national key population size estimate in the Dominican Republic using available estimates and publicly available contextual information. METHODS: Available estimates of key population size in priority areas were augmented with targeted additional data collection in other areas. To combine information from data collected at each stage, we used statistical methods for handling missing data, including inverse probability weights, multiple imputation, and augmented inverse probability weights. RESULTS: Using the augmented inverse probability weighting approach, which provides some protection against parametric model misspecification, we estimated that 3.7% (95% CI = 2.9, 4.7) of the total population of women in the Dominican Republic between the ages of 15 and 49 years were engaged in sex work, 1.2% (95% CI = 1.1, 1.3) of men aged 15–49 had sex with other men, and 0.19% (95% CI = 0.17, 0.21) of people assigned male sex at birth were transgender. CONCLUSIONS: Viewing the size estimation of key populations as a missing data problem provides a framework for articulating and evaluating the assumptions necessary to obtain a national size estimate. In addition, this paradigm allows the use of methods for missing data that are familiar to epidemiologists.
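
    To illustrate the missing data view described in the conclusions, the sketch below treats area-level key population proportions as observed only where a study was done and compares naive, inverse probability weighted, and augmented (doubly robust) estimates of the national proportion. The data, covariates, and models are fabricated for illustration and are unrelated to the Dominican Republic estimates above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(1)

# Hypothetical area-level data: key population proportions observed only in
# areas where a size-estimation study was conducted (missing at random given Z).
n = 400
Z = rng.normal(size=(n, 2))                                        # contextual covariates
prop = np.clip(0.03 + 0.01 * Z[:, 0] + rng.normal(0, 0.005, n), 1e-3, None)
study_prob = 1 / (1 + np.exp(-(0.5 + 1.2 * Z[:, 0])))              # priority areas studied more often
R = rng.binomial(1, study_prob)                                    # 1 = estimate available

# Nuisance models: probability an area was studied, and the outcome given covariates.
pi_hat = LogisticRegression().fit(Z, R).predict_proba(Z)[:, 1]
m_hat = LinearRegression().fit(Z[R == 1], prop[R == 1]).predict(Z)

naive = prop[R == 1].mean()                                        # biased: studied areas differ
ipw = np.mean(R * prop / pi_hat) / np.mean(R / pi_hat)             # inverse probability weighting
aipw = np.mean(m_hat + R * (prop - m_hat) / pi_hat)                # augmented IPW (doubly robust)
print(f"naive {naive:.4f}  IPW {ipw:.4f}  AIPW {aipw:.4f}  truth {prop.mean():.4f}")
```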