Learning with many experts: model selection and sparsity
Experts classifying data are often imprecise. Recently, several models have
been proposed to train classifiers using the noisy labels generated by these
experts. How to choose between these models? In such situations, the true
labels are unavailable. Thus, one cannot perform model selection using the
standard versions of methods such as empirical risk minimization and cross
validation. In order to allow model selection, we present a surrogate loss and
provide theoretical guarantees that assure its consistency. Next, we discuss
how this loss can be used to tune a penalization which introduces sparsity in
the parameters of a traditional class of models. Sparsity provides more
parsimonious models and can avoid overfitting. Nevertheless, it has seldom been
discussed in the context of noisy labels due to the difficulty in model
selection and, therefore, in choosing tuning parameters. We apply these
techniques to several sets of simulated and real data.
Comment: This is the pre-peer reviewed version.
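The paper's surrogate loss is not reproduced in this abstract. As a hedged illustration of why model selection remains possible without true labels, the sketch below uses the classical unbiased loss correction for symmetric label noise (a stand-in technique, not necessarily the authors' loss): with a known flip rate ρ < 1/2, the corrected empirical risk computed on noisy labels is an unbiased estimate of the clean risk. The noise rate and classifier here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
rho = 0.2  # hypothetical symmetric label-flip rate, assumed known (< 1/2)

def corrected_loss(pred, noisy_y, rho):
    """Unbiased correction of the 0-1 loss: its expectation over the
    label noise equals the loss against the unobserved clean label."""
    base = (pred != noisy_y).astype(float)   # loss vs. noisy label
    flip = (pred != -noisy_y).astype(float)  # loss vs. flipped label
    return ((1 - rho) * base - rho * flip) / (1 - 2 * rho)

# clean labels in {-1, +1} and a classifier that is 70% accurate
y = rng.choice([-1, 1], size=200_000)
pred = np.where(rng.random(y.size) < 0.7, y, -y)

# experts' noisy labels: each clean label flipped with probability rho
noisy = np.where(rng.random(y.size) < rho, -y, y)

clean_risk = (pred != y).mean()                           # unobservable
surrogate_risk = corrected_loss(pred, noisy, rho).mean()  # computable
print(round(clean_risk, 3), round(surrogate_risk, 3))
```

The same corrected risk, evaluated on held-out noisy labels, can then serve as the cross-validation criterion when tuning a sparsity penalty.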
Pharmacoepidemiol Drug Saf
Purpose: To estimate the accuracy of two algorithms to identify cholecystectomy procedures using International Classification of Diseases, 9th Edition, Clinical Modification (ICD-9-CM) and Current Procedural Terminology (CPT-4) codes in administrative data.
Methods: Private insurer medical claims for 30,853 patients 18–64 years with an inpatient hospitalization between 2006 and 2010, as indicated by providers'/facilities' place of service in addition to room and board charges, were cross-classified according to the presence of codes for cholecystectomy. The accuracy of ICD-9-CM- and CPT-4-based algorithms was estimated using a Bayesian latent class model.
Results: The sensitivity and specificity were 0.92 [probability interval (PI): 0.92, 0.92] and 0.99 (PI: 0.97, 0.99) for ICD-9-CM-, and 0.93 (PI: 0.92, 0.93) and 0.99 (PI: 0.97, 0.99) for CPT-4-based algorithms, respectively. The parallel-joint scheme, where positivity of either algorithm was considered a positive outcome, yielded a sensitivity and specificity of 0.99 (PI: 0.99, 0.99) and 0.97 (PI: 0.95, 0.99), respectively.
Conclusions: Both ICD-9-CM- and CPT-4-based algorithms had high sensitivity to identify cholecystectomy procedures in administrative data when used individually and especially in a parallel-joint approach.
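For intuition, the parallel-joint figures are close to what a simple conditional-independence calculation gives from the individual point estimates (an assumption used only for this sketch; the paper's estimates come from the Bayesian latent class model):

```python
se_icd, se_cpt = 0.92, 0.93  # individual sensitivities from the abstract
sp_icd, sp_cpt = 0.99, 0.99  # individual specificities

# Parallel-joint rule: positive if either algorithm is positive.
se_joint = 1 - (1 - se_icd) * (1 - se_cpt)  # misses only if both miss
sp_joint = sp_icd * sp_cpt                  # negative only if both are
print(round(se_joint, 3), round(sp_joint, 3))  # 0.994 0.98
```

This matches the reported joint sensitivity of 0.99; the slightly lower model-based specificity of 0.97 hints at some conditional dependence between the two coding systems.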
A survey of statistical network models
Networks are ubiquitous in science and have become a focal point for
discussion in everyday life. Formal statistical models for the analysis of
network data have emerged as a major topic of interest in diverse areas of
study, and most of these involve a form of graphical representation.
Probability models on graphs date back to 1959. Along with empirical studies in
social psychology and sociology from the 1960s, these early works generated an
active network community and a substantial literature in the 1970s. This effort
moved into the statistical literature in the late 1970s and 1980s, and the past
decade has seen a burgeoning network literature in statistical physics and
computer science. The growth of the World Wide Web and the emergence of online
networking communities such as Facebook, MySpace, and LinkedIn, and a host of
more specialized professional network communities has intensified interest in
the study of networks and network data. Our goal in this review is to provide
the reader with an entry point to this burgeoning literature. We begin with an
overview of the historical development of statistical network modeling and then
we introduce a number of examples that have been studied in the network
literature. Our subsequent discussion focuses on a number of prominent static
and dynamic network models and their interconnections. We emphasize formal
model descriptions, and pay special attention to the interpretation of
parameters and their estimation. We end with a description of some open
problems and challenges for machine learning and statistics.
Comment: 96 pages, 14 figures, 333 references
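The 1959 probability models on graphs referenced above are the Erdős–Rényi/Gilbert random graphs; a minimal sketch of sampling G(n, p), with illustrative values of n and p:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 200, 0.1  # illustrative values

# G(n, p): each of the n(n-1)/2 possible edges appears independently
# with probability p; sample the strict upper triangle and symmetrize.
upper = np.triu(rng.random((n, n)) < p, k=1)
adj = upper | upper.T

n_edges = int(adj.sum()) // 2
expected = p * n * (n - 1) / 2  # 1990.0 for these values
print(n_edges, expected)
```

Many of the static models discussed in such surveys (stochastic block models, exponential random graphs) can be read as structured departures from this independent-edge baseline.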
Maximum likelihood estimation of a finite mixture of logistic regression models in a continuous data stream
In marketing we are often confronted with a continuous stream of responses to
marketing messages. Such streaming data provide invaluable information
regarding message effectiveness and segmentation. However, streaming data are
hard to analyze using conventional methods: their high volume and the fact that
they are continuously augmented means that it takes considerable time to
analyze them. We propose a method for estimating a finite mixture of logistic
regression models which can be used to cluster customers based on a continuous
stream of responses. This method, which we coin oFMLR, allows segments to be
identified in data streams or extremely large static datasets. Contrary to
black box algorithms, oFMLR provides model estimates that are directly
interpretable. We first introduce oFMLR, explaining in passing general topics
such as online estimation and the EM algorithm, making this paper a high level
overview of possible methods of dealing with large data streams in marketing
practice. Next, we discuss model convergence, identifiability, and relations to
alternative, Bayesian, methods; we also identify more general issues that arise
from dealing with continuously augmented data sets. Finally, we introduce the
oFMLR [R] package and evaluate the method by numerical simulation and by
analyzing a large customer clickstream dataset.
Comment: 1 figure. Working paper including [R] package.
Causal inference methods for combining randomized trials and observational studies: a review
With increasing data availability, causal treatment effects can be evaluated
across different datasets, both randomized controlled trials (RCTs) and
observational studies. RCTs isolate the effect of the treatment from that of
unwanted (confounding) co-occurring effects. But they may struggle with
inclusion biases, and thus lack external validity. On the other hand, large
observational samples are often more representative of the target population
but can conflate confounding effects with the treatment of interest. In this
paper, we review the growing literature on methods for causal inference on
combined RCTs and observational studies, striving for the best of both worlds.
We first discuss identification and estimation methods that improve
generalizability of RCTs using the representativeness of observational data.
Classical estimators include weighting, difference between conditional outcome
models, and doubly robust estimators. We then discuss methods that combine RCTs
and observational data to improve (conditional) average treatment effect
estimation, handling possible unmeasured confounding in the observational data.
We also connect and contrast works developed in both the potential outcomes
framework and the structural causal model framework. Finally, we compare the
main methods using a simulation study and real world data to analyze the effect
of tranexamic acid on the mortality rate in major trauma patients. Code to
implement many of the methods is provided.
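As a toy version of the first family of methods (generalizing an RCT by weighting), the sketch below reweights trial units by a target-to-trial density ratio. The populations, the effect model, and the known densities are all assumptions of this illustration:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical setup: the conditional effect is CATE(x) = x; the target
# population has x ~ N(1, 1) (population ATE = 1), while the trial
# over-samples x ~ N(0, 1) (in-trial ATE = 0): an inclusion bias.
n = 5000
x = rng.normal(0.0, 1.0, n)                  # trial covariate
a = rng.integers(2, size=n)                  # randomized treatment
y = 1.0 + x * a + rng.normal(0, 1, n)        # outcome with CATE(x) = x

naive = y[a == 1].mean() - y[a == 0].mean()  # estimates trial ATE (~0)

# IPW for generalizability: weight each trial unit by the density ratio
# p_target(x) / p_trial(x); here both normals are taken as known (in
# practice the ratio is estimated, e.g. via a selection model).
w = np.exp(-0.5 * ((x - 1.0) ** 2 - x ** 2))
ipw = (np.average(y[a == 1], weights=w[a == 1])
       - np.average(y[a == 0], weights=w[a == 0]))

print(round(naive, 2), round(ipw, 2))  # ipw should be near 1.0
```

Replacing the weighted means with a combination of outcome models and weights gives the doubly robust variants mentioned above.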
Inference for Constrained Estimation of Tumor Size Distributions
In order to develop better treatment and screening programs for cancer prevention, it is important to understand the natural history of the disease and the factors that affect its progression. We focus on a particular framework first outlined by Kimmel and Flehinger (1991, Biometrics, 47, 987–1004) and, in particular, one of their limiting scenarios for analysis. Using an equivalence with a binary regression model, we characterize the nonparametric maximum likelihood estimation procedure for estimation of the tumor size distribution function and give associated asymptotic results. Extensions to semiparametric models and missing data are also described. Application to data from two cancer studies is used to illustrate the finite-sample behavior of the procedure.
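In a monotone binary-regression formulation like the one invoked here, the nonparametric MLE of a nondecreasing success probability is the isotonic regression of the empirical rates, computed by the pool-adjacent-violators algorithm. A minimal sketch with unit weights and toy rates (the connection to this paper's specific scenario, and the data, are assumptions of the illustration):

```python
def pava(values, weights):
    """Pool adjacent violators: weighted isotonic (nondecreasing) fit.
    With unit weights this is the NPMLE of a nondecreasing success
    probability given empirical rates ordered by the covariate."""
    blocks = [[v, float(w)] for v, w in zip(values, weights)]
    i = 0
    while i < len(blocks) - 1:
        if blocks[i][0] > blocks[i + 1][0] + 1e-12:  # violation: pool
            v1, w1 = blocks[i]
            v2, w2 = blocks[i + 1]
            blocks[i] = [(w1 * v1 + w2 * v2) / (w1 + w2), w1 + w2]
            del blocks[i + 1]
            i = max(i - 1, 0)                        # re-check backwards
        else:
            i += 1
    fit = []
    for v, w in blocks:
        fit.extend([v] * int(w))  # expand back, assuming integer weights
    return fit

# toy empirical response rates at increasing tumor sizes
fit = pava([0.1, 0.3, 0.2, 0.6, 0.5, 0.9], [1] * 6)
print([round(v, 2) for v in fit])  # [0.1, 0.25, 0.25, 0.55, 0.55, 0.9]
```

Violating adjacent rates are averaged into constant blocks, which is where the characteristic flat stretches of the estimated distribution function come from.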
BayesBinMix: an R Package for Model Based Clustering of Multivariate Binary Data
The BayesBinMix package offers a Bayesian framework for clustering binary
data with or without missing values by fitting mixtures of multivariate
Bernoulli distributions with an unknown number of components. It allows the
joint estimation of the number of clusters and model parameters using Markov
chain Monte Carlo sampling. Heated chains are run in parallel and accelerate
the convergence to the target posterior distribution. Identifiability issues
are addressed by implementing label switching algorithms. The package is
demonstrated and benchmarked against the Expectation-Maximization algorithm
using a simulation study as well as a real dataset.
Comment: Accepted to the R Journal. The package is available on CRAN: https://CRAN.R-project.org/package=BayesBinMix
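For comparison with the EM benchmark mentioned above (BayesBinMix itself is MCMC-based with heated chains and label-switching corrections), a minimal batch-EM sketch for a mixture of multivariate Bernoullis on simulated data; all settings are illustrative:

```python
import numpy as np

rng = np.random.default_rng(4)

# simulate binary data from two well-separated Bernoulli components
n, d = 2000, 6
true_theta = np.array([[0.9] * d, [0.1] * d])
z = rng.integers(2, size=n)
X = (rng.random((n, d)) < true_theta[z]).astype(float)

K = 2
theta = rng.uniform(0.3, 0.7, size=(K, d))  # success probabilities
pi = np.full(K, 1.0 / K)                    # mixing weights

for _ in range(100):
    # E-step: responsibilities via stabilized log-likelihoods
    log_lik = X @ np.log(theta).T + (1 - X) @ np.log(1 - theta).T
    log_r = np.log(pi) + log_lik
    log_r -= log_r.max(axis=1, keepdims=True)
    r = np.exp(log_r)
    r /= r.sum(axis=1, keepdims=True)
    # M-step: weighted proportions and component-wise means
    nk = r.sum(axis=0)
    pi = nk / n
    theta = np.clip((r.T @ X) / nk[:, None], 1e-6, 1 - 1e-6)

print(np.round(np.sort(theta[:, 0]), 2), np.round(pi, 2))
```

Unlike this fixed-K EM, the package treats the number of components as unknown and samples it jointly with the parameters.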
Estimating Sizes of Key Populations at the National Level: Considerations for Study Design and Analysis.
BACKGROUND: National estimates of the sizes of key populations, including female sex workers, men who have sex with men, and transgender women, are critical to inform national and international responses to the HIV pandemic. However, epidemiologic studies typically provide size estimates for only limited high-priority geographic areas. This article illustrates a two-stage approach to obtain a national key population size estimate in the Dominican Republic using available estimates and publicly available contextual information.
METHODS: Available estimates of key population size in priority areas were augmented with targeted additional data collection in other areas. To combine information from data collected at each stage, we used statistical methods for handling missing data, including inverse probability weights, multiple imputation, and augmented inverse probability weights.
RESULTS: Using the augmented inverse probability weighting approach, which provides some protection against parametric model misspecification, we estimated that 3.7% (95% CI = 2.9, 4.7) of the total population of women in the Dominican Republic between the ages of 15 and 49 years were engaged in sex work, 1.2% (95% CI = 1.1, 1.3) of men aged 15-49 had sex with other men, and 0.19% (95% CI = 0.17, 0.21) of people assigned the male sex at birth were transgender.
CONCLUSIONS: Viewing the size estimation of key populations as a missing data problem provides a framework for articulating and evaluating the assumptions necessary to obtain a national size estimate. In addition, this paradigm allows use of methods for missing data familiar to epidemiologists.
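As a toy version of the augmented IPW idea (the estimator family named in the abstract, not this study's actual two-stage design), the sketch below estimates a population mean when the outcome is observed only with covariate-dependent probability; the data model and the known propensity are assumptions of this illustration:

```python
import numpy as np

rng = np.random.default_rng(5)

# population mean of y is 2.0; y is observed with probability depending
# on a covariate x (missing at random given x)
n = 50_000
x = rng.normal(size=n)
y = 2.0 + x + rng.normal(size=n)
p_obs = 1.0 / (1.0 + np.exp(-x))  # observation propensity, known here
obs = rng.random(n) < p_obs

# outcome model fitted to observed units (ordinary least squares on x)
Xo = np.column_stack([np.ones(obs.sum()), x[obs]])
coef, *_ = np.linalg.lstsq(Xo, y[obs], rcond=None)
m = coef[0] + coef[1] * x         # prediction for every unit

# AIPW: model prediction plus an inverse-probability-weighted residual
# correction; y contributes only where obs is True
aipw = np.mean(m + obs * (y - m) / p_obs)

complete_case = y[obs].mean()     # biased upward: large-x units observed
print(round(complete_case, 2), round(aipw, 2))  # aipw near 2.0
```

The estimator is consistent if either the outcome model or the observation propensities are correctly specified, which is the "protection against parametric model misspecification" the abstract refers to.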