Multi-dimensional Point Process Models in R
A software package for fitting and assessing multi-dimensional point process models using the R statistical computing environment is described. Methods of residual analysis based on random thinning are discussed and implemented. Features of the software are demonstrated using data on wildfire occurrences in Northern Los Angeles County, California.
Predicting Patient No-Shows in Community Health Clinics: A Case Study in Designing a Data Analytic Product
The data science revolution has highlighted the varying roles that data
analytic products can play in different industries and applications. There
has been particular interest in using analytic products coupled with
algorithmic prediction models to aid in human decision-making. However,
detailed descriptions of the decision-making process that leads to the design
and development of analytic products are lacking in the statistical literature,
making it difficult to accumulate a body of knowledge where students interested
in the field of data science may look to learn about this process. In this
paper, we present a case study describing the development of an analytic
product for predicting whether patients will show up for scheduled appointments
at a community health clinic. We consider the stakeholders involved and their
interests, along with the real-world analytical and technical trade-offs
involved in developing and deploying the product. Our goal here is to highlight
the decisions made and evaluate them in the context of possible alternatives.
We find that although this case study has some unique characteristics, there
are lessons to be learned that could translate to other settings and
applications.
Caching and Visualizing Statistical Analyses
We present the cacher and CodeDepends packages for R, which provide tools for (1) caching and analyzing the code for statistical analyses and (2) distributing these analyses to others in an efficient manner over the web. The cacher package takes objects created by evaluating R expressions and stores them in key-value databases. These databases of cached objects can subsequently be assembled into “cache packages” for distribution over the web. The cacher package also provides tools to help readers examine the data and code in a statistical analysis and reproduce, modify, or improve upon the results. In addition, readers can easily conduct alternate analyses of the data. The CodeDepends package provides complementary tools for analyzing and visualizing the code for a statistical analysis, and this functionality has been integrated into the cacher package. In this chapter we describe the cacher and CodeDepends packages and provide examples of how they can be used for reproducible research.
Spatial Misalignment in time series studies of air pollution and health data
Time series studies of environmental exposures often involve comparing daily changes in a toxicant measured at a point in space with daily changes in an aggregate measure of health. Spatial misalignment of the exposure and response variables can bias the estimation of health risk, and the magnitude of this bias depends on the spatial variation of the exposure of interest. In air pollution epidemiology, there is an increasing focus on estimating the health effects of the chemical components of particulate matter. One issue that is raised by this new focus is the spatial misalignment error introduced by the lack of spatial homogeneity in many of the particulate matter components. Current approaches to estimating short-term health risks via time series modeling do not take into account the spatial properties of the chemical components and therefore could result in biased estimation of those risks. We present a spatial-temporal statistical model for quantifying spatial misalignment error and show how adjusted health risk estimates can be obtained using a regression calibration approach and a two-stage Bayesian model. We apply our methods to a database containing information on hospital admissions, air pollution, and weather for 20 large urban counties in the United States.
Modeling Data Analytic Iteration With Probabilistic Outcome Sets
In 1977 John Tukey described how in exploratory data analysis, data analysts
use tools, such as data visualizations, to separate their expectations from
what they observe. In contrast to statistical theory, an underappreciated
aspect of data analysis is that a data analyst must make decisions by comparing
the observed data or output from a statistical tool to what the analyst
previously expected from the data. However, there is little formal guidance for
how to make these data analytic decisions as statistical theory generally omits
a discussion of who is using these statistical methods. In this paper, we
propose a model for the iterative process of data analysis based on the
analyst's expectations, using what we refer to as expected and anomaly
probabilistic outcome sets, and the concept of statistical information gain.
Here, we extend the basic idea of comparing an analyst's expectations to what
is observed in a data visualization to more general analytic situations. Our
model posits that the analyst's goal is to increase the amount of information
the analyst has relative to what the analyst already knows, through successive
analytic iterations. We introduce two criteria--expected information gain and
anomaly information gain--to provide guidance about analytic decision-making
and ultimately to improve the practice of data analysis. Finally, we show how
our framework can be used to characterize common situations in practical data
analysis. Comment: 30 pages