6 research outputs found
The Hunting of the Bump: On Maximizing Statistical Discrepancy
Anomaly detection has important applications in biosurveillance and
environmental monitoring. When comparing measured data to data drawn from a
baseline distribution, merely finding clusters in the measured data may not
reveal true anomalies, because these clusters may simply be the clusters of
the baseline distribution. Hence, a discrepancy function is often used to
examine how different the measured data are from the baseline data within a
region. An anomalous region is thus defined to be one with high discrepancy.
In this paper, we present algorithms for maximizing statistical discrepancy
functions over the space of axis-parallel rectangles. We give provable
approximation guarantees, both additive and relative, and our methods apply to
any convex discrepancy function. Our algorithms work by connecting statistical
discrepancy to combinatorial discrepancy; roughly speaking, we show that in
order to maximize a convex discrepancy function over a class of shapes, one
needs only maximize a linear discrepancy function over the same set of shapes.
We derive general discrepancy functions for data generated from a
one-parameter exponential family. This generalizes the widely used Kulldorff scan
statistic for data from a Poisson distribution. We present an algorithm that
computes the maximum discrepancy rectangle to within a prescribed additive
error for the Kulldorff
scan statistic. Similar results hold for relative error and for discrepancy
functions for data coming from Gaussian, Bernoulli, and gamma distributions.
Prior to our work, the best known algorithms were exact.
Comment: 11 pages. A short version of this paper will appear in SODA06. This
full version contains an additional short appendix.
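For orientation, here is a minimal brute-force sketch of the problem the abstract describes. It is not the paper's approximation algorithm. It assumes the common form of the Kulldorff discrepancy, d(m, b) = m ln(m/b) + (1 - m) ln((1 - m)/(1 - b)) when m > b and 0 otherwise, where m and b are the fractions of measured and baseline points that fall inside a rectangle. The function names, the point-list input format, and the decision to score rectangles containing no baseline points as 0 are illustrative assumptions.

```python
import math
from itertools import combinations_with_replacement


def _xlog(x, y):
    """Return x * ln(x / y), using the limit value 0 when x == 0."""
    return 0.0 if x == 0.0 else x * math.log(x / y)


def kulldorff(m, b):
    """Kulldorff (Poisson) discrepancy for measured fraction m, baseline fraction b.

    Assumption: regions with no excess (m <= b) or no baseline mass (b == 0)
    are scored 0 rather than treated as infinitely anomalous.
    """
    if m <= b or b <= 0.0:
        return 0.0
    return _xlog(m, b) + _xlog(1.0 - m, 1.0 - b)


def fraction_inside(points, rect):
    """Fraction of points inside the closed rectangle (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = rect
    hits = sum(1 for (x, y) in points if x1 <= x <= x2 and y1 <= y <= y2)
    return hits / len(points)


def max_discrepancy_rectangle(measured, baseline):
    """Exhaustively score every axis-parallel rectangle spanned by the data.

    Only rectangles whose sides pass through point coordinates are tried.
    This is far slower than the methods the abstract refers to and is meant
    only to make the objective concrete on tiny inputs.
    """
    points = measured + baseline
    xs = sorted({p[0] for p in points})
    ys = sorted({p[1] for p in points})
    best_score, best_rect = 0.0, None
    for x1, x2 in combinations_with_replacement(xs, 2):
        for y1, y2 in combinations_with_replacement(ys, 2):
            rect = (x1, y1, x2, y2)
            score = kulldorff(fraction_inside(measured, rect),
                              fraction_inside(baseline, rect))
            if score > best_score:
                best_score, best_rect = score, rect
    return best_score, best_rect
```

Calling `max_discrepancy_rectangle(measured, baseline)` with two lists of `(x, y)` tuples returns the best score and one rectangle that attains it; ties are resolved by enumeration order.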
Detection with the scan and the average likelihood ratio
We investigate the performance of the scan (maximum likelihood ratio
statistic) and of the average likelihood ratio statistic in the problem of
detecting a deterministic signal with unknown spatial extent in the
prototypical univariate sampled data model with white Gaussian noise. Our
results show that the scan statistic, a popular tool for detection problems, is
optimal only for the detection of signals with the smallest spatial extent. For
signals with larger spatial extent the scan is suboptimal, and the power loss
can be considerable. In contrast, the average likelihood ratio statistic is
optimal for the detection of signals on all scales except the smallest ones,
where its performance is only slightly suboptimal. We give rigorous
mathematical statements of these results as well as heuristic explanations
which suggest that the essence of these findings applies to detection problems
quite generally, such as the detection of clusters in models involving
densities or intensities or the detection of multivariate signals. We present a
modification of the average likelihood ratio that yields optimal detection of
signals with arbitrary spatial extent and which has the additional benefit of
allowing for a fast computation of the statistic. In contrast, optimal
detection with the scan seems to require the use of scale-dependent critical
values.
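The two statistics compared in the abstract above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: observations are taken to be y_i = signal_i + noise_i with standard Gaussian white noise, each interval I is scored by the standardized sum Y(I) = sum of y_i over I divided by sqrt(|I|), the scan is the one-sided maximum of Y(I), and the average likelihood ratio averages exp(Y(I)^2 / 2) over all intervals. The function names and the use of all O(n^2) intervals are illustrative choices; the fast modified statistic mentioned in the abstract is not reproduced here.

```python
import math


def interval_statistics(y):
    """Standardized sums Y(I) = sum_{i in I} y_i / sqrt(|I|) for every interval I."""
    n = len(y)
    prefix = [0.0]
    for value in y:
        prefix.append(prefix[-1] + value)
    return [
        (prefix[j] - prefix[i]) / math.sqrt(j - i)
        for i in range(n)
        for j in range(i + 1, n + 1)
    ]


def scan_statistic(y):
    """Scan: the largest standardized interval sum (one-sided, an assumption)."""
    return max(interval_statistics(y))


def average_likelihood_ratio(y):
    """Average over all intervals of the maximized likelihood ratio exp(Y(I)^2 / 2)."""
    stats = interval_statistics(y)
    return sum(math.exp(0.5 * s * s) for s in stats) / len(stats)
```

The scan keeps only the single best interval, while the average pools evidence from all intervals; that difference is what the abstract's optimality comparison across signal scales is about.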
Early Detection of Tuberculosis Outbreaks among the San Francisco Homeless: Trade-Offs Between Spatial Resolution and Temporal Scale
BACKGROUND: San Francisco has the highest rate of tuberculosis (TB) in the U.S., with recurrent outbreaks among the homeless and marginally housed. It has been shown for syndromic data that when exact geographic coordinates of individual patients are used as the spatial base for outbreak detection, higher detection rates and accuracy are achieved than when data are aggregated into administrative regions such as zip codes and census tracts. We examine the effect of varying the spatial resolution in the TB data within the San Francisco homeless population on detection sensitivity, timeliness, and the amount of historical data needed to achieve better performance measures.
METHODS AND FINDINGS: We apply a variation of the space-time permutation scan statistic to the TB data, in which a patient's location is represented either by its exact coordinates or by the centroid of its census tract. We show that the detection sensitivity and timeliness of the method generally improve when exact locations are used to identify real TB outbreaks. When outbreaks are simulated, detection timeliness is consistently improved when exact coordinates are used, whereas detection sensitivity varies depending on the size of the spatial scanning window and the number of tracts in which cases are simulated. Finally, we show that when exact locations are used, a smaller amount of historical data is required for training the model.
CONCLUSION: Systematic characterization of the spatio-temporal distribution of TB cases can broadly benefit real-time surveillance and guide public health investigations of TB outbreaks as to what level of spatial resolution results in improved detection sensitivity and timeliness. Trading higher spatial resolution for better performance is ultimately a tradeoff between maintaining patient confidentiality and improving public health when sharing data. Understanding such tradeoffs is critical to managing the complex interplay between public policy and public health. This study is a step forward in this direction.
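To make the method named in the abstract above concrete, here is a minimal sketch of the standard space-time permutation scan statistic, not the variation the authors apply. It assumes the usual formulation: the expected count for a zone and day is (cases in that zone) x (cases on that day) / (total cases), and a space-time cylinder is scored by a Poisson-style generalized likelihood ratio on observed versus expected counts. The `(zone, day)` input format and the function names are hypothetical, and the Monte Carlo permutation step used to assess significance is omitted.

```python
import math
from collections import Counter


def expected_counts(cases):
    """Expected cases per (zone, day) under space-time independence.

    cases: list of (zone, day) pairs, one entry per case.
    """
    total = len(cases)
    by_zone = Counter(zone for zone, _ in cases)
    by_day = Counter(day for _, day in cases)
    return {
        (zone, day): by_zone[zone] * by_day[day] / total
        for zone in by_zone
        for day in by_day
    }


def log_glr(cases, zones, days):
    """Log generalized likelihood ratio for the cylinder zones x days.

    Returns 0.0 when the cylinder holds no more cases than expected.
    """
    total = len(cases)
    mu = expected_counts(cases)
    observed = sum(1 for zone, day in cases if zone in zones and day in days)
    expected = sum(v for (zone, day), v in mu.items()
                   if zone in zones and day in days)
    if observed <= expected:
        return 0.0
    score = observed * math.log(observed / expected)
    if total > observed:
        score += (total - observed) * math.log(
            (total - observed) / (total - expected))
    return score
```

With exact coordinates, each `zone` would be replaced by membership in a circular window around a location; with aggregated data, it is a census tract whose cases are all placed at the tract centroid. That substitution is the resolution trade-off the abstract studies.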