20,334 research outputs found
The Hunting of the Bump: On Maximizing Statistical Discrepancy
Anomaly detection has important applications in biosurveillance and
environmental monitoring. When comparing measured data to data drawn from a
baseline distribution, merely finding clusters in the measured data may not
reveal true anomalies: these clusters may simply be clusters of
the baseline distribution. Hence, a discrepancy function is often used to
examine how different measured data is from baseline data within a region. An
anomalous region is thus defined to be one with high discrepancy.
In this paper, we present algorithms for maximizing statistical discrepancy
functions over the space of axis-parallel rectangles. We give provable
approximation guarantees, both additive and relative, and our methods apply to
any convex discrepancy function. Our algorithms work by connecting statistical
discrepancy to combinatorial discrepancy; roughly speaking, we show that in
order to maximize a convex discrepancy function over a class of shapes, one
needs only maximize a linear discrepancy function over the same set of shapes.
We derive general discrepancy functions for data generated from a one-
parameter exponential family. This generalizes the widely-used Kulldorff scan
statistic for data from a Poisson distribution. We present an algorithm running
in that computes the maximum
discrepancy rectangle to within additive error , for the Kulldorff
scan statistic. Similar results hold for relative error and for discrepancy
functions for data coming from Gaussian, Bernoulli, and gamma distributions.
Prior to our work, the best known algorithms were exact and ran in time
.
Comment: 11 pages. A short version of this paper will appear in SODA06. This
full version contains an additional short appendi
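
To make the notion of a discrepancy function concrete, here is a minimal sketch of the Kulldorff scan statistic evaluated by brute force over all axis-parallel rectangles of a small grid. It assumes the usual form of the statistic, d(m, b) = m ln(m/b) + (1 - m) ln((1 - m)/(1 - b)) when m > b and 0 otherwise, where m and b are the fractions of measured and baseline mass inside the region. The function names and the toy grids are hypothetical, and the exhaustive search is only an illustration, not the approximation algorithm the abstract describes.

```python
import math

def kulldorff(m, b):
    """Kulldorff discrepancy for measured fraction m and baseline fraction b."""
    # Only regions with more measured mass than baseline mass are anomalous.
    # Degenerate baseline fractions are mapped to 0 here as a simplification.
    if m <= b or b <= 0.0 or b >= 1.0:
        return 0.0
    if m >= 1.0:
        return math.log(1.0 / b)
    return m * math.log(m / b) + (1 - m) * math.log((1 - m) / (1 - b))

def max_discrepancy_rectangle(measured, baseline):
    """Exhaustively search all axis-parallel rectangles of a grid of counts.

    Returns (best discrepancy, (row1, col1, row2, col2)) with inclusive bounds.
    """
    rows, cols = len(measured), len(measured[0])
    total_m = sum(map(sum, measured))
    total_b = sum(map(sum, baseline))
    best_value, best_rect = 0.0, None
    for r1 in range(rows):
        for r2 in range(r1, rows):
            for c1 in range(cols):
                for c2 in range(c1, cols):
                    cells = [(r, c) for r in range(r1, r2 + 1)
                                    for c in range(c1, c2 + 1)]
                    m = sum(measured[r][c] for r, c in cells) / total_m
                    b = sum(baseline[r][c] for r, c in cells) / total_b
                    value = kulldorff(m, b)
                    if value > best_value:
                        best_value, best_rect = value, (r1, c1, r2, c2)
    return best_value, best_rect

# Toy data (hypothetical): a uniform baseline and a 2x2 block of excess counts.
baseline = [[10] * 4 for _ in range(4)]
measured = [row[:] for row in baseline]
for r in (1, 2):
    for c in (1, 2):
        measured[r][c] = 30

best_value, best_rect = max_discrepancy_rectangle(measured, baseline)
print(best_rect, round(best_value, 4))
```

On this toy input the search recovers the planted 2x2 block. The exhaustive loop is far too slow for real data, which is the cost the approximation algorithms in the abstract aim to reduce.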
On the Catalyzing Effect of Randomness on the Per-Flow Throughput in Wireless Networks
This paper investigates the throughput capacity of a flow crossing a
multi-hop wireless network, whose geometry is characterized by general
randomness laws including uniform, Poisson, and heavy-tailed distributions for both
the nodes' densities and the number of hops. The key contribution is to
demonstrate how the per-flow throughput depends on the
distribution of 1) the number of nodes inside hops' interference sets, 2)
the number of hops , and 3) the degree of spatial correlations. The
randomness in both 's and is advantageous, i.e., it can yield larger
scalings (as large as ) than in non-random settings. An interesting
consequence is that the per-flow capacity can exhibit the opposite behavior to
the network capacity, which was shown to suffer from a logarithmic decrease in
the presence of randomness. In turn, spatial correlations along the end-to-end
path are detrimental by a logarithmic term
An algorithm for constrained one-step inversion of spectral CT data
We develop a primal-dual algorithm that allows for one-step inversion of
spectral CT transmission photon counts data to a basis map decomposition. The
algorithm allows for image constraints to be enforced on the basis maps during
the inversion. The derivation of the algorithm makes use of a local upper
bounding quadratic approximation to generate descent steps for non-convex
spectral CT data discrepancy terms, combined with a new convex-concave
optimization algorithm. Convergence of the algorithm is demonstrated on
simulated spectral CT data. Simulations with noise and anthropomorphic phantoms
show examples of how to employ the constrained one-step algorithm for spectral
CT data.
Comment: Submitted to Physics in Medicine and Biolog
Revisiting Guerry's data: Introducing spatial constraints in multivariate analysis
Standard multivariate analysis methods aim to identify and summarize the main
structures in large data sets containing the description of a number of
observations by several variables. In many cases, spatial information is also
available for each observation, so that a map can be associated to the
multivariate data set. Two main objectives are relevant in the analysis of
spatial multivariate data: summarizing covariation structures and identifying
spatial patterns. In practice, achieving both goals simultaneously is a
statistical challenge, and a range of methods have been developed that offer
trade-offs between these two objectives. In an applied context, this
methodological question has been and remains a major issue in community
ecology, where species assemblages (i.e., covariation between species
abundances) are often driven by spatial processes (and thus exhibit spatial
patterns). In this paper we review a variety of methods developed in community
ecology to investigate multivariate spatial patterns. We present different ways
of incorporating spatial constraints in multivariate analysis and illustrate
these different approaches using the famous data set on moral statistics in
France published by André-Michel Guerry in 1833. We discuss and compare the
properties of these different approaches from both a practical and a theoretical
viewpoint.
Comment: Published in the Annals of Applied Statistics (http://www.imstat.org/aoas/) at
http://dx.doi.org/10.1214/10-AOAS356 by the Institute of
Mathematical Statistics (http://www.imstat.org
Validating Variational Bayes Linear Regression Method With Multi-Central Datasets.
Purpose: To validate the prediction accuracy of variational Bayes linear regression (VBLR) with two datasets external to the training dataset.
Method: The training dataset consisted of 7268 eyes of 4278 subjects from the University of Tokyo Hospital. Two datasets were used for validation: the Japanese Archive of Multicentral Databases in Glaucoma (JAMDIG) dataset, consisting of 271 eyes of 177 patients, and the Diagnostic Innovations in Glaucoma Study (DIGS) dataset, consisting of 248 eyes of 173 patients. Prediction accuracy was compared between VBLR and ordinary least squares linear regression (OLSLR). OLSLR and VBLR were carried out using total deviation (TD) values at each of the 52 test points, for series ranging from the second to fourth visual fields (VFs) (VF2-4) up to the second to tenth VFs (VF2-10) of each patient in the JAMDIG and DIGS datasets, and the TD values of the 11th VF test were predicted from each series. The predictive accuracy of each method was compared through the root mean squared error (RMSE) statistic.
Results: OLSLR RMSEs with the JAMDIG and DIGS datasets were between 31 and 4.3 dB, and between 19.5 and 3.9 dB, respectively. In contrast, VBLR RMSEs with the JAMDIG and DIGS datasets were between 5.0 and 3.7 dB, and between 4.6 and 3.6 dB. There was a statistically significant difference between VBLR and OLSLR for both datasets at every series (VF2-4 to VF2-10) (P < 0.01 for all tests). However, there was no statistically significant difference in VBLR RMSEs between the JAMDIG and DIGS datasets at any series of VFs (VF2-2 to VF2-10) (P > 0.05).
Conclusions: VBLR outperformed OLSLR in predicting future VF progression, and VBLR has the potential to be a helpful tool in clinical settings.
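
The OLSLR baseline and the RMSE comparison described above can be sketched as follows. This is a minimal illustration under stated assumptions: each test point's TD values are regressed on the visit number (the abstract does not say whether visit index or elapsed time is the regressor), the series and the "actual" 11th-visit values are made-up numbers for two hypothetical test points, and VBLR itself is not implemented here.

```python
import math

def ols_fit(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def predict_td(visits, td_values, target_visit):
    """Extrapolate a test point's TD value (dB) to a future visit."""
    slope, intercept = ols_fit(visits, td_values)
    return slope * target_visit + intercept

def rmse(predicted, actual):
    """Root mean squared error between two equal-length sequences."""
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical TD series (dB) for two test points over visits 2..10 (VF2-10).
visits = list(range(2, 11))
td_series = [
    [-1.0 - 0.5 * v for v in visits],  # steadily worsening point
    [-3.0 for _ in visits],            # stable point
]
actual_11th = [-6.5, -4.0]             # hypothetical measured values at visit 11

predicted_11th = [predict_td(visits, series, 11) for series in td_series]
error = rmse(predicted_11th, actual_11th)
print(predicted_11th, round(error, 4))
```

With a short series such as VF2-4, the fitted slope rests on only three visits, so pointwise extrapolation of this kind can swing widely; that is consistent with the large OLSLR errors reported for the shortest series.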