Fast calculation of boundary crossing probabilities for Poisson processes
The boundary crossing probability of a Poisson process with jumps is a
fundamental quantity with numerous applications. We present a fast algorithm to calculate this probability for arbitrary upper and lower
boundaries.
Comment: 8 pages, 2 figures, associated C++ code is available at
http://www.wisdom.weizmann.ac.il/~amitm
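The abstract does not spell out the fast algorithm, but the quantity itself is easy to make concrete. Below is a naive Monte Carlo sketch (not the paper's method) of the probability that a homogeneous Poisson counting process leaves a corridor between two user-supplied boundaries; the rate, horizon, and boundary functions in the example are illustrative assumptions.

```python
import numpy as np

def crossing_probability_mc(rate, T, upper, lower, n_sim=20000, n_grid=200, seed=0):
    """Naive Monte Carlo estimate of the probability that a homogeneous
    Poisson counting process N(t) on [0, T] leaves the corridor
    lower(t) <= N(t) <= upper(t). Boundaries are checked on a time grid,
    so a very fast-moving boundary could be missed between grid points."""
    rng = np.random.default_rng(seed)
    t_grid = np.linspace(0.0, T, n_grid)
    up = np.array([upper(t) for t in t_grid])
    lo = np.array([lower(t) for t in t_grid])
    crossings = 0
    for _ in range(n_sim):
        n_jumps = rng.poisson(rate * T)
        times = np.sort(rng.uniform(0.0, T, size=n_jumps))
        counts = np.searchsorted(times, t_grid, side="right")  # N(t) on the grid
        if np.any((counts > up) | (counts < lo)):
            crossings += 1
    return crossings / n_sim

# Example: probability that a unit-rate process exceeds 3 before time 1.
# Since N(t) is nondecreasing, the exact value is
# P(N(1) >= 4) = 1 - e^{-1}(1 + 1 + 1/2 + 1/6), about 0.019.
p_hat = crossing_probability_mc(rate=1.0, T=1.0, upper=lambda t: 3, lower=lambda t: -1)
```

This brute-force estimator costs O(n_sim · n_grid) and degrades for small probabilities, which is exactly the regime where a fast exact algorithm like the paper's is valuable.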
Fast calculation of p-values for one-sided Kolmogorov-Smirnov type statistics
We present a method for computing exact p-values for a large family of
one-sided continuous goodness-of-fit statistics. This includes the higher
criticism statistic, one-sided weighted Kolmogorov-Smirnov statistics, and the
one-sided Berk-Jones statistics. For a sample size of 10,000, our method takes
merely 0.15 seconds and scales to sample sizes in the hundreds of
thousands. This allows practitioners working on genome-wide association studies
and other high-dimensional analyses to use exact finite-sample computations
instead of statistic-specific approximation schemes.
Our work has other applications in statistics, including power analysis,
finding alpha-level thresholds for goodness-of-fit tests, and the construction
of confidence bands for the empirical distribution function. The algorithm is
based on a reduction to the boundary-crossing probability of a pure jump
process and is also applicable to fields outside of statistics, for example in
financial risk modeling.
Comment: 22 pages, 3 figures. Supplementary code is included under the
crossprob and benchmarks directories.
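For the plain (unweighted) one-sided Kolmogorov-Smirnov statistic, an exact p-value already has a classical closed form, which makes a useful reference point. The sketch below implements the Birnbaum-Tingey formula; it covers only this one member of the family, whereas the paper's boundary-crossing reduction handles arbitrary weighted and one-sided statistics.

```python
from math import comb, floor

def ks_plus_pvalue(n, d):
    """Exact P(D_n^+ >= d) for the classical one-sided Kolmogorov-Smirnov
    statistic via the Birnbaum-Tingey closed form:
    P = d * sum_{j=0}^{floor(n(1-d))} C(n,j) (d + j/n)^(j-1) (1 - d - j/n)^(n-j)."""
    if d <= 0.0:
        return 1.0
    if d >= 1.0:
        return 0.0
    total = 0.0
    for j in range(floor(n * (1.0 - d)) + 1):
        total += comb(n, j) * (d + j / n) ** (j - 1) * (1.0 - d - j / n) ** (n - j)
    return min(1.0, d * total)
```

Each evaluation sums O(n) terms; for n = 1, the formula reduces to P(D_1^+ >= d) = 1 - d, which gives a quick sanity check.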
On the cross-validation bias due to unsupervised pre-processing
Cross-validation is the de facto standard for predictive model evaluation and
selection. In proper use, it provides an unbiased estimate of a model's
predictive performance. However, data sets often undergo various forms of
data-dependent preprocessing, such as mean-centering, rescaling, dimensionality
reduction, and outlier removal. It is often believed that such preprocessing
stages, if done in an unsupervised manner (one that does not use the class
labels or response values), are safe to perform prior to cross-validation.
In this paper, we study three commonly practiced preprocessing procedures prior
to a regression analysis: (i) variance-based feature selection; (ii) grouping
of rare categorical features; and (iii) feature rescaling. We demonstrate that
unsupervised preprocessing procedures can, in fact, introduce a large bias into
cross-validation estimates and potentially lead to sub-optimal model selection.
This bias may be either positive or negative and its exact magnitude depends on
all the parameters of the problem in an intricate manner. Further research is
needed to understand the real-world impact of this bias across different
application domains, particularly when dealing with low sample counts and
high-dimensional data.
Comment: 29 pages, 4 figures, 1 table
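The contrast the paper studies can be reproduced structurally with a toy pipeline. In this sketch (the pure-noise setup, sample sizes, and plain least-squares model are illustrative assumptions, not the paper's experiments), variance-based feature selection is done either once on the full data before cross-validation or separately inside each training fold; per the abstract, the sign and size of any gap between the two estimates depend intricately on the problem parameters.

```python
import numpy as np

def cv_mse(X, y, k_feats, n_folds=5, select_per_fold=True, seed=0):
    """Cross-validated MSE of least squares preceded by variance-based feature
    selection. With select_per_fold=False, the top-variance features are
    chosen once on the FULL data before CV (the practice examined in the
    paper); with True, selection is redone inside every training fold."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    keep_full = np.argsort(X.var(axis=0))[-k_feats:]
    errs = []
    for f in range(n_folds):
        test = folds[f]
        train = np.concatenate([folds[g] for g in range(n_folds) if g != f])
        keep = (np.argsort(X[train].var(axis=0))[-k_feats:]
                if select_per_fold else keep_full)
        # Ordinary least squares with an intercept on the selected features.
        beta, *_ = np.linalg.lstsq(
            np.c_[np.ones(len(train)), X[train][:, keep]], y[train], rcond=None)
        pred = np.c_[np.ones(len(test)), X[test][:, keep]] @ beta
        errs.append(np.mean((y[test] - pred) ** 2))
    return float(np.mean(errs))

# Hypothetical pure-noise setup: few samples, many irrelevant features.
rng = np.random.default_rng(1)
X = rng.normal(size=(40, 200))
y = rng.normal(size=40)
mse_full_data_selection = cv_mse(X, y, k_feats=10, select_per_fold=False)
mse_per_fold_selection = cv_mse(X, y, k_feats=10, select_per_fold=True)
```

The point of the sketch is the structural difference between the two pipelines, not the particular numbers a single random draw produces.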
Hidden Markov modeling of single particle diffusion with stochastic tethering
The statistics of the diffusive motion of particles often serve as an
experimental proxy for their interaction with the environment. However,
inferring the physical properties from the observed trajectories is
challenging. Inspired by a recent experiment, here we analyze the problem of
particles undergoing two-dimensional Brownian motion with transient tethering
to the surface. We model the problem as a Hidden Markov Model where the
physical position is observed, and the tethering state is hidden. We develop an
alternating maximization algorithm to infer the hidden state of the particle
and estimate the physical parameters of the system. The crux of our method is a
saddle-point-like approximation, which involves finding the most likely
sequence of hidden states and estimating the physical parameters from it.
Extensive numerical tests demonstrate that our algorithm reliably finds the
model parameters, and is insensitive to the initial guess. We discuss the
different regimes of physical parameters and the algorithm's performance in
these regimes. We also provide a ready-to-use open source implementation of our
algorithm.
Comment: 10 pages, 7 figures
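The alternating scheme described above, decode the most likely hidden sequence, then re-estimate parameters from it, can be sketched compactly. The emission model below is a deliberate simplification of the paper's tethering dynamics (each hidden state just has its own Gaussian step standard deviation), and all simulation parameters are hypothetical.

```python
import numpy as np

def viterbi(loglik, logA, logpi):
    """Most likely hidden-state path of a 2-state HMM (log domain).
    loglik: (T, 2) array of per-step log emission likelihoods."""
    T = len(loglik)
    dp = np.zeros((T, 2))
    bp = np.zeros((T, 2), dtype=int)
    dp[0] = logpi + loglik[0]
    for t in range(1, T):
        cand = dp[t - 1][:, None] + logA      # cand[i, j]: prev state i -> j
        bp[t] = cand.argmax(axis=0)
        dp[t] = cand.max(axis=0) + loglik[t]
    path = np.zeros(T, dtype=int)
    path[-1] = dp[-1].argmax()
    for t in range(T - 2, -1, -1):
        path[t] = bp[t + 1][path[t + 1]]
    return path

def alternating_fit(steps, n_iter=30):
    """Alternate Viterbi decoding with parameter re-estimation. Simplified
    stand-in for the paper's model: each hidden state (free / tethered)
    has its own Gaussian step standard deviation."""
    sig = np.array([2.0, 0.5]) * np.std(steps)    # rough initial guess
    A = np.full((2, 2), 0.5)
    for _ in range(n_iter):
        loglik = -0.5 * (steps[:, None] / sig) ** 2 - np.log(sig)
        path = viterbi(loglik, np.log(A), np.log(np.array([0.5, 0.5])))
        for s in (0, 1):                          # per-state step std
            if np.any(path == s):
                sig[s] = max(np.std(steps[path == s]), 1e-6)
        counts = np.ones((2, 2))                  # +1 Laplace smoothing
        for a, b in zip(path[:-1], path[1:]):
            counts[a, b] += 1
        A = counts / counts.sum(axis=1, keepdims=True)
    return sig, A, path

# Hypothetical ground truth: sticky two-state switching, step stds 1.0 and 0.1.
rng = np.random.default_rng(0)
true_states = np.zeros(600, dtype=int)
for t in range(1, 600):
    true_states[t] = true_states[t - 1] if rng.random() < 0.95 else 1 - true_states[t - 1]
steps = rng.normal(0.0, np.where(true_states == 0, 1.0, 0.1))
sig_hat, A_hat, path_hat = alternating_fit(steps)
```

The decode-then-re-estimate loop is the "saddle-point-like" shortcut: instead of marginalizing over all hidden sequences as in full EM, parameters are updated from the single most likely sequence.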
Cryo-EM reconstruction of continuous heterogeneity by Laplacian spectral volumes
Single-particle electron cryomicroscopy is an essential tool for
high-resolution 3D reconstruction of proteins and other biological
macromolecules. An important challenge in cryo-EM is the reconstruction of
non-rigid molecules with parts that move and deform. Traditional reconstruction
methods fail in these cases, resulting in smeared reconstructions of the moving
parts. This poses a major obstacle for structural biologists, who need
high-resolution reconstructions of entire macromolecules, moving parts
included. To address this challenge, we present a new method for the
reconstruction of macromolecules exhibiting continuous heterogeneity. The
proposed method uses projection images from multiple viewing directions to
construct a graph Laplacian through which the manifold of three-dimensional
conformations is analyzed. The 3D molecular structures are then expanded in a
basis of Laplacian eigenvectors, using a novel generalized tomographic
reconstruction algorithm to compute the expansion coefficients. These
coefficients, which we name spectral volumes, provide a high-resolution
visualization of the molecular dynamics. We provide a theoretical analysis and
evaluate the method empirically on several simulated data sets.
Comment: 33 pages, 10 figures
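The core ingredient, expanding a quantity that varies smoothly over the conformation manifold in a graph-Laplacian eigenbasis, can be illustrated in miniature. The sketch below uses points on a circle as a hypothetical stand-in for the conformation manifold and a scalar signal in place of 3D volumes; the tomographic reconstruction of the expansion coefficients from projection images is not reproduced here.

```python
import numpy as np

def graph_laplacian(points, eps):
    """Dense unnormalized graph Laplacian with Gaussian edge weights."""
    d2 = ((points[:, None, :] - points[None, :, :]) ** 2).sum(-1)
    W = np.exp(-d2 / eps)
    np.fill_diagonal(W, 0.0)
    return np.diag(W.sum(axis=1)) - W

# Hypothetical 1-D conformation manifold: points on a circle in the plane.
n = 200
theta = np.linspace(0.0, 2.0 * np.pi, n, endpoint=False)
pts = np.c_[np.cos(theta), np.sin(theta)]
L = graph_laplacian(pts, eps=0.05)

# The low eigenvectors of L form the spectral basis; a smooth quantity that
# varies along the manifold is expanded in the first few of them, and the
# expansion coefficients play the role of the spectral volumes.
vals, vecs = np.linalg.eigh(L)
basis = vecs[:, :5]                                  # 5 lowest eigenvectors
signal = np.cos(theta) + 0.3 * np.sin(2.0 * theta)   # smooth test function
coeffs = basis.T @ signal
recon = basis @ coeffs
```

Because the test signal is smooth along the manifold, a handful of Laplacian eigenvectors reconstructs it almost exactly, which is the property the spectral-volume expansion relies on.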
Earthmover-based manifold learning for analyzing molecular conformation spaces
In this paper, we propose a novel approach for manifold learning that
combines the Earthmover's distance (EMD) with the diffusion maps method for
dimensionality reduction. We demonstrate the potential benefits of this
approach for learning shape spaces of proteins and other flexible
macromolecules using a simulated dataset of 3-D density maps that mimic the
non-uniform rotary motion of ATP synthase. Our results show that EMD-based
diffusion maps require far fewer samples to recover the intrinsic geometry than
the standard diffusion maps algorithm that is based on the Euclidean distance.
To reduce the computational burden of calculating the EMD for all volume pairs,
we employ a wavelet-based approximation to the EMD which reduces the
computation of the pairwise EMDs to a computation of pairwise weighted-ℓ1
distances between wavelet coefficient vectors.
Comment: 5 pages, 4 figures, 1 table
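The wavelet approximation can be sketched in one dimension. The paper works with 3-D density maps; the Haar transform, the weight exponent s, and the toy point-mass histograms below are illustrative assumptions chosen to show the idea that a weighted ℓ1 distance between wavelet coefficients tracks transport cost.

```python
import numpy as np

def haar_details(x):
    """Multilevel Haar detail coefficients of a 1-D signal whose length is a
    power of two; level 0 is the finest scale."""
    levels = []
    a = np.asarray(x, dtype=float)
    while len(a) > 1:
        pairs = a.reshape(-1, 2)
        levels.append((pairs[:, 0] - pairs[:, 1]) / np.sqrt(2.0))
        a = (pairs[:, 0] + pairs[:, 1]) / np.sqrt(2.0)
    return levels

def wavelet_emd(p, q, s=1.5):
    """Approximate 1-D earthmover's distance between equal-mass histograms
    p and q as a weighted l1 distance between their Haar detail
    coefficients. Coarser levels get geometrically larger weights 2^(s*j)
    because coarse-scale differences correspond to long-range transport;
    s = 1 + d/2 in d dimensions is one standard choice. (The coarsest
    scaling coefficients of equal-mass histograms are equal and omitted.)"""
    total = 0.0
    for j, (dp_, dq_) in enumerate(zip(haar_details(p), haar_details(q))):
        total += 2.0 ** (s * j) * np.abs(dp_ - dq_).sum()
    return total

# Point masses at bins 0, 1, 2, and 4 of an 8-bin histogram: the approximate
# distances should grow with how far the mass must be moved.
delta = np.eye(8)
d01 = wavelet_emd(delta[0], delta[1])
d02 = wavelet_emd(delta[0], delta[2])
d04 = wavelet_emd(delta[0], delta[4])
```

After a single transform per histogram, every pairwise distance is just a weighted ℓ1 norm of a coefficient difference, which is what removes the need to solve a transport problem for each pair of volumes.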
