18 research outputs found

    Fast calculation of boundary crossing probabilities for Poisson processes

    The boundary crossing probability of a Poisson process with $n$ jumps is a fundamental quantity with numerous applications. We present a fast $O(n^2 \log n)$ algorithm to calculate this probability for arbitrary upper and lower boundaries. Comment: 8 pages, 2 figures, associated C++ code is available at http://www.wisdom.weizmann.ac.il/~amitm

    Fast calculation of p-values for one-sided Kolmogorov-Smirnov type statistics

    We present a method for computing exact p-values for a large family of one-sided continuous goodness-of-fit statistics. This includes the higher criticism statistic, one-sided weighted Kolmogorov-Smirnov statistics, and the one-sided Berk-Jones statistics. For a sample size of 10,000, our method takes merely 0.15 seconds to run and it scales to sample sizes in the hundreds of thousands. This allows practitioners working on genome-wide association studies and other high-dimensional analyses to use exact finite-sample computations instead of statistic-specific approximation schemes. Our work has other applications in statistics, including power analysis, finding alpha-level thresholds for goodness-of-fit tests, and the construction of confidence bands for the empirical distribution function. The algorithm is based on a reduction to the boundary-crossing probability of a pure jump process and is also applicable to fields outside of statistics, for example in financial risk modeling. Comment: 22 pages, 3 figures. Supplementary code is included under the crossprob and benchmarks directories
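
    The reduction mentioned in this abstract can be illustrated with a minimal sketch. It is not the authors' fast algorithm: it is a naive cubic-time recursion, written under the assumption that the statistic is the unweighted one-sided Kolmogorov-Smirnov statistic D+ = max_i (i/n - U_(i)). The function names poisson_pmf, noncrossing_prob, and one_sided_ks_pvalue are hypothetical. The idea is that a rate-n Poisson process on [0, 1], conditioned on having exactly n jumps, has the same jump locations as n sorted uniform samples, so the p-value is one minus the probability that the jump count stays below a lower boundary.

```python
import math


def poisson_pmf(k, mu):
    """P(X = k) for X ~ Poisson(mu), with the mu = 0 case handled explicitly."""
    if mu == 0.0:
        return 1.0 if k == 0 else 0.0
    return math.exp(k * math.log(mu) - mu - math.lgamma(k + 1))


def noncrossing_prob(lower_bounds):
    """P(U_(i) > lower_bounds[i-1] for all i) for n sorted uniform samples.

    Assumes lower_bounds is nondecreasing. Uses a rate-n Poisson process
    and conditions on exactly n jumps at time 1 (naive O(n^3) recursion).
    """
    n = len(lower_bounds)
    a = [min(1.0, max(0.0, x)) for x in lower_bounds]
    # q[k] = P(no crossing so far and exactly k jumps up to the current time)
    q = [1.0]
    prev = 0.0
    for i in range(1, n + 1):
        mu = n * (a[i - 1] - prev)  # expected jumps in (prev, a_i]
        new_q = [0.0] * i           # at time a_i at most i - 1 jumps are allowed
        for k in range(i):
            top = min(k, len(q) - 1)
            new_q[k] = sum(q[m] * poisson_pmf(k - m, mu) for m in range(top + 1))
        q = new_q
        prev = a[i - 1]
    mu = n * (1.0 - prev)
    # Joint probability of "no crossing" and "exactly n jumps by time 1"
    joint = sum(q[m] * poisson_pmf(n - m, mu) for m in range(n))
    return joint / poisson_pmf(n, float(n))


def one_sided_ks_pvalue(n, d):
    """Exact P(D+ >= d) for sample size n under the null hypothesis."""
    bounds = [i / n - d for i in range(1, n + 1)]
    return 1.0 - noncrossing_prob(bounds)


if __name__ == "__main__":
    print(one_sided_ks_pvalue(2, 0.25))
```

    For n = 1 the p-value reduces to 1 - d, which gives a quick sanity check of the recursion. Other one-sided statistics in the family would only change how the list of boundary values is built.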

    On the cross-validation bias due to unsupervised pre-processing

    Cross-validation is the de facto standard for predictive model evaluation and selection. In proper use, it provides an unbiased estimate of a model's predictive performance. However, data sets often undergo various forms of data-dependent preprocessing, such as mean-centering, rescaling, dimensionality reduction, and outlier removal. It is often believed that such preprocessing stages, if done in an unsupervised manner (that does not incorporate the class labels or response values) are generally safe to do prior to cross-validation. In this paper, we study three commonly-practiced preprocessing procedures prior to a regression analysis: (i) variance-based feature selection; (ii) grouping of rare categorical features; and (iii) feature rescaling. We demonstrate that unsupervised preprocessing procedures can, in fact, introduce a large bias into cross-validation estimates and potentially lead to sub-optimal model selection. This bias may be either positive or negative and its exact magnitude depends on all the parameters of the problem in an intricate manner. Further research is needed to understand the real-world impact of this bias across different application domains, particularly when dealing with low sample counts and high-dimensional data. Comment: 29 pages, 4 figures, 1 table

    Hidden Markov modeling of single particle diffusion with stochastic tethering

    The statistics of the diffusive motion of particles often serve as an experimental proxy for their interaction with the environment. However, inferring the physical properties from the observed trajectories is challenging. Inspired by a recent experiment, here we analyze the problem of particles undergoing two-dimensional Brownian motion with transient tethering to the surface. We model the problem as a Hidden Markov Model where the physical position is observed, and the tethering state is hidden. We develop an alternating maximization algorithm to infer the hidden state of the particle and estimate the physical parameters of the system. The crux of our method is a saddle-point-like approximation, which involves finding the most likely sequence of hidden states and estimating the physical parameters from it. Extensive numerical tests demonstrate that our algorithm reliably finds the model parameters, and is insensitive to the initial guess. We discuss the different regimes of physical parameters and the algorithm's performance in these regimes. We also provide a ready-to-use open source implementation of our algorithm. Comment: 10 pages, 7 figures
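
    The step of "finding the most likely sequence of hidden states" is commonly done with the Viterbi algorithm. The sketch below is a generic log-space Viterbi decoder, not the authors' implementation: the function name viterbi and its arguments are hypothetical, and the emission log-likelihoods are taken as given rather than derived from the paper's diffusion and tethering model.

```python
import math


def viterbi(log_init, log_trans, log_emit):
    """Return the most likely hidden-state path of a discrete-state HMM.

    log_init[s]     : log-probability of starting in state s
    log_trans[r][s] : log-probability of moving from state r to state s
    log_emit[t][s]  : log-likelihood of observation t under state s
    """
    n_states = len(log_init)
    score = [log_init[s] + log_emit[0][s] for s in range(n_states)]
    back = []  # back[t-1][s] = best predecessor of state s at time t
    for t in range(1, len(log_emit)):
        new_score, pointers = [], []
        for s in range(n_states):
            best_prev = max(range(n_states),
                            key=lambda r: score[r] + log_trans[r][s])
            pointers.append(best_prev)
            new_score.append(score[best_prev] + log_trans[best_prev][s]
                             + log_emit[t][s])
        score = new_score
        back.append(pointers)
    # Trace the best final state back to the start
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


if __name__ == "__main__":
    # Toy example: state 0 = "free", state 1 = "tethered" (labels are illustrative)
    log_init = [math.log(0.5), math.log(0.5)]
    log_trans = [[math.log(0.9), math.log(0.1)],
                 [math.log(0.1), math.log(0.9)]]
    log_emit = [[math.log(0.9), math.log(0.1)]] * 3 + \
               [[math.log(0.1), math.log(0.9)]] * 3
    print(viterbi(log_init, log_trans, log_emit))
```

    In an alternating maximization scheme of the kind the abstract describes, a decoder like this would alternate with a parameter update computed from the decoded path; the update step depends on the physical model and is not sketched here.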

    Cryo-EM reconstruction of continuous heterogeneity by Laplacian spectral volumes

    Single-particle electron cryomicroscopy is an essential tool for high-resolution 3D reconstruction of proteins and other biological macromolecules. An important challenge in cryo-EM is the reconstruction of non-rigid molecules with parts that move and deform. Traditional reconstruction methods fail in these cases, resulting in smeared reconstructions of the moving parts. This poses a major obstacle for structural biologists, who need high-resolution reconstructions of entire macromolecules, moving parts included. To address this challenge, we present a new method for the reconstruction of macromolecules exhibiting continuous heterogeneity. The proposed method uses projection images from multiple viewing directions to construct a graph Laplacian through which the manifold of three-dimensional conformations is analyzed. The 3D molecular structures are then expanded in a basis of Laplacian eigenvectors, using a novel generalized tomographic reconstruction algorithm to compute the expansion coefficients. These coefficients, which we name spectral volumes, provide a high-resolution visualization of the molecular dynamics. We provide a theoretical analysis and evaluate the method empirically on several simulated data sets. Comment: 33 pages, 10 figures

    Earthmover-based manifold learning for analyzing molecular conformation spaces

    In this paper, we propose a novel approach for manifold learning that combines the Earthmover's distance (EMD) with the diffusion maps method for dimensionality reduction. We demonstrate the potential benefits of this approach for learning shape spaces of proteins and other flexible macromolecules using a simulated dataset of 3-D density maps that mimic the non-uniform rotary motion of ATP synthase. Our results show that EMD-based diffusion maps require far fewer samples to recover the intrinsic geometry than the standard diffusion maps algorithm that is based on the Euclidean distance. To reduce the computational burden of calculating the EMD for all volume pairs, we employ a wavelet-based approximation to the EMD which reduces the computation of the pairwise EMDs to a computation of pairwise weighted $\ell_1$ distances between wavelet coefficient vectors. Comment: 5 pages, 4 figures, 1 table