
    Fast calculation of p-values for one-sided Kolmogorov-Smirnov type statistics

    We present a method for computing exact p-values for a large family of one-sided continuous goodness-of-fit statistics. This family includes the higher criticism statistic, one-sided weighted Kolmogorov-Smirnov statistics, and the one-sided Berk-Jones statistics. For a sample size of 10,000, our method takes merely 0.15 seconds to run, and it scales to sample sizes in the hundreds of thousands. This allows practitioners working on genome-wide association studies and other high-dimensional analyses to use exact finite-sample computations instead of statistic-specific approximation schemes. Our work has other applications in statistics, including power analysis, finding alpha-level thresholds for goodness-of-fit tests, and the construction of confidence bands for the empirical distribution function. The algorithm is based on a reduction to the boundary-crossing probability of a pure jump process and is also applicable to fields outside of statistics, for example financial risk modeling.
    Comment: 22 pages, 3 figures. Supplementary code is included under the crossprob and benchmarks directories.
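
    For the plain one-sided statistic D_n^+, there is a classical closed form for the exact p-value (the Birnbaum-Tingey/Smirnov formula), which makes a handy cross-check for the boundary-crossing method described above. A minimal Python sketch, not the paper's algorithm:

        import numpy as np
        from scipy.special import gammaln
        from scipy.stats import ksone

        def ks_plus_pvalue(d, n):
            """Exact P(D_n^+ >= d) via the Birnbaum-Tingey closed form:
            d * sum_{j<=n(1-d)} C(n,j) (d + j/n)^(j-1) (1 - d - j/n)^(n-j)."""
            j = np.arange(int(np.floor(n * (1.0 - d))) + 1)
            log_binom = gammaln(n + 1) - gammaln(j + 1) - gammaln(n - j + 1)
            with np.errstate(divide="ignore"):  # log(0) in the last term is harmless
                log_terms = (log_binom + (j - 1) * np.log(j / n + d)
                             + (n - j) * np.log(1.0 - d - j / n))
            return d * np.exp(log_terms).sum()

        # cross-check against scipy's exact one-sided KS distribution
        print(ks_plus_pvalue(0.05, 100), ksone.sf(0.05, 100))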

    Cube Mentalism

    Our tour of multidimensional cubes begins with the marking of the eight corners of a 3-cube with the eight words HOT, POT, POD, HOD, HAD, HAT, PAT, and PAD. The figure below shows how these eight corners inherit their labels from a (HOT-PAD) die where the letter H is opposite P, the letter O is opposite A, and the letter T is opposite the letter D.
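
    Since each corner corresponds to one choice per axis from the opposite-letter pairs {H,P}, {O,A}, and {T,D}, the labeling can be enumerated directly. A small illustrative sketch (not from the article):

        from itertools import product

        # Each axis of the 3-cube picks one letter from an opposite pair on the
        # die (H/P, O/A, T/D); the Cartesian product yields the eight corner words.
        for bits, letters in zip(product((0, 1), repeat=3),
                                 product("HP", "OA", "TD")):
            print(bits, "".join(letters))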

    Farrell's Spider

    Puzzle game featured in Ivan Moscovich's magnetic puzzle pack: Place the 18 discs on the web so that the sum of the numbers on each of the three hexagons and on each of the three ribs equals 57.
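
    The geometry of the web isn't reproduced here, but the search itself is a small backtracking exercise. A hedged sketch, assuming the discs are numbered 1 through 18 (consistent with 1 + 2 + ... + 18 = 171 = 3 x 57) and using placeholder position sets in place of the real hexagons and ribs:

        TARGET, N = 57, 18
        # HYPOTHETICAL incidence structure: replace these index sets with the
        # positions that actually lie on each hexagon and rib of the web; the
        # sets below merely keep the sketch self-contained and solvable.
        GROUPS = [set(range(0, 6)), set(range(6, 12)), set(range(12, 18)),
                  {0, 1, 6, 7, 12, 13}, {2, 3, 8, 9, 14, 15},
                  {4, 5, 10, 11, 16, 17}]

        def feasible(board):
            for g in GROUPS:
                s = sum(board[i] for i in g if board[i] is not None)
                if s > TARGET:        # discs are positive: overshoot is fatal
                    return False
                if s != TARGET and all(board[i] is not None for i in g):
                    return False      # group fully placed but misses the target
            return True

        def solve(board, used, pos=0):
            """Place discs 1..N position by position, backtracking on infeasibility."""
            if pos == N:
                return board[:]
            for disc in range(1, N + 1):
                if not used[disc]:
                    board[pos], used[disc] = disc, True
                    if feasible(board):
                        found = solve(board, used, pos + 1)
                        if found:
                            return found
                    board[pos], used[disc] = None, False
            return None

        print(solve([None] * N, [False] * (N + 1)))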

    Learning discrete Hidden Markov Models from state distribution vectors

    Hidden Markov Models (HMMs) are probabilistic models that have been widely applied to a number of fields since their inception in the late 1960s. Computational biology, image processing, and signal processing are but a few of the application areas of HMMs. In this dissertation, we develop several new efficient algorithms for learning HMM parameters. First, we propose a new polynomial-time algorithm for supervised learning of the parameters of a first-order HMM from a state probability distribution (SD) oracle. The SD oracle provides the learner with the state distribution vector corresponding to a query string. We prove the correctness of the algorithm and establish the conditions under which it is guaranteed to construct a model that exactly matches the oracle's target HMM. We also conduct a simulation experiment to test the viability of the algorithm. Furthermore, the SD oracle is proven to be necessary for polynomial-time learning in the sense that the consistency problem for HMMs, where a training set of state distribution vectors such as those provided by the SD oracle is used but without the ability to query on arbitrary strings, is NP-complete. Next, we define helpful distributions on an instance set of strings for which polynomial-time HMM learning from state distribution vectors is feasible in the absence of an SD oracle, and we propose a new PAC-learning algorithm for HMM parameters under helpful distributions. The PAC-learning algorithm ensures with high probability that HMM parameters can be learned from training examples without asking queries. Finally, we propose a hybrid learning algorithm for approximating HMM parameters from a dataset composed of strings and their corresponding state distribution vectors, and we provide supporting experimental data indicating that our hybrid algorithm produces more accurate approximations than the existing method.
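
    To make the oracle concrete: one natural reading of the state distribution vector for a query string is the posterior distribution over hidden states after the HMM has emitted that string, computable with a normalized forward pass. A minimal sketch with illustrative toy parameters (not the dissertation's learning algorithm):

        import numpy as np

        def state_distribution(pi, A, B, obs):
            """pi: (k,) initial distribution; A: (k,k) transition matrix;
            B: (k,m) emission matrix; obs: sequence of observation indices.
            Returns P(state at final step | observed string) via the forward pass."""
            alpha = pi * B[:, obs[0]]
            for o in obs[1:]:
                alpha = (alpha @ A) * B[:, o]
            return alpha / alpha.sum()

        # toy 2-state, 2-symbol HMM
        pi = np.array([0.6, 0.4])
        A = np.array([[0.7, 0.3], [0.2, 0.8]])
        B = np.array([[0.9, 0.1], [0.3, 0.7]])
        print(state_distribution(pi, A, B, [0, 1, 1]))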

    On the cross-validation bias due to unsupervised pre-processing

    Cross-validation is the de facto standard for predictive model evaluation and selection. In proper use, it provides an unbiased estimate of a model's predictive performance. However, data sets often undergo various forms of data-dependent preprocessing, such as mean-centering, rescaling, dimensionality reduction, and outlier removal. It is often believed that such preprocessing stages, if done in an unsupervised manner (one that does not incorporate the class labels or response values), are generally safe to perform prior to cross-validation. In this paper, we study three commonly practiced preprocessing procedures prior to a regression analysis: (i) variance-based feature selection; (ii) grouping of rare categorical features; and (iii) feature rescaling. We demonstrate that unsupervised preprocessing procedures can, in fact, introduce a large bias into cross-validation estimates and potentially lead to sub-optimal model selection. This bias may be either positive or negative, and its exact magnitude depends on all the parameters of the problem in an intricate manner. Further research is needed to understand the real-world impact of this bias across different application domains, particularly when dealing with low sample counts and high-dimensional data.
    Comment: 29 pages, 4 figures, 1 table
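
    The variance-based feature selection case is straightforward to reproduce in outline: fit the selector once on the full data set versus refitting it inside every fold. A minimal scikit-learn sketch (illustrative, not the paper's experiments); whether and how far the two estimates diverge depends on the data-generating process, which is exactly the paper's point:

        import numpy as np
        from sklearn.base import BaseEstimator, TransformerMixin
        from sklearn.linear_model import Ridge
        from sklearn.model_selection import cross_val_score
        from sklearn.pipeline import make_pipeline

        class TopVarianceSelector(BaseEstimator, TransformerMixin):
            """Unsupervised preprocessing: keep the k highest-variance columns,
            estimated only from the rows passed to fit()."""
            def __init__(self, k=50):
                self.k = k
            def fit(self, X, y=None):
                self.idx_ = np.argsort(X.var(axis=0))[-self.k:]
                return self
            def transform(self, X):
                return X[:, self.idx_]

        rng = np.random.default_rng(0)
        X, y = rng.normal(size=(50, 500)), rng.normal(size=50)  # pure-noise response

        # (a) selector fit on ALL rows before cross-validation: the validation
        # folds leak into the preprocessing step
        X_pre = TopVarianceSelector().fit(X).transform(X)
        outside = cross_val_score(Ridge(), X_pre, y, cv=5)

        # (b) selector refit on each training fold inside a pipeline
        pipe = make_pipeline(TopVarianceSelector(), Ridge())
        inside = cross_val_score(pipe, X, y, cv=5)
        print(outside.mean(), inside.mean())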

    Reflowing digital ink annotations
