Fast calculation of p-values for one-sided Kolmogorov-Smirnov type statistics
We present a method for computing exact p-values for a large family of
one-sided continuous goodness-of-fit statistics. This includes the higher
criticism statistic, one-sided weighted Kolmogorov-Smirnov statistics, and the
one-sided Berk-Jones statistics. For a sample size of 10,000, our method takes
merely 0.15 seconds to run and it scales to sample sizes in the hundreds of
thousands. This allows practitioners working on genome-wide association studies
and other high-dimensional analyses to use exact finite-sample computations
instead of statistic-specific approximation schemes.
Our work has other applications in statistics, including power analysis,
finding alpha-level thresholds for goodness-of-fit tests, and the construction
of confidence bands for the empirical distribution function. The algorithm is
based on a reduction to the boundary-crossing probability of a pure jump
process and is also applicable to fields outside of statistics, for example in
financial risk modeling.
Comment: 22 pages, 3 figures. Supplementary code is included under the crossprob and benchmarks directories.
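The paper's algorithm itself is not reproduced here, but the flavor of an exact finite-sample computation can be seen in the classical Birnbaum-Tingey formula for the plain (unweighted) one-sided Kolmogorov-Smirnov statistic: P(D_n^+ >= d) = d * sum over j from 0 to floor(n(1-d)) of C(n,j)(d + j/n)^(j-1)(1 - d - j/n)^(n-j). A minimal sketch (the function name is ours):

```python
from math import comb

def ks_plus_pvalue(n: int, d: float) -> float:
    """Exact P(D_n^+ >= d) for the classical one-sided KS statistic,
    via the Birnbaum-Tingey formula; valid for sample size n >= 1."""
    if d <= 0.0:
        return 1.0
    if d >= 1.0:
        return 0.0
    total = 0.0
    for j in range(int(n * (1.0 - d)) + 1):
        total += comb(n, j) * (d + j / n) ** (j - 1) * (1.0 - d - j / n) ** (n - j)
    return d * total
```

For n = 1 the statistic D_1^+ is uniform on (0, 1), so `ks_plus_pvalue(1, d)` reduces to 1 - d, which is a quick sanity check on the formula. The O(n) terms in the sum make this direct evaluation feasible for moderate n, though weighted statistics like higher criticism need the boundary-crossing reduction described in the abstract.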
Cube Mentalism
Our tour of multidimensional cubes begins with the marking of the eight corners of a 3-cube with the eight words HOT, POT, POD, HOD, HAD, HAT, PAT, and PAD. The figure below shows how these eight corners inherit their labels from the (HOT-PAD) die, where the letter H is opposite P, the letter O is opposite A, and the letter T is opposite D.
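The labelling works because each word is a triple of binary letter choices, one from each opposite pair H/P, O/A, T/D, so the eight words correspond exactly to the corners of a 3-cube. A minimal sketch of that correspondence (identifier names are ours):

```python
from itertools import product

# The three letter pairs sit on opposite faces of the die: H/P, O/A, T/D.
pairs = [("H", "P"), ("O", "A"), ("T", "D")]

# Each corner of the 3-cube is a bit triple; each bit picks one pair member.
words = ["".join(pair[b] for pair, b in zip(pairs, bits))
         for bits in product((0, 1), repeat=3)]
```

Running this reproduces the eight words above, and two corners are adjacent on the cube precisely when their words differ in a single letter.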
Farrell's Spider
Puzzle game featured in Ivan Moscovich's magnetic puzzle pack:
Place the 18 discs on the web so that the sum of the numbers on each of the three hexagons and on each of the three ribs equals 57.
Learning discrete Hidden Markov Models from state distribution vectors
Hidden Markov Models (HMMs) are probabilistic models that have been widely applied in many fields since their inception in the late 1960s. Computational biology, image processing, and signal processing are but a few of the application areas of HMMs. In this dissertation, we develop several new efficient algorithms for learning HMM parameters. First, we propose a new polynomial-time algorithm for supervised learning of the parameters of a first-order HMM from a state probability distribution (SD) oracle. The SD oracle provides the learner with the state distribution vector corresponding to a query string. We prove the correctness of the algorithm and establish the conditions under which it is guaranteed to construct a model that exactly matches the oracle's target HMM. We also conduct a simulation experiment to test the viability of the algorithm. Furthermore, the SD oracle is proven to be necessary for polynomial-time learning in the sense that the consistency problem for HMMs, in which a training set of state distribution vectors such as those provided by the SD oracle is used but without the ability to query on arbitrary strings, is NP-complete. Next, we define helpful distributions on an instance set of strings for which polynomial-time HMM learning from state distribution vectors is feasible in the absence of an SD oracle, and we propose a new PAC-learning algorithm for HMM parameters under helpful distributions. The PAC-learning algorithm ensures with high probability that HMM parameters can be learned from training examples without asking queries. Finally, we propose a hybrid learning algorithm for approximating HMM parameters from a dataset composed of strings and their corresponding state distribution vectors, and we provide supporting experimental data indicating that our hybrid algorithm produces more accurate approximations than the existing method.
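The kind of vector an SD oracle returns, the posterior state distribution after an emitted query string, can be computed for a known HMM with one pass of the standard forward algorithm. A minimal sketch using a hypothetical two-state, two-symbol HMM (all parameter values are illustrative, not taken from the dissertation):

```python
import numpy as np

# Hypothetical 2-state, 2-symbol HMM; all numbers are illustrative only.
A  = np.array([[0.7, 0.3],
               [0.4, 0.6]])   # A[i, j] = P(next state j | current state i)
B  = np.array([[0.9, 0.1],
               [0.2, 0.8]])   # B[i, s] = P(emit symbol s | state i)
pi = np.array([0.5, 0.5])     # initial state distribution

def state_distribution(query):
    """Posterior over the final state given an emitted query string --
    the kind of vector a state-distribution (SD) oracle would return."""
    alpha = pi * B[:, query[0]]          # forward variables at t = 0
    for s in query[1:]:
        alpha = (alpha @ A) * B[:, s]    # one forward-algorithm step
    return alpha / alpha.sum()           # normalise to a distribution
```

With these illustrative parameters, a query consisting of symbol 0 shifts the posterior toward state 0, since state 0 emits that symbol far more often. The learning problem studied in the dissertation runs in the opposite direction: recovering A, B, and pi from such vectors.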
On the cross-validation bias due to unsupervised pre-processing
Cross-validation is the de facto standard for predictive model evaluation and
selection. In proper use, it provides an unbiased estimate of a model's
predictive performance. However, data sets often undergo various forms of
data-dependent preprocessing, such as mean-centering, rescaling, dimensionality
reduction, and outlier removal. It is often believed that such preprocessing
stages, if performed in an unsupervised manner (that is, without using the
class labels or response values), are safe to apply before cross-validation.
In this paper, we study three commonly-practiced preprocessing procedures prior
to a regression analysis: (i) variance-based feature selection; (ii) grouping
of rare categorical features; and (iii) feature rescaling. We demonstrate that
unsupervised preprocessing procedures can, in fact, introduce a large bias into
cross-validation estimates and potentially lead to sub-optimal model selection.
This bias may be either positive or negative and its exact magnitude depends on
all the parameters of the problem in an intricate manner. Further research is
needed to understand the real-world impact of this bias across different
application domains, particularly when dealing with low sample counts and
high-dimensional data.
Comment: 29 pages, 4 figures, 1 table.
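As an illustration of protocol (i), the sketch below (not the paper's experimental setup; all sizes and identifier names are ours) contrasts variance-based feature selection performed once on the full data with selection refit inside each training fold. As the abstract notes, the size and even the sign of the resulting gap depend on the problem parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, k = 60, 200, 10            # few samples, many features (illustrative)
X = rng.normal(size=(n, p))
y = rng.normal(size=n)           # pure-noise response: nothing to predict

def cv_mse(select_inside_folds):
    """5-fold CV error of least squares on the k highest-variance features.

    select_inside_folds=False mimics the leaky protocol: the (unsupervised)
    feature selection sees all rows, including each held-out fold.
    """
    folds = np.array_split(np.random.default_rng(1).permutation(n), 5)
    errs = []
    for fold in folds:
        train = np.setdiff1d(np.arange(n), fold)
        pool = X[train] if select_inside_folds else X
        cols = np.argsort(pool.var(axis=0))[-k:]          # top-k by variance
        beta, *_ = np.linalg.lstsq(X[train][:, cols], y[train], rcond=None)
        errs.append(np.mean((X[fold][:, cols] @ beta - y[fold]) ** 2))
    return float(np.mean(errs))
```

The fold-internal variant (`select_inside_folds=True`) is the standard remedy: every data-dependent step, supervised or not, is refit on the training rows of each fold, so the held-out fold never influences the fitted pipeline.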
State and civil society in Gran Buenos Aires. Change and tensions in the new relations of local government