13,450 research outputs found
Beyond Word N-Grams
We describe, analyze, and evaluate experimentally a new probabilistic model
for word-sequence prediction in natural language based on prediction suffix
trees (PSTs). By using efficient data structures, we extend the notion of PST
to unbounded vocabularies. We also show how to use a Bayesian approach based on
recursive priors over all possible PSTs to efficiently maintain tree mixtures.
These mixtures have provably and practically better performance than almost any
single model. We evaluate the model on several corpora. The low perplexity
achieved by relatively small PST mixture models suggests that they may be an
advantageous alternative, both theoretically and practically, to the widely
used n-gram models.Comment: 15 pages, one PostScript figure, uses psfig.sty and fullname.sty.
Revised version of a paper in the Proceedings of the Third Workshop on Very
Large Corpora, MIT, 199
Bayesian model averaging over tree-based dependence structures for multivariate extremes
Describing the complex dependence structure of extreme phenomena is
particularly challenging. To tackle this issue we develop a novel statistical
algorithm that describes extremal dependence taking advantage of the inherent
hierarchical dependence structure of the max-stable nested logistic
distribution and that identifies possible clusters of extreme variables using
reversible jump Markov chain Monte Carlo techniques. Parsimonious
representations are achieved when clusters of extreme variables are found to be
completely independent. Moreover, we significantly decrease the computational
complexity of full likelihood inference by deriving a recursive formula for the
nested logistic model likelihood. The algorithm performance is verified through
extensive simulation experiments which also compare different likelihood
procedures. The new methodology is used to investigate the dependence
relationships between extreme concentration of multiple pollutants in
California and how these pollutants are related to extreme weather conditions.
Overall, we show that our approach allows for the representation of complex
extremal dependence structures and has valid applications in multivariate data
analysis, such as air pollution monitoring, where it can guide policymaking
A Multiscale Approach for Statistical Characterization of Functional Images
Increasingly, scientific studies yield functional image data, in which the observed data consist of sets of curves recorded on the pixels of the image. Examples include temporal brain response intensities measured by fMRI and NMR frequency spectra measured at each pixel. This article presents a new methodology for improving the characterization of pixels in functional imaging, formulated as a spatial curve clustering problem. Our method operates on curves as a unit. It is nonparametric and involves multiple stages: (i) wavelet thresholding, aggregation, and Neyman truncation to effectively reduce dimensionality; (ii) clustering based on an extended EM algorithm; and (iii) multiscale penalized dyadic partitioning to create a spatial segmentation. We motivate the different stages with theoretical considerations and arguments, and illustrate the overall procedure on simulated and real datasets. Our method appears to offer substantial improvements over monoscale pixel-wise methods. An Appendix which gives some theoretical justifications of the methodology, computer code, documentation and dataset are available in the online supplements
- …