13,450 research outputs found

    Beyond Word N-Grams

    Full text link
    We describe, analyze, and evaluate experimentally a new probabilistic model for word-sequence prediction in natural language based on prediction suffix trees (PSTs). By using efficient data structures, we extend the notion of PST to unbounded vocabularies. We also show how to use a Bayesian approach based on recursive priors over all possible PSTs to efficiently maintain tree mixtures. These mixtures have provably and practically better performance than almost any single model. We evaluate the model on several corpora. The low perplexity achieved by relatively small PST mixture models suggests that they may be an advantageous alternative, both theoretically and practically, to the widely used n-gram models.Comment: 15 pages, one PostScript figure, uses psfig.sty and fullname.sty. Revised version of a paper in the Proceedings of the Third Workshop on Very Large Corpora, MIT, 199

    Bayesian model averaging over tree-based dependence structures for multivariate extremes

    Full text link
    Describing the complex dependence structure of extreme phenomena is particularly challenging. To tackle this issue we develop a novel statistical algorithm that describes extremal dependence taking advantage of the inherent hierarchical dependence structure of the max-stable nested logistic distribution and that identifies possible clusters of extreme variables using reversible jump Markov chain Monte Carlo techniques. Parsimonious representations are achieved when clusters of extreme variables are found to be completely independent. Moreover, we significantly decrease the computational complexity of full likelihood inference by deriving a recursive formula for the nested logistic model likelihood. The algorithm performance is verified through extensive simulation experiments which also compare different likelihood procedures. The new methodology is used to investigate the dependence relationships between extreme concentration of multiple pollutants in California and how these pollutants are related to extreme weather conditions. Overall, we show that our approach allows for the representation of complex extremal dependence structures and has valid applications in multivariate data analysis, such as air pollution monitoring, where it can guide policymaking

    A Multiscale Approach for Statistical Characterization of Functional Images

    Get PDF
    Increasingly, scientific studies yield functional image data, in which the observed data consist of sets of curves recorded on the pixels of the image. Examples include temporal brain response intensities measured by fMRI and NMR frequency spectra measured at each pixel. This article presents a new methodology for improving the characterization of pixels in functional imaging, formulated as a spatial curve clustering problem. Our method operates on curves as a unit. It is nonparametric and involves multiple stages: (i) wavelet thresholding, aggregation, and Neyman truncation to effectively reduce dimensionality; (ii) clustering based on an extended EM algorithm; and (iii) multiscale penalized dyadic partitioning to create a spatial segmentation. We motivate the different stages with theoretical considerations and arguments, and illustrate the overall procedure on simulated and real datasets. Our method appears to offer substantial improvements over monoscale pixel-wise methods. An Appendix which gives some theoretical justifications of the methodology, computer code, documentation and dataset are available in the online supplements
    corecore