118 research outputs found

    Online Adaptor Grammars with Hybrid Inference

    Adaptor grammars are a flexible, powerful formalism for defining nonparametric, unsupervised models of grammar productions. This flexibility comes at the cost of expensive inference. We address the difficulty of inference through an online algorithm which uses a hybrid of Markov chain Monte Carlo and variational inference. We show that this inference strategy improves scalability without sacrificing performance on unsupervised word segmentation and topic modeling tasks.

    Models, Inference, and Implementation for Scalable Probabilistic Models of Text

    Unsupervised probabilistic Bayesian models are powerful tools for statistical analysis, especially in information retrieval, document analysis, and text processing. Despite their success, these models are often slow at inference because their latent variables are mutually dependent. In addition, their parameter space is usually very large. As data from various media sources (for example, the internet, electronic books, and digital films) become widely accessible, the lack of scalability of these models becomes a critical bottleneck. The primary focus of this dissertation is to speed up inference in unsupervised probabilistic Bayesian models. There are two common ways to scale an algorithm up to large data: parallelization and streaming. The former achieves scalability by distributing the data and the computation across multiple machines. The latter assumes that data arrive in a stream and updates the model gradually after seeing each observation; it scales to larger datasets because it usually takes only one pass over the data. In this dissertation, we examine both approaches. We first demonstrate the effectiveness of parallelization on a class of unsupervised Bayesian models: topic models, exemplified by latent Dirichlet allocation (LDA). We propose a fast parallel implementation using variational inference on the MapReduce framework, referred to as Mr. LDA. We show that parallelization enables topic models to handle significantly larger datasets. We further show that our implementation, unlike highly tuned and specialized implementations, is easily extensible. We demonstrate two extensions possible with this scalable framework: 1) informed priors to guide topic discovery and 2) extracting topics from a multilingual corpus.
    We propose polylingual tree-based topic models to infer topics in multilingual corpora. We then propose three different inference methods to infer the latent variables. We examine the effectiveness of these inference methods on the task of machine translation, in which we use the proposed model to extract domain knowledge that considers both source and target languages. We apply it to a large collection of aligned Chinese-English sentences and show that our model yields significant improvement in BLEU score over strong baselines. Besides parallelization, another approach to scalability is to learn parameters in an online streaming setting. Although many online algorithms have been proposed for LDA, they all overlook a fundamental but challenging problem: the vocabulary is constantly evolving over time. To address this problem, we propose an online LDA with infinite vocabulary, infvoc LDA. We derive online hybrid inference for our model and propose heuristics to dynamically order, expand, and contract the set of words in our vocabulary. We show that our algorithm is able to discover better topics by incorporating new words into the vocabulary and constantly refining the topics over time. In addition to LDA, we also show the generality of the online hybrid inference framework by applying it to adaptor grammars, which are a broader class of models subsuming LDA. With proper grammar rules, an adaptor grammar simplifies to the exact LDA model; however, it provides more flexibility to alter or extend LDA with different grammar rules. We develop online hybrid inference for adaptor grammars and show that our method discovers high-quality structure more quickly than both MCMC and variational inference methods.

    A summary of the 2012 JHU CLSP Workshop on Zero Resource Speech Technologies and Models of Early Language Acquisition

    We summarize the accomplishments of a multi-disciplinary workshop exploring the computational and scientific issues surrounding zero resource (unsupervised) speech technologies and related models of early language acquisition. Centered around the tasks of phonetic and lexical discovery, we consider unified evaluation metrics, present two new approaches for improving speaker independence in the absence of supervision, and evaluate the application of Bayesian word segmentation algorithms to automatic subword unit tokenizations. Finally, we present two strategies for integrating zero resource techniques into supervised settings, demonstrating the potential of unsupervised methods to improve mainstream technologies.

    A computational framework of human causal generalization

    How do people decide how general a causal relationship is, in terms of the entities or situations it applies to? How can people make these difficult judgments in a fast, efficient way? To address these questions, I designed a novel online experiment interface that systematically measures how people generalize causal relationships, and developed a computational modeling framework that combines program induction (about the hidden causal laws) with non-parametric category inference (about their domains of influence) to account for unique patterns in human causal generalization. In particular, by introducing adaptor grammars to standard Bayesian-symbolic models, this framework formalizes conceptual bootstrapping as a general online inference algorithm that gives rise to compositional causal concepts. Chapter 2 investigates one-shot causal generalization, where I find that participants’ inferences are shaped by the order of the generalization questions they are asked. Chapter 3 looks into few-shot cases, and finds an asymmetry in the formation of causal categories: participants preferentially identify causal laws with features of the agent objects rather than recipients, but this asymmetry disappears when visual cues to causal agency are challenged. The proposed modeling approach can explain both the generalization-order effect and the causal asymmetry, outperforming a naïve Bayesian account while providing a computationally plausible mechanism for real-world causal generalization. Chapter 4 further extends this framework with adaptor grammars, using a dynamic conceptual repertoire that is enriched over time, allowing the model to cache and later reuse elements of earlier insights. This model predicts systematically different learned concepts when the same evidence is processed in different orders, and across four experiments people’s learning outcomes indeed closely resembled this model’s, differing significantly from alternative accounts.

    Investigating Language Impact in Bilingual Approaches for Computational Language Documentation

    For endangered languages, data collection campaigns have to accommodate the challenge that many of them are from oral tradition, and producing transcriptions is costly. Therefore, it is fundamental to translate them into a widely spoken language to ensure interpretability of the recordings. In this paper we investigate how the choice of translation language affects the subsequent documentation work and the automatic approaches that may be applied to the resulting bilingual corpus. To answer this question, we use the MaSS multilingual speech corpus (Boito et al., 2020) to create 56 bilingual pairs that we apply to the task of low-resource unsupervised word segmentation and alignment. Our results highlight that the choice of language for translation influences the word segmentation performance, and that different lexicons are learned by using different aligned translations. Lastly, this paper proposes a hybrid approach for bilingual word segmentation, combining boundary clues extracted from a non-parametric Bayesian model (Goldwater et al., 2009a) with the attentional word segmentation neural model from Godard et al. (2018). Our results suggest that incorporating these clues into the neural models' input representation increases their translation and alignment quality, especially for challenging language pairs. Comment: Accepted to 1st Joint SLTU and CCURL Workshop

    Hierarchical Bayesian Nonparametric Models for Power-Law Sequences

    Sequence data that exhibits power-law behavior in its marginal and conditional distributions arises frequently from natural processes, with natural language text being a prominent example. We study probabilistic models for such sequences based on a hierarchical non-parametric Bayesian prior, develop inference and learning procedures for making these models useful in practice and applicable to large, real-world data sets, and empirically demonstrate their excellent predictive performance. In particular, we consider models based on the infinite-depth variant of the hierarchical Pitman-Yor process (HPYP) language model [Teh, 2006b] known as the Sequence Memoizer, as well as Sequence Memoizer-based cache language models and hybrid models combining the HPYP with neural language models. We empirically demonstrate that these models perform well on language modelling and data compression tasks.