3 research outputs found

    A Study on Richer Syntactic Dependencies for Structured Language Modeling

    We study the impact of richer syntactic dependencies on the performance of the structured language model (SLM) along three dimensions: parsing accuracy (LP/LR), perplexity (PPL) and word-error rate (WER, N-best re-scoring). We show that our models achieve an improvement in LP/LR, PPL and/or WER over the baseline results reported for the SLM on the UPenn Treebank and Wall Street Journal (WSJ) corpora, respectively. Analysis of parsing performance shows a correlation between the quality of the parser (as measured by precision/recall) and the language model performance (PPL and WER). A remarkable fact is that the enriched SLM outperforms the baseline 3-gram model in terms of WER by 10% when used in isolation as a second-pass (N-best re-scoring) language model.
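
    N-best re-scoring, as evaluated here, re-ranks the candidate hypotheses produced by a first pass (e.g. acoustic model plus 3-gram LM) using the second-pass language model's score, either alone or interpolated with the first-pass score. A minimal Python sketch, assuming log-domain scores and a hypothetical lm_score callable standing in for the SLM:

        def rescore_nbest(hypotheses, lm_score, lam=1.0):
            # hypotheses: list of (words, first_pass_score) pairs with
            # log-domain scores; lm_score(words) returns the second-pass
            # model's log-probability. lam=1.0 uses the second-pass model
            # in isolation, as in the 10% WER result quoted above.
            best_words, best_score = None, float("-inf")
            for words, first_pass in hypotheses:
                total = (1.0 - lam) * first_pass + lam * lm_score(words)
                if total > best_score:
                    best_words, best_score = words, total
            return best_words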

    A study on richer syntactic dependencies for structured language modeling

    The paper investigates the use of richer syntactic dependencies in the structured language model (SLM). We present two simple methods of enriching the dependencies in the syntactic parse trees used for initializing the SLM. We evaluate the impact of both methods on the perplexity (PPL) and word-error rate (WER, N-best re-scoring) performance of the SLM. We show that the new model achieves an improvement in PPL and WER over the baseline results reported using the SLM on the UPenn Treebank and Wall Street Journal (WSJ) corpora, respectively.
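
    For reference, the perplexity both abstracts report is the standard per-word measure: for a test sequence w_1, ..., w_N under model p,

        \mathrm{PPL} = \exp\left( -\frac{1}{N} \sum_{i=1}^{N} \log p(w_i \mid w_1, \dots, w_{i-1}) \right)

    so a lower PPL means the model assigns higher probability to the held-out text.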

    Topic Modeling with Structured Priors for Text-Driven Science

    Many scientific disciplines are being revolutionized by the explosion of public data on the web and social media, particularly in health and social sciences. For instance, by analyzing social media messages, we can instantly measure public opinion, understand population behaviors, and monitor events such as disease outbreaks and natural disasters. Taking advantage of these data sources requires tools that can make sense of massive amounts of unstructured and unlabeled text. Topic models, statistical models that posit low-dimensional representations of data, can uncover interesting latent structure in large text datasets and are popular tools for automatically identifying prominent themes in text. For example, prominent themes of discussion in social media might include politics and health. To be useful in scientific analyses, topic models must learn interpretable patterns that accurately correspond to real-world concepts of interest. This thesis will introduce topic models that can encode additional structures such as factorizations, hierarchies, and correlations of topics, and can incorporate supervision and domain knowledge. For example, topics about elections and Congressional legislation are related to each other (as part of a broader topic of “politics”), and certain political topics have partisan associations. These types of relations between topics can be modeled by formulating the Bayesian priors over parameters as functions of underlying “components,” which can be constrained in various ways to induce different structures. This approach is first introduced through a topic model called factorial LDA, which models a factorized structure in which topics are conceptually arranged in multiple dimensions. Factorial LDA can be used to model multiple types of information, for example topic and political ideology. We then introduce a family of structured-prior topic models called SPRITE, which creates a unifying representation that generalizes factorial LDA as well as other existing topic models, and creates a powerful framework for building new models. This thesis will also show how these topic models can be used in various scientific applications, such as extracting medical information from forums, measuring healthcare quality from patient reviews, and monitoring public opinion in social media.
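
    The “components” construction admits a compact statement. As a sketch (symbols illustrative, in the spirit of SPRITE's log-linear formulation rather than quoted from the thesis), the Dirichlet hyperparameter over topics for document d can be written

        \tilde{\theta}_{dz} = \exp\Big( \sum_{c} \alpha_{dc} \, \delta_{cz} \Big), \qquad \theta_d \sim \mathrm{Dirichlet}(\tilde{\theta}_d)

    where \alpha_{dc} links document d to component c and \delta_{cz} links component c to topic z; constraining \alpha and \delta (e.g. sparsity, tied weights, tree structure) yields the factorizations, hierarchies, and correlations among topics described above.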