An Approach toward Register Classification of Book Samples in the Balanced Corpus of Contemporary Written Japanese

Authors: Kashino Wakako, Okumura Manabu
Publication venue: Institute of Digital Enhancement of Cognitive Processing, Waseda University
Publication date: 01/01/2011

Data-driven Classification of Linguistic Styles in Spoken Dialogues

Author: Thomas Portele
Publication date: 01/01/2002

Language users have individual linguistic styles. A spoken dialogue system may benefit from adapting to a user's linguistic style in input analysis and output generation. To investigate whether speakers can be classified automatically according to their linguistic style, three corpora of spoken dialogues were analyzed. Several numerical parameters were computed for every speaker. These parameters were reduced to linguistically interpretable components by means of a principal component analysis. Classes were established from these components by cluster analysis. Unseen input was classified by trained neural networks, with error rates varying by corpus type. A first investigation into using special language models for speaker classes was carried out.
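The pipeline in this abstract (per-speaker parameters, principal component analysis, cluster analysis) can be illustrated with a minimal sketch. Everything concrete here is an assumption and not taken from the paper: the three feature columns, the toy values, the choice of two clusters, and the use of power iteration to obtain only the first principal component. The neural-network classification step is omitted.

```python
import math

def standardize(rows):
    """Scale each column to zero mean and unit variance."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    stds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / n) or 1.0
            for j in range(d)]
    return [[(r[j] - means[j]) / stds[j] for j in range(d)] for r in rows]

def first_component(rows, iters=200):
    """Leading principal axis via power iteration on the covariance matrix."""
    n, d = len(rows), len(rows[0])
    cov = [[sum(r[a] * r[b] for r in rows) / n for b in range(d)]
           for a in range(d)]
    v = [1.0] * d
    for _ in range(iters):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

def two_means_1d(scores, iters=50):
    """Plain k-means with k=2 on one-dimensional scores."""
    c0, c1 = min(scores), max(scores)
    labels = []
    for _ in range(iters):
        labels = [0 if abs(s - c0) <= abs(s - c1) else 1 for s in scores]
        g0 = [s for s, lab in zip(scores, labels) if lab == 0]
        g1 = [s for s, lab in zip(scores, labels) if lab == 1]
        c0, c1 = sum(g0) / len(g0), sum(g1) / len(g1)
    return labels

# Hypothetical per-speaker parameters (one row per speaker):
# mean utterance length, pause rate, politeness markers per turn.
speakers = [
    [4.0, 0.10, 1.0],
    [5.0, 0.12, 1.2],
    [4.5, 0.11, 0.9],
    [12.0, 0.30, 3.1],
    [13.0, 0.33, 2.9],
    [11.5, 0.31, 3.0],
]

z = standardize(speakers)
axis = first_component(z)
scores = [sum(x * a for x, a in zip(row, axis)) for row in z]
labels = two_means_1d(scores)
print(labels)
```

In this toy data the first three speakers and the last three fall into different clusters; real corpora would need more components and a principled choice of the number of classes.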

CiteSeerX

Crossref

Topic Modeling with Structured Priors for Text-Driven Science

Author: Paul Michael John
Publication venue: The Busan Gyeongnam Mathematical Society
Publication date: 15/12/2016

Many scientific disciplines are being revolutionized by the explosion of public data on the web and social media, particularly in health and social sciences. For instance, by analyzing social media messages, we can instantly measure public opinion, understand population behaviors, and monitor events such as disease outbreaks and natural disasters. Taking advantage of these data sources requires tools that can make sense of massive amounts of unstructured and unlabeled text. Topic models, statistical models that posit low-dimensional representations of data, can uncover interesting latent structure in large text datasets and are popular tools for automatically identifying prominent themes in text. For example, prominent themes of discussion in social media might include politics and health. To be useful in scientific analyses, topic models must learn interpretable patterns that accurately correspond to real-world concepts of interest. This thesis will introduce topic models that can encode additional structures such as factorizations, hierarchies, and correlations of topics, and can incorporate supervision and domain knowledge. For example, topics about elections and Congressional legislation are related to each other (as part of a broader topic of “politics”), and certain political topics have partisan associations. These types of relations between topics can be modeled by formulating the Bayesian priors over parameters as functions of underlying “components,” which can be constrained in various ways to induce different structures. This approach is first introduced through a topic model called factorial LDA, which models a factorized structure in which topics are conceptually arranged in multiple dimensions. Factorial LDA can be used to model multiple types of information, for example topic and political ideology. 
We then introduce a family of structured-prior topic models called SPRITE, which creates a unifying representation that generalizes factorial LDA as well as other existing topic models, and creates a powerful framework for building new models. This thesis will also show how these topic models can be used in various scientific applications, such as extracting medical information from forums, measuring healthcare quality from patient reviews, and monitoring public opinion in social media.
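The idea of writing a Bayesian prior as a function of shared underlying components can be made concrete with a small sketch. The abstract does not give the functional form, so the log-linear form used here (Dirichlet parameters as the exponential of a weighted sum of component vectors) is an assumption, as are the vocabulary, the component values, the topic names, and the weights.

```python
import math
import random

def structured_prior(beta_k, omega):
    """Dirichlet parameters for one topic: exp of a weighted sum of components.

    beta_k: weights of this topic on each component (assumed form).
    omega:  one vector over the vocabulary per component.
    """
    vocab_size = len(omega[0])
    return [math.exp(sum(b * comp[v] for b, comp in zip(beta_k, omega)))
            for v in range(vocab_size)]

def sample_dirichlet(alpha, rng):
    """Draw from Dirichlet(alpha) by normalizing independent gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

vocab = ["vote", "senate", "vaccine", "clinic"]

# Hypothetical components shared across topics.
omega = [
    [2.0, 2.0, -2.0, -2.0],   # "politics" component
    [-2.0, -2.0, 2.0, 2.0],   # "health" component
]

# Hypothetical topic-to-component weights: two related political topics
# draw on the same component, so their priors favor the same words.
beta = {
    "elections": [1.0, 0.0],
    "legislation": [0.8, 0.0],
    "flu": [0.0, 1.0],
}

rng = random.Random(0)
priors = {t: structured_prior(b, omega) for t, b in beta.items()}
topics = {t: sample_dirichlet(a, rng) for t, a in priors.items()}

for name, dist in topics.items():
    print(name, [round(p, 3) for p in dist])
```

Constraining the weights in `beta` (for example, forcing them to be sparse or tree-structured) is one way such a formulation could induce hierarchies or factorizations among topics.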

JScholarship