Investigating the relationship between language model perplexity and IR precision-recall measures
An empirical study has been conducted investigating the relationship between the performance of an aspect-based language model, in terms of perplexity, and the corresponding information retrieval performance obtained. It is observed, on the corpora considered, that the perplexity of the language model has a systematic relationship with the achievable precision-recall performance, although the relationship is not statistically significant.
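The abstract above does not specify how perplexity is computed for its aspect-based model; as a minimal sketch of the quantity being related to precision-recall, the following uses an add-alpha smoothed unigram language model (the model choice, function name, and smoothing constant are illustrative assumptions, not taken from the paper):

```python
import math
from collections import Counter

def unigram_perplexity(train_tokens, test_tokens, alpha=1.0):
    """Perplexity of an add-alpha smoothed unigram LM on held-out text.

    Lower values mean the model predicts the held-out text better; it is
    this quantity that the study relates to achievable precision-recall.
    """
    counts = Counter(train_tokens)
    vocab = set(train_tokens) | set(test_tokens)
    denom = sum(counts.values()) + alpha * len(vocab)

    log_prob = 0.0
    for tok in test_tokens:
        log_prob += math.log((counts[tok] + alpha) / denom)
    return math.exp(-log_prob / len(test_tokens))

train = "the cat sat on the mat".split()
test = "the cat sat".split()
print(unigram_perplexity(train, test))
```

A real aspect-based model would replace the unigram estimate with a mixture over aspects, but the perplexity formula itself is unchanged.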
On a Topic Model for Sentences
Probabilistic topic models are generative models that describe the content of
documents by discovering the latent topics underlying them. However, the
structure of the textual input, and for instance the grouping of words in
coherent text spans such as sentences, contains much information which is
generally lost with these models. In this paper, we propose sentenceLDA, an
extension of LDA whose goal is to overcome this limitation by incorporating the
structure of the text in the generative and inference processes. We illustrate
the advantages of sentenceLDA by comparing it with LDA using both intrinsic
(perplexity) and extrinsic (text classification) evaluation tasks on different
text collections.
Augmenting Latent Dirichlet Allocation and Rank Threshold Detection with Ontologies
In an ever-increasing data-rich environment, actionable information must be extracted, filtered, and correlated from massive amounts of disparate, often free-text sources. The usefulness of the retrieved information depends on how we accomplish these steps and present the most relevant information to the analyst. One method for extracting information from free text is Latent Dirichlet Allocation (LDA), a document categorization technique to classify documents into cohesive topics. Although LDA accounts for some implicit relationships such as synonymy (same meaning), it often ignores other semantic relationships such as polysemy (different meanings), hyponymy (subordinate), meronymy (part of), and troponymy (manner). To compensate for this deficiency, we incorporate explicit word ontologies, such as WordNet, into the LDA algorithm to account for various semantic relationships. Experiments over the 20 Newsgroups, NIPS, OHSUMED, and IED document collections demonstrate that incorporating such knowledge improves the perplexity measure over LDA alone for given parameters. In addition, the same ontology augmentation improves recall and precision results for user queries.
Adaptive query-based sampling of distributed collections
As part of a Distributed Information Retrieval system, a description of each remote information resource, archive or repository is usually stored centrally in order to facilitate resource selection. The acquisition of precise resource descriptions is therefore an important phase in Distributed Information Retrieval, as the quality of such representations will impact on selection accuracy, and ultimately retrieval performance. While Query-Based Sampling is currently used for content discovery of uncooperative resources, the application of this technique is dependent upon heuristic guidelines to determine when a sufficiently accurate representation of each remote resource has been obtained. In this paper we address this shortcoming by using the Predictive Likelihood to provide both an indication of the quality of an acquired resource description estimate, and an indication of when a sufficiently good representation of a resource has been obtained during Query-Based Sampling.
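The predictive-likelihood idea above can be sketched as a stopping rule: keep adding query-sampled batches of documents until the held-out likelihood of a probe set stabilizes. This toy implementation is an assumed reading of the technique, not the paper's actual method; the function names, smoothing, and probe set are all illustrative:

```python
import math
from collections import Counter

def avg_log_likelihood(model_counts, total, vocab_size, probe_tokens, alpha=1.0):
    """Per-token predictive log-likelihood of probe text under a smoothed
    unigram model of the documents sampled so far."""
    denom = total + alpha * vocab_size
    ll = sum(math.log((model_counts[t] + alpha) / denom) for t in probe_tokens)
    return ll / len(probe_tokens)

def sample_until_stable(batches, probe_tokens, epsilon=0.01):
    """Add sampled batches until predictive likelihood stops improving.

    `batches` is an iterable of token lists (e.g. one per query's results);
    returns how many batches were needed and the accumulated counts.
    """
    counts, total = Counter(), 0
    vocab = set(probe_tokens)
    prev = float("-inf")
    used = 0
    for batch in batches:
        counts.update(batch)
        total += len(batch)
        vocab |= set(batch)
        used += 1
        cur = avg_log_likelihood(counts, total, len(vocab), probe_tokens)
        if abs(cur - prev) < epsilon:
            break
        prev = cur
    return used, counts

batches = [["retrieval", "model"]] * 3
used, _ = sample_until_stable(batches, ["retrieval", "model"])
print(used)  # stops as soon as the likelihood stabilizes
```

The point of the criterion is that it replaces fixed heuristic sample-size thresholds with a data-driven convergence check.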
Statistical modeling of biomedical corpora: mining the Caenorhabditis Genetic Center Bibliography for genes related to life span
BACKGROUND: The statistical modeling of biomedical corpora could yield integrated, coarse-to-fine views of biological phenomena that complement discoveries made from analysis of molecular sequence and profiling data. Here, the potential of such modeling is demonstrated by examining the 5,225 free-text items in the Caenorhabditis Genetic Center (CGC) Bibliography using techniques from statistical information retrieval. Items in the CGC biomedical text corpus were modeled using the Latent Dirichlet Allocation (LDA) model. LDA is a hierarchical Bayesian model which represents a document as a random mixture over latent topics; each topic is characterized by a distribution over words. RESULTS: An LDA model estimated from CGC items had better predictive performance than two standard models (unigram and mixture of unigrams) trained using the same data. To illustrate the practical utility of LDA models of biomedical corpora, a trained CGC LDA model was used for a retrospective study of nematode genes known to be associated with life span modification. Corpus-, document-, and word-level LDA parameters were combined with terms from the Gene Ontology to enhance the explanatory value of the CGC LDA model, and to suggest additional candidates for age-related genes. A novel, pairwise document similarity measure based on the posterior distribution on the topic simplex was formulated and used to search the CGC database for "homologs" of a "query" document discussing the life span-modifying clk-2 gene. Inspection of these document homologs facilitated the generation of hypotheses about the function and role of clk-2. CONCLUSION: Like other graphical models for genetic, genomic and other types of biological data, LDA provides a method for extracting unanticipated insights and generating predictions amenable to subsequent experimental validation.
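The abstract describes a pairwise document similarity based on the posterior distribution on the topic simplex but does not give the exact formula. The Hellinger distance is one common choice for comparing per-document topic distributions and serves here as an illustrative stand-in, not the paper's actual measure; the toy topic mixtures are invented:

```python
import numpy as np

def hellinger(p, q):
    """Hellinger distance between two points on the topic simplex.

    0 for identical distributions, 1 for distributions with disjoint
    support; a common metric for comparing topic mixtures.
    """
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)))

doc_a = [0.7, 0.2, 0.1]   # e.g. a document dominated by one topic
doc_b = [0.6, 0.3, 0.1]   # similar topic mixture
doc_c = [0.1, 0.1, 0.8]   # dissimilar topic mixture
print(hellinger(doc_a, doc_b))  # small distance
print(hellinger(doc_a, doc_c))  # larger distance
```

Ranking a corpus by this distance from a "query" document's topic mixture is what yields document "homologs" in the sense used above.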
Improving Searchability of Automatically Transcribed Lectures Through Dynamic Language Modelling
Recording university lectures through lecture capture systems is increasingly common.
However, a single continuous audio recording is often unhelpful for users, who may wish
to navigate quickly to a particular part of a lecture, or locate a specific lecture within a set
of recordings.
A transcript of the recording can enable faster navigation and searching. Automatic speech
recognition (ASR) technologies may be used to create automated transcripts, to avoid the
significant time and cost involved in manual transcription.
Low accuracy of ASR-generated transcripts may however limit their usefulness. In
particular, ASR systems optimized for general speech recognition may not recognize the
many technical or discipline-specific words occurring in university lectures. To improve
the usefulness of ASR transcripts for the purposes of information retrieval (search) and
navigating within recordings, the lexicon and language model used by the ASR engine may
be dynamically adapted for the topic of each lecture.
A prototype is presented which uses the English Wikipedia as a semantically dense, large
language corpus to generate a custom lexicon and language model for each lecture from a
small set of keywords. Two strategies for extracting a topic-specific subset of Wikipedia
articles are investigated: a naïve crawler which follows all article links from a set of seed
articles produced by a Wikipedia search from the initial keywords, and a refinement which
follows only links to articles sufficiently similar to the parent article. Pair-wise article
similarity is computed from a pre-computed vector space model of Wikipedia article term
scores generated using latent semantic indexing.
The CMU Sphinx4 ASR engine is used to generate transcripts from thirteen recorded
lectures from Open Yale Courses, using the English HUB4 language model as a reference
and the two topic-specific language models generated for each lecture from Wikipedia.
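The refined crawler's link filter described above, pairwise article similarity in a latent semantic indexing space, can be sketched as follows. The toy term-document matrix, dimensionality, and threshold are invented for illustration; the dissertation's actual corpus and cutoff are not given here:

```python
import numpy as np

def lsi_embed(term_doc, k=2):
    """Project a term-document count matrix into a k-dimensional latent
    space via truncated SVD (latent semantic indexing)."""
    u, s, vt = np.linalg.svd(term_doc, full_matrices=False)
    return (s[:k, None] * vt[:k]).T          # one k-dim row per document

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Tiny illustrative corpus: rows are terms, columns are articles.
term_doc = np.array([
    [2, 1, 0],   # "quantum"
    [1, 2, 0],   # "mechanics"
    [0, 0, 2],   # "football"
], dtype=float)

docs = lsi_embed(term_doc, k=2)
# Follow a link only if the target article is similar enough to its parent.
THRESHOLD = 0.5  # illustrative cutoff
print(cosine(docs[0], docs[1]) > THRESHOLD)   # related articles
print(cosine(docs[0], docs[2]) > THRESHOLD)   # unrelated article
```

In the crawler, the embedding would be precomputed once over Wikipedia, and only the cosine test would run per link.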
Listener Modeling and Context-aware Music Recommendation Based on Country Archetypes
Music preferences are strongly shaped by the cultural and socio-economic
background of the listener, which is reflected, to a considerable extent, in
country-specific music listening profiles. Previous work has already identified
several country-specific differences in the popularity distribution of music
artists listened to. In particular, what constitutes the "music mainstream"
strongly varies between countries. To complement and extend these results, the
article at hand delivers the following major contributions: First, using
state-of-the-art unsupervised learning techniques, we identify and thoroughly
investigate (1) country profiles of music preferences on the fine-grained level
of music tracks (in contrast to earlier work that relied on music preferences
on the artist level) and (2) country archetypes that subsume countries sharing
similar patterns of listening preferences. Second, we formulate four user
models that leverage the user's country information on music preferences. Among
others, we propose a user modeling approach to describe a music listener as a
vector of similarities over the identified country clusters or archetypes.
Third, we propose a context-aware music recommendation system that leverages
implicit user feedback, where context is defined via the four user models. More
precisely, it is a multi-layer generative model based on a variational
autoencoder, in which contextual features can influence recommendations through
a gating mechanism. Fourth, we thoroughly evaluate the proposed recommendation
system and user models on a real-world corpus of more than one billion
listening records of users around the world (out of which we use 369 million in
our experiments) and show its merits vis-a-vis state-of-the-art algorithms that
do not exploit this type of context information.
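One of the four user models above describes a listener as a vector of similarities over the identified country clusters or archetypes. A minimal sketch under an assumed cosine similarity follows; the centroids, preference vector, and similarity choice are invented for illustration, not taken from the article:

```python
import numpy as np

def archetype_profile(user_pref, cluster_centroids):
    """Describe a listener as a vector of cosine similarities between the
    user's track-preference vector and each country-archetype centroid."""
    sims = []
    for c in cluster_centroids:
        sims.append(float(user_pref @ c /
                          (np.linalg.norm(user_pref) * np.linalg.norm(c))))
    return np.array(sims)

# Rows: per-track listening frequencies aggregated per country archetype.
centroids = np.array([
    [0.8, 0.1, 0.1],   # archetype A: mostly track 1
    [0.1, 0.8, 0.1],   # archetype B: mostly track 2
], dtype=float)
user = np.array([0.7, 0.2, 0.1])
print(archetype_profile(user, centroids))
```

The resulting fixed-length profile can then feed the gating mechanism of a context-aware recommender as one of its contextual features.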
Computational Methods for Analyzing Health News Coverage
Researchers who investigate the media's coverage of health have historically relied on keyword searches to retrieve relevant health news coverage, and manual content analysis methods to categorize and score health news text. These methods are problematic. Manual content analysis methods are labor-intensive, time-consuming, and inherently subjective because they rely on human coders to review, score, and annotate content. Retrieving relevant health news coverage using keywords can be challenging because manually defining an optimal keyword query, especially for complex health topics and media analysis concepts, can be very difficult, and the optimal query may vary based on when the news was published, the type of news published, and the target audience of the news coverage. This dissertation research investigated computational methods that can assist health news investigators by facilitating these tasks. The first step was to identify the research methods currently used by investigators, and the research questions and health topics researchers tend to investigate. To capture this information, an extensive literature review of health news analyses was performed. No literature review of this type and scope could be found in the research literature. This review confirmed that researchers overwhelmingly rely on manual content analysis methods to analyze the text of health news coverage, and on the use of keyword searching to identify relevant health news articles. To investigate the use of computational methods for facilitating these tasks, classifiers that categorize health news on relevance to the topic of obesity, and on their news framing, were developed and evaluated. The obesity news classifier developed for this dissertation outperformed alternative methods, including searching based on keyword appearance. Classifying on the framing of health news proved to be a more difficult task.
The news framing classifiers performed well, but the results suggest that the underlying features of health news coverage that contribute to the framing of health news are a richer and more useful source of framing information than binary news framing classifications. The third step in this dissertation was to use the findings of the literature review and the classifier studies to design the SalientHealthNews system. The purpose of SalientHealthNews is to facilitate the use of computational and data mining techniques for health news investigation, hypothesis testing, and hypothesis generation. To illustrate the use of SalientHealthNews' features and algorithms, it was used to generate preliminary data for a study investigating how framing features vary in health and obesity news coverage that discusses populations with health disparities. This research contributes to the study of the media's coverage of health by providing a detailed description of how health news is studied and what health news topics are investigated, then by demonstrating that certain tasks performed in health news analyses can be facilitated by computational methods, and lastly by describing the design of a system that will facilitate the use of computational and data mining techniques for the study of health news. These contributions should further the study of health news by expanding the methods available to health news analysis researchers. This will lead to researchers being better equipped to accurately and consistently evaluate the media's coverage of health. Knowledge of the quality of health news coverage should in turn lead to better-informed health journalists, healthcare providers, and healthcare consumers, ultimately improving individual and public health.