96 research outputs found
Exploring Topic-based Language Models for Effective Web Information Retrieval
The main obstacle for providing focused search is the relative opaqueness of search request -- searchers tend to express their complex information needs in only a couple of keywords. Our overall aim is to find out if, and how, topic-based language models can lead to more effective web information retrieval. In this paper we explore retrieval performance of a topic-based model that combines topical models with other language models based on cross-entropy. We first define our topical categories and train our topical models on the .GOV2 corpus by building parsimonious language models. We then test the topic-based model on TREC8 small Web data collection for ad-hoc search.Our experimental results show that the topic-based model outperforms the standard language model and parsimonious model
Evaluation of probabilistic models for word frequency and information retrieval
Volume 2 Issue 9 (September 2014
Knowledge-based Query Expansion in Real-Time Microblog Search
Since the length of microblog texts, such as tweets, is strictly limited to
140 characters, traditional Information Retrieval techniques suffer from the
vocabulary mismatch problem severely and cannot yield good performance in the
context of microblogosphere. To address this critical challenge, in this paper,
we propose a new language modeling approach for microblog retrieval by
inferring various types of context information. In particular, we expand the
query using knowledge terms derived from Freebase so that the expanded one can
better reflect users' search intent. Besides, in order to further satisfy
users' real-time information need, we incorporate temporal evidences into the
expansion method, which can boost recent tweets in the retrieval results with
respect to a given topic. Experimental results on two official TREC Twitter
corpora demonstrate the significant superiority of our approach over baseline
methods.Comment: 9 pages, 9 figure
Dating Texts without Explicit Temporal Cues
This paper tackles temporal resolution of documents, such as determining when
a document is about or when it was written, based only on its text. We apply
techniques from information retrieval that predict dates via language models
over a discretized timeline. Unlike most previous works, we rely {\it solely}
on temporal cues implicit in the text. We consider both document-likelihood and
divergence based techniques and several smoothing methods for both of them. Our
best model predicts the mid-point of individuals' lives with a median of 22 and
mean error of 36 years for Wikipedia biographies from 3800 B.C. to the present
day. We also show that this approach works well when training on such
biographies and predicting dates both for non-biographical Wikipedia pages
about specific years (500 B.C. to 2010 A.D.) and for publication dates of short
stories (1798 to 2008). Together, our work shows that, even in absence of
temporal extraction resources, it is possible to achieve remarkable temporal
locality across a diverse set of texts
Automatic tagging and geotagging in video collections and communities
Automatically generated tags and geotags hold great promise
to improve access to video collections and online communi-
ties. We overview three tasks offered in the MediaEval 2010
benchmarking initiative, for each, describing its use scenario, definition and the data set released. For each task, a reference algorithm is presented that was used within MediaEval 2010 and comments are included on lessons learned. The Tagging Task, Professional involves automatically matching episodes in a collection of Dutch television with subject labels drawn from the keyword thesaurus used by the archive staff. The Tagging Task, Wild Wild Web involves automatically predicting the tags that are assigned by users to their online videos. Finally, the Placing Task requires automatically assigning geo-coordinates to videos. The specification of each task admits the use of the full range of available information including user-generated metadata, speech recognition transcripts, audio, and visual features
- …