4,925 research outputs found
EveTAR: Building a Large-Scale Multi-Task Test Collection over Arabic Tweets
This article introduces a new language-independent approach for creating a
large-scale high-quality test collection of tweets that supports multiple
information retrieval (IR) tasks without running a shared-task campaign. The
adopted approach (demonstrated over Arabic tweets) designs the collection
around significant (i.e., popular) events, which enables the development of
topics that represent frequent information needs of Twitter users for which
rich content exists. That inherently facilitates the support of multiple tasks
that generally revolve around events, namely event detection, ad-hoc search,
timeline generation, and real-time summarization. The key highlights of the
approach include diversifying the judgment pool via interactive search and
multiple manually-crafted queries per topic, collecting high-quality
annotations via crowd-workers for relevancy and in-house annotators for
novelty, filtering out low-agreement topics and inaccessible tweets, and
providing multiple subsets of the collection for better availability. Applying
our methodology on Arabic tweets resulted in EveTAR , the first
freely-available tweet test collection for multiple IR tasks. EveTAR includes a
crawl of 355M Arabic tweets and covers 50 significant events for which about
62K tweets were judged with substantial average inter-annotator agreement
(Kappa value of 0.71). We demonstrate the usability of EveTAR by evaluating
existing algorithms in the respective tasks. Results indicate that the new
collection can support reliable ranking of IR systems that is comparable to
similar TREC collections, while providing strong baseline results for future
studies over Arabic tweets
Teaching a New Dog Old Tricks: Resurrecting Multilingual Retrieval Using Zero-shot Learning
While billions of non-English speaking users rely on search engines every
day, the problem of ad-hoc information retrieval is rarely studied for
non-English languages. This is primarily due to a lack of data set that are
suitable to train ranking algorithms. In this paper, we tackle the lack of data
by leveraging pre-trained multilingual language models to transfer a retrieval
system trained on English collections to non-English queries and documents. Our
model is evaluated in a zero-shot setting, meaning that we use them to predict
relevance scores for query-document pairs in languages never seen during
training. Our results show that the proposed approach can significantly
outperform unsupervised retrieval techniques for Arabic, Chinese Mandarin, and
Spanish. We also show that augmenting the English training collection with some
examples from the target language can sometimes improve performance.Comment: ECIR 2020 (short
Concept-based Interactive Query Expansion Support Tool (CIQUEST)
This report describes a three-year project (2000-03) undertaken in the Information Studies
Department at The University of Sheffield and funded by Resource, The Council for
Museums, Archives and Libraries. The overall aim of the research was to provide user
support for query formulation and reformulation in searching large-scale textual resources
including those of the World Wide Web. More specifically the objectives were: to investigate
and evaluate methods for the automatic generation and organisation of concepts derived from
retrieved document sets, based on statistical methods for term weighting; and to conduct
user-based evaluations on the understanding, presentation and retrieval effectiveness of
concept structures in selecting candidate terms for interactive query expansion.
The TREC test collection formed the basis for the seven evaluative experiments conducted in
the course of the project. These formed four distinct phases in the project plan. In the first
phase, a series of experiments was conducted to investigate further techniques for concept
derivation and hierarchical organisation and structure. The second phase was concerned with
user-based validation of the concept structures. Results of phases 1 and 2 informed on the
design of the test system and the user interface was developed in phase 3. The final phase
entailed a user-based summative evaluation of the CiQuest system.
The main findings demonstrate that concept hierarchies can effectively be generated from
sets of retrieved documents and displayed to searchers in a meaningful way. The approach
provides the searcher with an overview of the contents of the retrieved documents, which in
turn facilitates the viewing of documents and selection of the most relevant ones. Concept
hierarchies are a good source of terms for query expansion and can improve precision. The
extraction of descriptive phrases as an alternative source of terms was also effective. With
respect to presentation, cascading menus were easy to browse for selecting terms and for
viewing documents. In conclusion the project dissemination programme and future work are
outlined
The relationship of word error rate to document ranking
This paper describes two experiments that examine the relationship of Word Error Rate (WER) of retrieved
spoken documents returned by a spoken document retrieval system. Previous work has demonstrated that
recognition errors do not significantly affect retrieval effectiveness but whether they will adversely affect
relevance judgement remains unclear. A user-based experiment measuring ability to judge relevance from
the recognised text presented in a retrieved result list was conducted. The results indicated that users were
capable of judging relevance accurately despite transcription errors. This lead an examination of the
relationship of WER in retrieved audio documents to their rank position when retrieved for a particular
query. Here it was shown that WER was somewhat lower for top ranked documents than it was for
documents retrieved further down the ranking, thereby indicating a possible explanation for the success of
the user experiment
A Vertical PRF Architecture for Microblog Search
In microblog retrieval, query expansion can be essential to obtain good
search results due to the short size of queries and posts. Since information in
microblogs is highly dynamic, an up-to-date index coupled with pseudo-relevance
feedback (PRF) with an external corpus has a higher chance of retrieving more
relevant documents and improving ranking. In this paper, we focus on the
research question:how can we reduce the query expansion computational cost
while maintaining the same retrieval precision as standard PRF? Therefore, we
propose to accelerate the query expansion step of pseudo-relevance feedback.
The hypothesis is that using an expansion corpus organized into verticals for
expanding the query, will lead to a more efficient query expansion process and
improved retrieval effectiveness. Thus, the proposed query expansion method
uses a distributed search architecture and resource selection algorithms to
provide an efficient query expansion process. Experiments on the TREC Microblog
datasets show that the proposed approach can match or outperform standard PRF
in MAP and NDCG@30, with a computational cost that is three orders of magnitude
lower.Comment: To appear in ICTIR 201
On the Impact of Entity Linking in Microblog Real-Time Filtering
Microblogging is a model of content sharing in which the temporal locality of
posts with respect to important events, either of foreseeable or unforeseeable
nature, makes applica- tions of real-time filtering of great practical
interest. We propose the use of Entity Linking (EL) in order to improve the
retrieval effectiveness, by enriching the representation of microblog posts and
filtering queries. EL is the process of recognizing in an unstructured text the
mention of relevant entities described in a knowledge base. EL of short pieces
of text is a difficult task, but it is also a scenario in which the information
EL adds to the text can have a substantial impact on the retrieval process. We
implement a start-of-the-art filtering method, based on the best systems from
the TREC Microblog track realtime adhoc retrieval and filtering tasks , and
extend it with a Wikipedia-based EL method. Results show that the use of EL
significantly improves over non-EL based versions of the filtering methods.Comment: 6 pages, 1 figure, 1 table. SAC 2015, Salamanca, Spain - April 13 -
17, 201
- …