Search CORE

3,299 research outputs found

Teaching a New Dog Old Tricks: Resurrecting Multilingual Retrieval Using Zero-shot Learning

Author: B Mitra
C Carpineto
C Peters
G Amati
KD Onal
M Braschler
M Johnson
P Yang
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 30/12/2019
Field of study

While billions of non-English speaking users rely on search engines every day, the problem of ad-hoc information retrieval is rarely studied for non-English languages. This is primarily due to a lack of data set that are suitable to train ranking algorithms. In this paper, we tackle the lack of data by leveraging pre-trained multilingual language models to transfer a retrieval system trained on English collections to non-English queries and documents. Our model is evaluated in a zero-shot setting, meaning that we use them to predict relevance scores for query-document pairs in languages never seen during training. Our results show that the proposed approach can significantly outperform unsupervised retrieval techniques for Arabic, Chinese Mandarin, and Spanish. We also show that augmenting the English training collection with some examples from the target language can sometimes improve performance.Comment: ECIR 2020 (short

arXiv.org e-Print Archive

Crossref

The SPIRIT collection: an overview of a large web collection

Author: Cacheda F.
Hideo Joho
Mark Sanderson
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 01/01/2004
Field of study

A large scale collection of web pages has been essential for research in information retrieval and related areas. This paper provides an overview of a large web collection used in the SPIRIT project for the design and testing of spatially-aware retrieval systems. Several statistics are derived and presented to show the characteristics of the collection

CiteSeerX

Crossref

White Rose Research Online

The relationship of word error rate to document ranking

Author: Mang Shou X.
Sanderson M.
Tuffs N.
Publication venue
Publication date: 01/01/2003
Field of study

This paper describes two experiments that examine the relationship of Word Error Rate (WER) of retrieved spoken documents returned by a spoken document retrieval system. Previous work has demonstrated that recognition errors do not significantly affect retrieval effectiveness but whether they will adversely affect relevance judgement remains unclear. A user-based experiment measuring ability to judge relevance from the recognised text presented in a retrieved result list was conducted. The results indicated that users were capable of judging relevance accurately despite transcription errors. This lead an examination of the relationship of WER in retrieved audio documents to their rank position when retrieved for a particular query. Here it was shown that WER was somewhat lower for top ranked documents than it was for documents retrieved further down the ranking, thereby indicating a possible explanation for the success of the user experiment

CiteSeerX

White Rose Research Online

Entity Ranking on Graphs: Studies on Expert Finding

Author: Hiemstra D.
Rode H.
Serdyukov P.
Zaragoza H.
Publication venue: Centre for Telematics and Information Technology, University of Twente
Publication date: 01/01/2007
Field of study

Todays web search engines try to offer services for finding various information in addition to simple web pages, like showing locations or answering simple fact queries. Understanding the association of named entities and documents is one of the key steps towards such semantic search tasks. This paper addresses the ranking of entities and models it in a graph-based relevance propagation framework. In particular we study the problem of expert finding as an example of an entity ranking task. Entity containment graphs are introduced that represent the relationship between text fragments on the one hand and their contained entities on the other hand. The paper shows how these graphs can be used to propagate relevance information from the pre-ranked text fragments to their entities. We use this propagation framework to model existing approaches to expert finding based on the entity's indegree and extend them by recursive relevance propagation based on a probabilistic random walk over the entity containment graphs. Experiments on the TREC expert search task compare the retrieval performance of the different graph and propagation models

CiteSeerX

Radboud Repository

University of Twente Research Information

Interactive information retrieval

Author: Allan
Barry
Bates
Beaulieu
Beaulieu
Belkin
Belkin
Bhavnani
Blair
Borgman
Borgman
Brajnik
Broder
Buyukkokten
Byström
Campbell
Case
Chen
Cove
Crestani
Crouch
Downie
Dumais
Eastman
Efthimiadis
Ellis
Ellis
Fidel
Ford
Ford
Foster
Fox
Hansen
Harper
Hearst
Hearst
Hearst
Heinström
Hill
Ingwersen
Ingwersen
Jansen
Jansen
Jones
Jones
Kang
Kelly
Kelly
Kim
Konstan
Kruschwitz
Kuhlthau
Legg
Lin
Lin
Lorigo
Lynch
López-Ostenero
Maña-López
Niemi
Norman
Over
Pirkola
Pu
Radev
Reid
Reid
Riedl
Rieh
Robertson
Rosenfeld
Roussinov
Ruthven
Ruthven
Savolainen
Shipman
Shneiderman
Sihvonen
Slone
Smeaton
Spink
Spink
Spink
Spink
Spink
Spink
Spärck Jones
Spärck Jones
Sweeney
Tombros
Tombros
Toms
Topi
Topi
Vakkari
Vakkari
Vakkari
Vakkari
van der Eijk
Vechtomova
Voorhees
White
White
White
White
Wiesman
Wu
Xie
Publication venue: 'Wiley'
Publication date: 01/11/2008
Field of study

Crossref

University of Strathclyde Institutional Repository

Concept-based Interactive Query Expansion Support Tool (CIQUEST)

Author: Beaulieu M.
Joho H.
Sanderson M.
Publication venue: Resource: The Council for Museums, Archives and Libraries
Publication date: 01/01/2003
Field of study

This report describes a three-year project (2000-03) undertaken in the Information Studies Department at The University of Sheffield and funded by Resource, The Council for Museums, Archives and Libraries. The overall aim of the research was to provide user support for query formulation and reformulation in searching large-scale textual resources including those of the World Wide Web. More specifically the objectives were: to investigate and evaluate methods for the automatic generation and organisation of concepts derived from retrieved document sets, based on statistical methods for term weighting; and to conduct user-based evaluations on the understanding, presentation and retrieval effectiveness of concept structures in selecting candidate terms for interactive query expansion. The TREC test collection formed the basis for the seven evaluative experiments conducted in the course of the project. These formed four distinct phases in the project plan. In the first phase, a series of experiments was conducted to investigate further techniques for concept derivation and hierarchical organisation and structure. The second phase was concerned with user-based validation of the concept structures. Results of phases 1 and 2 informed on the design of the test system and the user interface was developed in phase 3. The final phase entailed a user-based summative evaluation of the CiQuest system. The main findings demonstrate that concept hierarchies can effectively be generated from sets of retrieved documents and displayed to searchers in a meaningful way. The approach provides the searcher with an overview of the contents of the retrieved documents, which in turn facilitates the viewing of documents and selection of the most relevant ones. Concept hierarchies are a good source of terms for query expansion and can improve precision. The extraction of descriptive phrases as an alternative source of terms was also effective. With respect to presentation, cascading menus were easy to browse for selecting terms and for viewing documents. In conclusion the project dissemination programme and future work are outlined

White Rose Research Online

Search of spoken documents retrieves well recognized transcripts

Author: Sanderson M.
Shou X.M.
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/06/2007
Field of study

This paper presents a series of analyses and experiments on spoken document retrieval systems: search engines that retrieve transcripts produced by speech recognizers. Results show that transcripts that match queries well tend to be recognized more accurately than transcripts that match a query less well. This result was described in past literature, however, no study or explanation of the effect has been provided until now. This paper provides such an analysis showing a relationship between word error rate and query length. The paper expands on past research by increasing the number of recognitions systems that are tested as well as showing the effect in an operational speech retrieval system. Potential future lines of enquiry are also described

White Rose Research Online