Metadata harvesting for content-based distributed information retrieval

Anan; Bailey; Bowman; Callan; Callan; Callan; Callan; Callan; Callan; Carmel; Chou; Craswell; Crow; DCMI; de Sompel; de Sompel; Dijk; French; Gatenby; Gravano; Joint Information Systems Committee; Lagoze; Lagoze; Lagoze; Larson; Liu; Lu; Lu; Lu; Lynch; Nelson; Nottelmann; Paepcke; Sanderson; Simeoni; Simon; Simons; Suleman; van der Kuil; Warner; Witten; Yang; Z39.50 Maintenance Agency


Publication date: 1 January 2007

Abstract

We propose an approach to content-based Distributed Information Retrieval based on the periodic and incremental centralisation of full-content indices of widely dispersed and autonomously managed document sources. Inspired by the success of the Open Archives Initiative's protocol for metadata harvesting, the approach occupies the middle ground between content crawling and distributed retrieval. As in crawling, some data moves towards the retrieval process, but it is statistics about the content rather than the content itself; this allows more efficient use of network resources and a wider scope of application. As in distributed retrieval, some processing is distributed along with the data, but it is indexing rather than retrieval; this reduces the costs of content provision whilst promoting the simplicity, effectiveness, and responsiveness of retrieval. Overall, we argue that the approach retains the good properties of centralised retrieval without giving up cost-effective, large-scale resource pooling. We discuss the requirements associated with the approach and identify two strategies to deploy it on top of the OAI infrastructure. In particular, we define a minimal extension of the OAI protocol which supports the coordinated harvesting of full-content indices and descriptive metadata for content resources. Finally, we report on the implementation of a proof-of-concept prototype service for multi-model content-based retrieval of distributed file collections.
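The division of labour described in the abstract (indexing at the source, retrieval at the centre, periodic incremental transfer of index statistics) can be illustrated with a small in-memory sketch. It is not the paper's protocol extension or prototype: the `Provider` and `Harvester` classes, the `list_records` method, the integer datestamps, and the term-frequency scoring are all hypothetical choices made for illustration, and no network transport is modelled.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Provider:
    """Hypothetical content source that indexes its own documents locally."""
    name: str
    # doc_id -> (datestamp, term frequencies); content itself is never stored here
    records: dict = field(default_factory=dict)

    def add_document(self, doc_id, text, datestamp):
        # Local indexing: reduce the document to term counts.
        self.records[doc_id] = (datestamp, Counter(text.lower().split()))

    def list_records(self, since=None):
        # Assumed interface: return only records stamped after `since`.
        return [
            (doc_id, stamp, terms)
            for doc_id, (stamp, terms) in sorted(self.records.items())
            if since is None or stamp > since
        ]


class Harvester:
    """Hypothetical central service that merges harvested index statistics."""

    def __init__(self):
        self.index = {}         # term -> {(provider, doc_id): frequency}
        self.last_harvest = {}  # provider name -> latest datestamp seen

    def harvest(self, provider):
        since = self.last_harvest.get(provider.name)
        fetched = provider.list_records(since)
        for doc_id, stamp, terms in fetched:
            key = (provider.name, doc_id)
            # Remove stale postings in case the document was re-indexed.
            for postings in self.index.values():
                postings.pop(key, None)
            for term, freq in terms.items():
                self.index.setdefault(term, {})[key] = freq
            if since is None or stamp > since:
                since = stamp
        if since is not None:
            self.last_harvest[provider.name] = since
        return len(fetched)

    def search(self, query):
        # Retrieval runs entirely against the central index.
        scores = Counter()
        for term in query.lower().split():
            for key, freq in self.index.get(term, {}).items():
                scores[key] += freq
        return [key for key, _ in scores.most_common()]
```

In this sketch a second call to `harvest` on the same provider transfers only the records added or re-indexed since the previous call, which is the incremental behaviour the abstract refers to; `search` never contacts the providers.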


Available Versions

Crossref

Last time updated on 10/12/2020

RERO DOC Digital Library

oai:doc.rero.ch:20090112144509...

Last time updated on 20/08/2014