Active Sampling for Large-scale Information Retrieval Evaluation
Evaluation is crucial in Information Retrieval. The development of models,
tools and methods has significantly benefited from the availability of reusable
test collections formed through a standardized and thoroughly tested
methodology, known as the Cranfield paradigm. Constructing these collections
requires obtaining relevance judgments for a pool of documents, retrieved by
systems participating in an evaluation task, which involves immense human labor.
To alleviate this effort, different methods for constructing collections have
been proposed in the literature, falling under two broad categories: (a)
sampling, and (b) active selection of documents. The former devises a smart
sampling strategy by choosing only a subset of documents to be assessed and
inferring evaluation measures on the basis of the obtained sample; the sampling
distribution is fixed at the beginning of the process. The latter
recognizes that systems contributing documents to be judged vary in quality,
and actively selects documents from good systems. The quality of the systems is
re-estimated each time a new document is judged. In this paper we seek to
solve the problem of large-scale retrieval evaluation by combining the two
approaches. We devise an active sampling method that avoids the bias of
active selection methods towards good systems and, at the same time, reduces
the variance of current sampling approaches by placing a distribution over
systems that varies as judgments become available. We validate the proposed
method using TREC data and demonstrate the advantages of this new method
compared to past approaches.
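
The abstract stops short of estimator details, so the following Python sketch is only an illustration of the idea, not the authors' method: per-system quality weights updated as judgments arrive, document sampling driven by those weights, and an inverse-probability correction so that favouring good systems does not bias the estimate. The pool structure, the unit reward rule, and the estimated quantity (number of relevant documents in the pool) are assumptions made for the example.

import random
from collections import defaultdict

def active_sample(pools, judge, budget):
    """Active sampling sketch: draw documents from a distribution over
    systems that shifts as relevance judgments arrive, then correct the
    estimate for the non-uniform sampling.

    pools  -- dict: system name -> list of document ids it retrieved
    judge  -- callable: document id -> 0/1 relevance (the human assessor)
    budget -- number of judgments we can afford
    """
    weights = {s: 1.0 for s in pools}  # current belief in each system's quality
    judged = {}                        # document id -> observed relevance
    draw_p = {}                        # document id -> probability it was drawn

    for _ in range(budget):
        total = sum(weights.values())
        sys_p = {s: w / total for s, w in weights.items()}

        # Probability of drawing each unjudged document: choose a system
        # according to sys_p, then a uniform document from its unjudged pool.
        doc_p = defaultdict(float)
        for s, pool in pools.items():
            unjudged = [d for d in pool if d not in judged]
            for d in unjudged:
                doc_p[d] += sys_p[s] / len(unjudged)
        if not doc_p:
            break  # every pooled document has been judged

        docs, probs = zip(*doc_p.items())
        doc = random.choices(docs, weights=probs, k=1)[0]
        draw_p[doc] = doc_p[doc]
        judged[doc] = judge(doc)

        # Reward every system that retrieved a relevant document, shifting
        # the sampling distribution towards good systems over time.
        if judged[doc]:
            for s, pool in pools.items():
                if doc in pool:
                    weights[s] += 1.0

    # Inverse-probability (Horvitz-Thompson style) estimate of the number of
    # relevant documents; draw_p stands in for the true inclusion
    # probabilities that a full implementation would track.
    return sum(rel / draw_p[d] for d, rel in judged.items())

Note how the sketch sits between the two families described above: freezing the weights reduces it to static sampling, while deterministic selection from the best system with no correction reduces it to pure active selection.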
Crowd-annotation and LoD-based semantic indexing of content in multi-disciplinary web repositories to improve search results
Searching for relevant information in multi-disciplinary web
repositories is becoming a topic of increasing interest among the
computer science research community. To date, methods and techniques to extract useful and relevant information from
online repositories of research data have largely been based on
static full-text indexing, which entails a ‘produce once and use
forever’ strategy. That strategy is fast becoming
insufficient due to increasing data volumes, concept
obsolescence, and the complexity and heterogeneity of content types
in web repositories. We propose that through automatic semantic
annotation of content in web repositories using Linked Open
Data (LoD) sources, without relying on domain-specific ontologies,
we can sustain search performance by retrieving highly
relevant results. Secondly, we claim that by expert
crowd-annotation of content on top of automatic semantic
annotation, we can enrich the semantic index over time to
augment the contextual value of content in web repositories so
that it remains findable despite changes in language,
terminology, and scientific concepts. We deployed a custom-
built annotation, indexing and searching environment in a web
repository website that has been used by expert annotators to
annotate webpages using free text and vocabulary terms. We
present our findings based on the annotation and tagging data collected on
top of the LoD-based annotations, and describe the overall
modus operandi.
We also analyze and demonstrate that by adding expert
annotations to the existing semantic index, we can improve the
similarity between queries and documents, as measured with Cosine
Similarity Measures (CSM).
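
The abstract does not specify how the cosine comparisons were set up, so the Python sketch below only illustrates the claimed mechanism: appending LoD-based and expert annotation terms to a document's bag-of-words representation raises its cosine similarity with queries phrased in the annotators' vocabulary. All document, annotation, and query terms are invented for the example.

import math
from collections import Counter

def cosine_similarity(a, b):
    """Cosine similarity between two bag-of-words term lists."""
    va, vb = Counter(a), Counter(b)
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0

# Hypothetical index entry for one webpage in the repository.
full_text_terms = "spectral analysis of stellar atmospheres".split()

# Terms contributed by automatic LoD-based annotation (invented labels
# standing in for, e.g., linked DBpedia concepts).
lod_terms = ["astrophysics", "spectroscopy"]

# Terms contributed later by expert crowd-annotators (free text and
# vocabulary terms, also invented for this example).
expert_terms = ["stellar_classification", "astronomy", "spectroscopy"]

query = "astronomy spectroscopy".split()

print(cosine_similarity(query, full_text_terms))              # 0.0: no overlap
print(cosine_similarity(query, full_text_terms + lod_terms))  # ~0.27
print(cosine_similarity(query,
      full_text_terms + lod_terms + expert_terms))            # ~0.61

Each enrichment layer adds terms that overlap with the query, which is exactly the effect the expert annotations are claimed to have on the semantic index.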