Relevance Judgments between TREC and Non-TREC Assessors
This paper investigates the agreement of relevance assessments between official TREC judgments and those generated in an interactive IR experiment. Results show that 63% of documents judged relevant by our users matched official TREC judgments. Several factors contributed to differences in agreement: the number of retrieved relevant documents; the number of relevant documents judged; system effectiveness per topic; and the ranking of relevant documents.
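Agreement figures like the 63% above are typically reported either as raw overlap or as a chance-corrected statistic such as Cohen's kappa. A minimal sketch with hypothetical binary judgments (the lists below are illustrative, not the paper's data):

```python
def percent_agreement(a, b):
    """Fraction of documents on which two assessors give the same label."""
    assert len(a) == len(b)
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohens_kappa(a, b):
    """Chance-corrected agreement between two binary relevance assessors."""
    po = percent_agreement(a, b)
    # Expected agreement if both assessors labelled documents at random
    # with their observed label frequencies.
    pa1 = sum(a) / len(a)
    pb1 = sum(b) / len(b)
    pe = pa1 * pb1 + (1 - pa1) * (1 - pb1)
    return (po - pe) / (1 - pe)

# Hypothetical judgments: 1 = relevant, 0 = non-relevant
trec = [1, 1, 0, 0, 1, 0, 1, 0]
user = [1, 0, 0, 0, 1, 0, 1, 1]
```

Raw overlap can look high even when much of it is explained by chance, which is why kappa is often reported alongside it.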
A Comparison of Nuggets and Clusters for Evaluating Timeline Summaries
There is growing interest in systems that generate timeline summaries by filtering high-volume streams of documents to retain only those that are relevant to a particular event or topic. Continued advances in algorithms and techniques for this task depend on standardized and reproducible evaluation methodologies for comparing systems. However, timeline summary evaluation is still in its infancy, with competing methodologies currently being explored in international evaluation forums such as TREC. One area of active exploration is how to explicitly represent the units of information that should appear in a 'good' summary. Currently, there are two main approaches, one based on identifying nuggets in an external 'ground truth', and the other based on clustering system outputs. In this paper, by building test collections that have both nugget and cluster annotations, we are able to compare these two approaches. Specifically, we address questions related to evaluation effort, differences in the final evaluation products, and correlations between scores and rankings generated by both approaches. We summarize advantages and disadvantages of nuggets and clusters to offer recommendations for future system evaluation.
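Correlation between the system rankings produced by the two evaluation approaches is commonly measured with Kendall's tau. A minimal sketch, assuming strict rankings with no ties and using made-up system names:

```python
from itertools import combinations

def kendall_tau(rank_a, rank_b):
    """Kendall's tau between two strict rankings of the same systems,
    each given as a list of system names ordered best-to-worst."""
    pos_a = {s: i for i, s in enumerate(rank_a)}
    pos_b = {s: i for i, s in enumerate(rank_b)}
    concordant = discordant = 0
    for s, t in combinations(rank_a, 2):
        # A pair is concordant if both rankings order it the same way.
        if (pos_a[s] - pos_a[t]) * (pos_b[s] - pos_b[t]) > 0:
            concordant += 1
        else:
            discordant += 1
    n = len(rank_a)
    return (concordant - discordant) / (n * (n - 1) / 2)

# Hypothetical rankings from a nugget-based and a cluster-based evaluation
nugget_rank  = ["sysA", "sysB", "sysC", "sysD"]
cluster_rank = ["sysA", "sysC", "sysB", "sysD"]
```

A tau near 1.0 indicates the two methodologies would lead researchers to the same conclusions about which systems are better.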
Creating a test collection to evaluate diversity in image retrieval
This paper describes the adaptation of an existing test collection for image retrieval to enable diversity in the results set to be measured. Previous research has shown that a more diverse set of results often satisfies the needs of more users better than standard document rankings. To enable diversity to be quantified, it is necessary to classify images relevant to a given theme into one or more sub-topics or clusters. We describe the challenges in building (as far as we are aware) the first test collection for evaluating diversity in image retrieval. This includes selecting appropriate topics, creating sub-topics, and quantifying the overall effectiveness of a retrieval system. A total of 39 topics were augmented for cluster-based relevance, and we also provide an initial analysis of assessor agreement for grouping relevant images into sub-topics or clusters.
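One common way to quantify diversity once relevant images are grouped into sub-topics is sub-topic (cluster) recall: the fraction of a topic's clusters covered by at least one result in the top k. A minimal sketch with hypothetical images and cluster labels (the collection's own measure may differ):

```python
def subtopic_recall(ranked_docs, doc_subtopics, k):
    """Fraction of a topic's sub-topics covered by at least one
    relevant image in the top-k results (S-recall@k)."""
    all_subtopics = set()
    for subs in doc_subtopics.values():
        all_subtopics.update(subs)
    covered = set()
    for doc in ranked_docs[:k]:
        # Non-relevant images contribute nothing.
        covered.update(doc_subtopics.get(doc, ()))
    return len(covered) / len(all_subtopics)

# Hypothetical topic with four sub-topic clusters c1..c4
doc_subtopics = {"img1": {"c1"}, "img2": {"c1", "c2"},
                 "img3": {"c3"}, "img4": {"c4"}}
ranking = ["img1", "img2", "img5", "img3"]
```

A ranking that repeats images from the same cluster scores no higher than one that covers each cluster once, which is exactly the behaviour a diversity measure should reward.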
A study of inter-annotator agreement for opinion retrieval
Evaluation of sentiment analysis, like large-scale IR evaluation, relies on the accuracy of human assessors to create judgments. Subjectivity in judgments is a problem for relevance assessment and even more so in the case of sentiment annotations. In this study we examine the degree to which assessors agree on sentence-level sentiment annotation. We show that inter-assessor agreement is not contingent on document length or frequency of sentiment but correlates positively with automated opinion retrieval performance. We also examine the individual annotation categories to determine which categories pose the most difficulty for annotators.
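Breaking agreement down by annotation category, as the study does, can be done by asking: for each label either assessor used on a sentence, how often did both assessors agree on that sentence? A minimal sketch with hypothetical sentence-level labels:

```python
from collections import defaultdict

def agreement_by_category(ann_a, ann_b):
    """Per-category agreement: for each sentiment label either assessor
    used on a sentence, the fraction of those sentences on which both
    assessors gave the same label."""
    counts = defaultdict(lambda: [0, 0])  # label -> [agreements, total]
    for x, y in zip(ann_a, ann_b):
        for label in {x, y}:
            counts[label][1] += 1
            counts[label][0] += int(x == y)
    return {label: a / t for label, (a, t) in counts.items()}

# Hypothetical annotations for five sentences by two assessors
a = ["pos", "neg", "neut", "pos", "neg"]
b = ["pos", "neut", "neut", "pos", "pos"]
```

A category with low agreement under this breakdown is one that annotators interpret inconsistently, which is the kind of difficulty the abstract refers to.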
Human assessments of document similarity
Two studies are reported that examined the reliability of human assessments of document similarity and the association between human ratings and the results of n-gram automatic text analysis (ATA). Human interassessor reliability (IAR) was moderate to poor. However, correlations between average human ratings and n-gram solutions were strong. The average correlation between ATA and individual human solutions was greater than IAR. N-gram length influenced the strength of association, but optimum string length depended on the nature of the text (technical vs. nontechnical). We conclude that the methodology applied in previous studies may have led to overoptimistic views on human reliability, but that an optimal n-gram solution can provide a good approximation of the average human assessment of document similarity, a result that has important implications for future development of document visualization systems
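The n-gram automatic text analysis referred to above can be approximated by comparing character n-gram frequency profiles of two documents, for example with cosine similarity. A minimal sketch of that idea (the studies' exact ATA method may differ):

```python
from collections import Counter
from math import sqrt

def ngram_profile(text, n):
    """Character n-gram frequency profile of a document."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def cosine_similarity(p, q):
    """Cosine between two n-gram profiles; a simple stand-in for the
    n-gram automatic text analysis (ATA) described above."""
    dot = sum(p[g] * q[g] for g in p)
    norm = sqrt(sum(v * v for v in p.values())) * \
           sqrt(sum(v * v for v in q.values()))
    return dot / norm if norm else 0.0
```

Varying `n` here corresponds to the string-length effect the studies report: short n-grams capture surface overlap, while longer ones are more sensitive to shared technical vocabulary.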
Overview of the CLEF-2005 cross-language speech retrieval track
The task for the CLEF-2005 cross-language speech retrieval track was to identify topically coherent segments of English interviews in a known-boundary condition. Seven teams participated, performing both monolingual and cross-language searches of ASR transcripts, automatically generated metadata, and manually generated metadata.
Results indicate that monolingual search technology is sufficiently accurate to be useful for some purposes (the best mean average precision was 0.18), and cross-language searching yielded results typical of those seen in other applications (with the best systems approximating monolingual mean average precision).
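The mean average precision (MAP) figure quoted above is computed per topic and then averaged across topics. A minimal sketch with hypothetical rankings and judgments:

```python
def average_precision(ranking, relevant):
    """Average precision for one topic: the mean of precision@k over
    the ranks k at which relevant documents appear."""
    hits, score = 0, 0.0
    for k, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            score += hits / k
    return score / len(relevant) if relevant else 0.0

def mean_average_precision(runs, qrels):
    """MAP over topics; runs and qrels are dicts keyed by topic id."""
    return sum(average_precision(runs[t], qrels[t]) for t in qrels) / len(qrels)
```

A MAP of 0.18, as in the track's best monolingual run, means that on average relevant segments appear well down the ranking but are still found often enough to be useful.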
Active Sampling for Large-scale Information Retrieval Evaluation
Evaluation is crucial in Information Retrieval. The development of models, tools and methods has significantly benefited from the availability of reusable test collections formed through a standardized and thoroughly tested methodology, known as the Cranfield paradigm. Constructing these collections requires obtaining relevance judgments for a pool of documents retrieved by systems participating in an evaluation task, and thus involves immense human labor. To alleviate this effort, different methods for constructing collections have been proposed in the literature, falling under two broad categories: (a) sampling, and (b) active selection of documents. The former devises a smart sampling strategy by choosing only a subset of documents to be assessed and inferring evaluation measures on the basis of the obtained sample; the sampling distribution is fixed at the beginning of the process. The latter recognizes that systems contributing documents to be judged vary in quality, and actively selects documents from good systems; the quality of systems is re-estimated each time a new document is judged. In this paper we seek to solve the problem of large-scale retrieval evaluation by combining the two approaches. We devise an active sampling method that avoids the bias of active selection methods towards good systems and, at the same time, reduces the variance of current sampling approaches by placing a distribution over systems that varies as judgments become available. We validate the proposed method using TREC data and demonstrate the advantages of the new method compared to past approaches.
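The sampling idea underlying such methods is that an unbiased estimate of a measure can be recovered from a judged sample by weighting each judgment inversely to its inclusion probability (a Horvitz-Thompson-style estimator). A minimal sketch of that idea with a uniform sampling distribution, not the paper's actual estimator:

```python
import random

def estimate_relevant(pool, probs, judge, rng):
    """Estimate the number of relevant documents in a pool while
    judging only a random sample. `probs` maps each document to its
    inclusion probability; `judge` returns 1 for relevant, 0 otherwise.
    Inverse-probability weighting keeps the estimate unbiased."""
    estimate = 0.0
    for doc in pool:
        if rng.random() < probs[doc]:
            estimate += judge(doc) / probs[doc]
    return estimate
```

In an active sampling scheme the inclusion probabilities would be updated as judgments arrive rather than fixed up front; the inverse-probability weighting is what removes the bias towards documents from good systems.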
Towards automatic generation of relevance judgments for a test collection
This paper presents a new technique for building a relevance judgment list for information retrieval test collections without any human intervention. It is based on the number of occurrences of documents in runs retrieved from several information retrieval systems, and on a distance-based measure between the documents. The effectiveness of the technique is evaluated by computing the correlation between the ranking of TREC systems produced using the original relevance judgment list (qrels) built by human assessors and the ranking obtained using the newly generated qrels.
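The occurrence-counting part of such a technique can be sketched as pooling: a document retrieved near the top by enough independent systems is taken as pseudo-relevant. A minimal sketch of that idea only (the paper additionally uses a document-distance measure, not modelled here; the runs below are hypothetical):

```python
from collections import Counter

def pseudo_qrels(runs, k, min_votes):
    """Automatic relevance judgments from run agreement: a document
    appearing in the top-k of at least `min_votes` runs is treated
    as relevant."""
    votes = Counter()
    for run in runs:
        votes.update(set(run[:k]))  # one vote per run, however often it appears
    return {doc for doc, v in votes.items() if v >= min_votes}

# Hypothetical ranked runs from three retrieval systems
runs = [["d1", "d2", "d3"], ["d2", "d1", "d4"], ["d2", "d5", "d1"]]
```

The resulting pseudo-qrels are then used in place of human judgments, and their quality is assessed exactly as the abstract describes: by correlating the system ranking they induce with the ranking under the official qrels.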
PACRR: A Position-Aware Neural IR Model for Relevance Matching
In order to adopt deep learning for information retrieval, models are needed that can capture all relevant information required to assess the relevance of a document to a given user query. While previous works have successfully captured unigram term matches, how to fully employ position-dependent information such as proximity and term dependencies has been insufficiently explored. In this work, we propose a novel neural IR model named PACRR, aiming at better modeling position-dependent interactions between a query and a document. Extensive experiments on six years' TREC Web Track data confirm that the proposed model yields better results under multiple benchmarks.
Comment: To appear in EMNLP201
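The core signal such position-aware models operate on is a query-by-document match matrix, over which n-by-n windows are scored and pooled so that adjacent matches (proximity) count for more than scattered ones. A toy pure-Python sketch of that mechanism, using exact-match signals instead of the embedding similarities and learned convolution filters of the actual model:

```python
def match_matrix(query, doc):
    """Binary query-by-document term match matrix: the input signal
    a position-aware model convolves over (a toy stand-in for an
    embedding-similarity matrix)."""
    return [[1.0 if q == d else 0.0 for d in doc] for q in query]

def ngram_conv_max(sim, n):
    """Slide an n-by-n window over the match matrix, score each window
    by its mean, and max-pool over document positions for each query
    position: a crude analogue of n-gram convolution plus pooling."""
    q_len, d_len = len(sim), len(sim[0])
    pooled = []
    for qi in range(q_len - n + 1):
        best = 0.0
        for di in range(d_len - n + 1):
            window = sum(sim[qi + r][di + c]
                         for r in range(n) for c in range(n))
            best = max(best, window / (n * n))
        pooled.append(best)
    return pooled

query = ["cheap", "web", "hosting"]
```

A document containing the query terms consecutively scores higher under the 2-gram windows than one containing the same terms scattered apart, which is precisely the position-dependent information a unigram matching model cannot see.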