An overview on the evaluated video retrieval tasks at TRECVID 2022
The TREC Video Retrieval Evaluation (TRECVID) is a TREC-style video analysis
and retrieval evaluation with the goal of promoting progress in research and
development of content-based exploitation and retrieval of information from
digital video via open, task-based evaluation supported by metrology. Over the
last twenty-one years this effort has yielded a better understanding of how
systems can effectively accomplish such processing and how one can reliably
benchmark their performance. TRECVID has been funded by NIST (National
Institute of Standards and Technology) and other US government agencies. In
addition, many organizations and individuals worldwide contribute significant
time and effort. TRECVID 2022 planned the following six tasks: ad-hoc video
search, video-to-text captioning, disaster scene description and indexing,
activity in extended videos, deep video understanding, and movie summarization.
In total, 35 teams from various research organizations worldwide signed up to
join the evaluation campaign this year. This paper introduces the tasks,
datasets used, evaluation frameworks and metrics, as well as a high-level
results overview.
Comment: arXiv admin note: substantial text overlap with arXiv:2104.13473,
arXiv:2009.0998
On Design and Evaluation of High-Recall Retrieval Systems for Electronic Discovery
High-recall retrieval is an information retrieval task model where the goal is to
identify, for human consumption, all, or as many as practicable, documents relevant to
a particular information need.
This thesis investigates the ways in which one can evaluate high-recall retrieval
systems and explores several design considerations that should be accounted for when designing
such systems for electronic discovery.
The primary contribution of this work is a framework for conducting high-recall retrieval
experimentation in a controlled and repeatable way.
This framework builds upon lessons learned from similar tasks to facilitate the use
of retrieval systems on collections that cannot be distributed due to the sensitivity
or privacy of the material contained within.
Accordingly, a Web API is used to distribute document collections,
information needs, and corresponding relevance assessments in a one-document-at-a-time manner.
Validation is conducted through the successful deployment of this architecture in the 2015 TREC
Total Recall track over the live Web and in controlled environments.
Using the runs submitted to the Total Recall track and other test collections, we explore the
efficacy of a variety of new and existing effectiveness measures for high-recall retrieval tasks.
We find that summarizing the trade-off between recall and the effort required to attain that
recall is a non-trivial task and that several measures are sensitive to properties of the test
collections themselves.
We conclude that the gain curve, a de facto standard, and variants of the gain curve are the most robust to
variations in test collection properties and the evaluation of high-recall systems.
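As an illustration of the recall-versus-effort trade-off discussed above, the following is a minimal sketch of a gain curve, assuming effort is counted as the number of documents reviewed and that the total number of relevant documents is known. The function names `gain_curve` and `effort_for_recall` are hypothetical and are not taken from the thesis.

```python
from typing import List, Optional, Tuple


def gain_curve(judgments: List[bool], total_relevant: int) -> List[Tuple[int, float]]:
    """Return (effort, recall) points, one per document reviewed.

    judgments: relevance of each document in the order it was reviewed.
    total_relevant: assumed known number of relevant documents in the collection.
    """
    if total_relevant <= 0:
        raise ValueError("total_relevant must be positive")
    points: List[Tuple[int, float]] = []
    found = 0
    for effort, is_relevant in enumerate(judgments, start=1):
        if is_relevant:
            found += 1
        points.append((effort, found / total_relevant))
    return points


def effort_for_recall(points: List[Tuple[int, float]], target: float) -> Optional[int]:
    """Return the smallest effort at which recall reaches target, or None."""
    for effort, recall in points:
        if recall >= target:
            return effort
    return None


# Hypothetical review order: five documents judged, four relevant in the collection.
points = gain_curve([True, False, True, True, False], total_relevant=4)
effort_at_half = effort_for_recall(points, 0.5)
```

Plotting recall against effort for each run gives one curve per system. A single-number summary such as `effort_for_recall` discards most of the curve's shape, which is one reason summarizing the trade-off is non-trivial.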
This thesis also explores the effect that non-authoritative, surrogate assessors can have
when training machine learning algorithms.
Contrary to popular thought, we find that surrogate assessors appear to be inferior
to authoritative assessors due to differences of opinion rather than innate inferiority in
their ability to identify relevance.
Furthermore, we show that several techniques for diversifying and liberalizing a surrogate
assessor's conception of relevance can yield substantial improvement in the surrogate
and, in some cases, rival the authority.
Finally, we present the results of a user study conducted to investigate the effect that
three archetypal high-recall retrieval systems have on judging behaviour.
Compared to using random and uncertainty sampling, selecting documents for training using relevance sampling significantly decreases the probability that
a user will identify that document as relevant.
On the other hand, no substantial difference between the test conditions is observed in the time taken to render
such assessments.
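The three document-selection strategies named above can be sketched as follows. This is a simplified illustration using the usual active-learning definitions, not the systems evaluated in the user study. It assumes a classifier has already assigned each unjudged document a relevance score in [0, 1]. The function `select_document` and the example scores are hypothetical.

```python
import random
from typing import Dict, Optional


def select_document(
    scores: Dict[str, float],
    strategy: str,
    rng: Optional[random.Random] = None,
) -> str:
    """Pick the next document to present to the assessor.

    scores: assumed classifier scores in [0, 1], keyed by document id.
    strategy: "relevance", "uncertainty", or "random".
    """
    if not scores:
        raise ValueError("no unjudged documents left")
    if strategy == "relevance":
        # Relevance sampling: the document scored most likely to be relevant.
        return max(scores, key=lambda doc_id: scores[doc_id])
    if strategy == "uncertainty":
        # Uncertainty sampling: the document whose score is closest to 0.5.
        return min(scores, key=lambda doc_id: abs(scores[doc_id] - 0.5))
    if strategy == "random":
        # Random sampling: a uniformly chosen document.
        rng = rng or random.Random()
        return rng.choice(sorted(scores))
    raise ValueError(f"unknown strategy: {strategy}")


# Hypothetical scores for three unjudged documents.
scores = {"d1": 0.95, "d2": 0.52, "d3": 0.10}
by_relevance = select_document(scores, "relevance")
by_uncertainty = select_document(scores, "uncertainty")
by_random = select_document(scores, "random", random.Random(0))
```

With these scores, relevance sampling selects `d1` and uncertainty sampling selects `d2`, so the two strategies show the assessor quite different documents.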