Empirical Methodology for Crowdsourcing Ground Truth
The process of gathering ground truth data through human annotation is a
major bottleneck in the use of information extraction methods for populating
the Semantic Web. Crowdsourcing-based approaches are gaining popularity in the
attempt to solve the issues related to volume of data and lack of annotators.
Typically these practices use inter-annotator agreement as a measure of
quality. However, in many domains, such as event detection, the data is
ambiguous and open to a multitude of perspectives. We present an empirically
derived methodology for efficiently gathering ground truth data across a diverse set of use cases covering a variety
of domains and annotation tasks. Central to our approach is the use of
CrowdTruth metrics that capture inter-annotator disagreement. We show that
measuring disagreement is essential for acquiring a high quality ground truth.
We achieve this by comparing the quality of the data aggregated with CrowdTruth
metrics with majority vote, over a set of diverse crowdsourcing tasks: Medical
Relation Extraction, Twitter Event Identification, News Event Extraction and
Sound Interpretation. We also show that an increased number of crowd workers
leads to growth and stabilization in the quality of annotations, going against
the usual practice of employing a small number of annotators.
Comment: in publication at the Semantic Web Journal
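The contrast the abstract draws can be made concrete with a minimal sketch. The functions below are illustrative only, not the actual CrowdTruth metrics (which are vector-based and score workers, annotations, and media units jointly): majority vote collapses each item to a single label, while a disagreement-aware score keeps the fraction of workers choosing each label, so a low maximum score flags an ambiguous item rather than discarding the minority view.

```python
from collections import Counter

def majority_vote(annotations):
    """Aggregate one item's worker labels by simple majority."""
    counts = Counter(annotations)
    label, _ = counts.most_common(1)[0]
    return label

def label_score(annotations, label):
    """Disagreement-aware score: fraction of workers choosing `label`.
    Scores far below 1.0 signal genuine ambiguity in the item."""
    return annotations.count(label) / len(annotations)

votes = ["event", "event", "no-event", "event", "no-event"]
print(majority_vote(votes))         # majority vote keeps only "event"
print(label_score(votes, "event"))  # 0.6: the disagreement is preserved
```

Majority vote would report the two items "5 of 5 agree" and "3 of 5 agree" identically; the score-based view keeps them apart, which is the property the paper argues is essential.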
Evaluation campaigns and TRECVid
The TREC Video Retrieval Evaluation (TRECVid) is an
international benchmarking activity to encourage research
in video information retrieval by providing a large test collection, uniform scoring procedures, and a forum for organizations interested in comparing their results. TRECVid completed its fifth annual cycle at the end of 2005 and in 2006 TRECVid will involve almost 70 research organizations, universities and other consortia. Throughout its existence, TRECVid has benchmarked both interactive and automatic/manual searching for shots from within a video
corpus, automatic detection of a variety of semantic and
low-level video features, shot boundary detection and the
detection of story boundaries in broadcast TV news. This
paper will give an introduction to information retrieval (IR) evaluation from both a user and a system perspective, highlighting that system evaluation is by far the most prevalent type of evaluation carried out. We also include a summary of TRECVid as an example of a system evaluation benchmarking campaign and this allows us to discuss whether
such campaigns are a good thing or a bad thing. There are
arguments for and against these campaigns and we present
some of them in the paper concluding that on balance they
have had a very positive impact on research progress
Symbiosis between the TRECVid benchmark and video libraries at the Netherlands Institute for Sound and Vision
Audiovisual archives are investing in large-scale digitisation efforts of their analogue holdings and, in parallel, ingesting an ever-increasing amount of born-digital files into their digital storage facilities. Digitisation has opened up new access paradigms and boosted re-use of audiovisual content. Query-log analyses show the shortcomings of manual annotation, and archives are therefore complementing these annotations by developing novel search engines that automatically extract information from both the audio and the visual tracks. Over the past few years, the TRECVid benchmark has developed a novel relationship with the Netherlands Institute for Sound and Vision (NISV) which goes beyond the NISV merely providing data and use cases to TRECVid. Prototype and demonstrator systems developed as part of TRECVid are set to become a key driver in improving the quality of search engines at the NISV and will ultimately help other audiovisual archives to offer more efficient and more fine-grained access to their collections. This paper reports the experiences of NISV in leveraging the activities of the TRECVid benchmark
Access to recorded interviews: A research agenda
Recorded interviews form a rich basis for scholarly inquiry. Examples include oral histories, community memory projects, and interviews conducted for broadcast media. Emerging technologies offer the potential to radically transform the way in which recorded interviews are made accessible, but this vision will demand substantial investments from a broad range of research communities. This article reviews the present state of practice for making recorded interviews available and the state-of-the-art for key component technologies. A large number of important research issues are identified, and from that set of issues, a coherent research agenda is proposed
TREC video retrieval evaluation: a case study and status report
The TREC Video Retrieval Evaluation is a multiyear, international effort, funded by the US Advanced Research and Development Activity (ARDA) and the National Institute of Standards and Technology (NIST) to promote progress in content-based retrieval from digital video via open, metrics-based evaluation. Now beginning its fourth year, it aims over time to develop both a better understanding of
how systems can effectively accomplish such retrieval
and how one can reliably benchmark their performance. This paper can be seen as a case study in the development of video retrieval systems and their evaluation as well as a report on their status to date. After an introduction to the evolution of the evaluation over the past three years, the paper reports on the most recent evaluation, TRECVID 2003: the evaluation framework (the 4 tasks: shot boundary determination, high-level feature extraction, story segmentation and typing, and search; 133 hours of US television
news data; and the measures), the results, and the approaches taken by the 24 participating groups
PEANUT: A Human-AI Collaborative Tool for Annotating Audio-Visual Data
Audio-visual learning seeks to enhance the computer's multi-modal perception
by leveraging the correlation between the auditory and visual modalities. Despite
their many useful downstream tasks, such as video retrieval, AR/VR, and
accessibility, the performance and adoption of existing audio-visual models
have been impeded by the limited availability of high-quality datasets.
Annotating audio-visual datasets is laborious, expensive, and time-consuming. To address
this challenge, we designed and developed an efficient audio-visual annotation
tool called Peanut. Peanut's human-AI collaborative pipeline separates the
multi-modal task into two single-modal tasks, and utilizes state-of-the-art
object detection and sound-tagging models to reduce the annotators' effort to
process each frame and the number of manually-annotated frames needed. A
within-subject user study with 20 participants found that Peanut can
significantly accelerate the audio-visual data annotation process while
maintaining high annotation accuracy.
Comment: 18 pages, published in UIST'2
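The decomposition the abstract describes can be sketched as follows. This is a hypothetical illustration, not Peanut's actual implementation: the stub detectors stand in for real object-detection and sound-tagging models, and the pipeline simply cross-references their single-modal outputs so the annotator only confirms or corrects pre-filled proposals instead of labelling each frame from scratch.

```python
def detect_objects(frame):
    """Stub visual pass: in a real pipeline an object-detection model
    would return (label, bounding_box) pairs for this frame."""
    return frame

def tag_sounds(audio_clip):
    """Stub audio pass: in a real pipeline a sound-tagging model
    would return the set of sound labels heard in this clip."""
    return audio_clip

def propose_av_annotations(frame, audio_clip):
    """Cross-reference the two single-modal outputs: only objects whose
    label also appears among the audio tags are proposed as sounding
    objects, leaving the human to verify each proposal."""
    sound_labels = set(tag_sounds(audio_clip))
    return [(label, box) for label, box in detect_objects(frame)
            if label in sound_labels]

detections = [("dog", (10, 20, 50, 60)), ("car", (0, 0, 30, 30))]
tags = {"dog", "speech"}
print(propose_av_annotations(detections, tags))  # only the "dog" box survives
```

Verifying a short list of proposals is much cheaper than exhaustive frame-by-frame labelling, which is where the reported speed-up plausibly comes from.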
"You Tube and I Find" - personalizing multimedia content access
Recent growth in broadband access and the proliferation of small personal devices that capture images and videos have led to explosive growth of multimedia content available everywhere, from personal disks to the Web. While digital media capture and upload have become nearly universal with newer device technology, there is still a need for better tools and technologies to search large collections of multimedia data and to find and deliver the right content to a user according to her current needs and preferences. A renewed focus on the subjective dimension in the multimedia lifecycle, from creation and distribution to delivery and consumption, is required to address this need beyond what is feasible today. Integration of the subjective aspects of the media itself, namely its affective, perceptual, and physiological potential (both intended and achieved), together with those of the users themselves, will allow for personalizing content access beyond today's facility. This integration, transforming the traditional multimedia information retrieval (MIR) indexes to more effectively answer specific user needs, will allow a richer degree of personalization predicated on user intention and mode of interaction, relationship to the producer, content of the media, and the user's history and lifestyle. In this paper, we identify the challenges in achieving this integration, review current approaches to interpreting content creation processes, to user modelling and profiling, and to personalized content selection, and we detail future directions. The structure of the paper is as follows: In Section I, we introduce the problem and present some definitions. In Section II, we present a review of the aspects of personalized content and current approaches to it. Section III discusses the problem of obtaining the metadata required for personalized media creation and presents eMediate as a case study of an integrated media capture environment.
Section IV presents the MAGIC system as a case study of capturing effective descriptive data and putting users first in distributed learning delivery. The aspects of modelling the user are presented as a case study in using user’s personality as a way to personalize summaries in Section V. Finally, Section VI concludes the paper with a discussion on the emerging challenges and the open problems
FSD50K: an Open Dataset of Human-Labeled Sound Events
Most existing datasets for sound event recognition (SER) are relatively small
and/or domain-specific, with the exception of AudioSet, based on a massive
amount of audio tracks from YouTube videos and encompassing over 500 classes of
everyday sounds. However, AudioSet is not an open dataset---its release
consists of pre-computed audio features (instead of waveforms), which limits
the adoption of some SER methods. Downloading the original audio tracks is also
problematic due to constituent YouTube videos gradually disappearing and usage
rights issues, which casts doubts over the suitability of this resource for
systems' benchmarking. To provide an alternative benchmark dataset and thus
foster SER research, we introduce FSD50K, an open dataset containing over 51k
audio clips totalling over 100h of audio manually labeled using 200 classes
drawn from the AudioSet Ontology. The audio clips are licensed under Creative
Commons licenses, making the dataset freely distributable (including
waveforms). We provide a detailed description of the FSD50K creation process,
tailored to the particularities of Freesound data, including challenges
encountered and solutions adopted. We include a comprehensive dataset
characterization along with discussion of limitations and key factors to allow
its audio-informed usage. Finally, we conduct sound event classification
experiments to provide baseline systems as well as insight on the main factors
to consider when splitting Freesound audio data for SER. Our goal is to develop
a dataset to be widely adopted by the community as a new open benchmark for SER
research
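Since FSD50K clips can carry several of the 200 AudioSet Ontology classes at once, training targets for sound event classification are typically binary label vectors. The sketch below is a generic illustration of that encoding, with a made-up three-class vocabulary; it is not code from the FSD50K release.

```python
def encode_multilabel(clip_labels, vocabulary):
    """Encode a clip's sound-event labels as a binary target vector,
    the usual format for multi-label SER training."""
    index = {label: i for i, label in enumerate(vocabulary)}
    target = [0] * len(vocabulary)
    for label in clip_labels:
        target[index[label]] = 1
    return target

vocab = ["Bark", "Engine", "Speech"]          # illustrative mini-vocabulary
print(encode_multilabel({"Bark", "Speech"}, vocab))  # [1, 0, 1]
```

Because each class is an independent binary target, baseline systems for such data are usually trained with a per-class binary loss rather than a single softmax over classes.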