3,910 research outputs found
Deep Multimodal Speaker Naming
Automatic speaker naming is the problem of localizing as well as identifying
each speaking character in a TV/movie/live show video. This problem is
challenging mainly due to its multimodal nature: face cues alone are
insufficient to achieve good performance. Previous multimodal approaches to
this problem usually process the data of different modalities individually and
merge them using handcrafted heuristics. Such approaches work well for simple
scenes, but fail to achieve high performance for speakers with large appearance
variations. In this paper, we propose a novel convolutional neural network
(CNN)-based learning framework to automatically learn the fusion function of
both face and audio cues. We show that without using face tracking, facial
landmark localization or subtitle/transcript, our system with robust multimodal
feature extraction is able to achieve state-of-the-art speaker naming
performance evaluated on two diverse TV series. The dataset and implementation
of our algorithm are publicly available online.
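The fusion idea described above can be sketched in a few lines. The abstract does not specify the CNN architecture, so the single linear fusion layer, the toy feature dimensions, and the random weights below are purely illustrative assumptions, not the paper's method:

```python
import math
import random

random.seed(0)

def fuse_and_score(face_feat, audio_feat, W, b):
    """Minimal late-fusion sketch: concatenate face and audio feature
    vectors, then apply one linear layer plus a softmax over candidate
    speakers. (The paper learns the fusion with a CNN; this single
    linear layer is an illustrative stand-in.)"""
    fused = face_feat + audio_feat                      # list concatenation
    logits = [sum(w * x for w, x in zip(row, fused)) + bi
              for row, bi in zip(W, b)]
    m = max(logits)                                     # for numerical stability
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]                        # speaker probabilities

# Toy setup: 8-d face embedding, 4-d audio embedding, 3 candidate speakers.
face = [random.gauss(0, 1) for _ in range(8)]
audio = [random.gauss(0, 1) for _ in range(4)]
W = [[random.gauss(0, 0.1) for _ in range(12)] for _ in range(3)]
b = [0.0] * 3

probs = fuse_and_score(face, audio, W, b)
print(len(probs), round(sum(probs), 6))
```

In the paper the fusion weights are learned end-to-end from labeled video; here they are random, so only the shapes and the normalization behavior are meaningful.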
A Review: Movie Character Identification Based on Graph Matching
With the rapid development of the movie and television industry, a huge amount of movie and television data is generated every day. Managing this data requires efficient and effective techniques that understand video content and organize it properly. Character identification in movies is a challenging problem: large variation in each character's appearance, complex backgrounds, large motion, non-rigid deformation, occlusion, and changes in pose, expression, clothing, makeup, and hairstyle, along with other uncontrolled conditions, make the results of face detection and face tracking unreliable.
From Benedict Cumberbatch to Sherlock Holmes: Character Identification in TV series without a Script
The goal of this paper is the automatic identification of characters in TV
and feature film material. In contrast to standard approaches to this task,
which rely on the weak supervision afforded by transcripts and subtitles, we
propose a new method requiring only a cast list. This list is used to obtain
images of actors from freely available sources on the web, providing a form of
partial supervision for this task. In using images of actors to recognize
characters, we make the following three contributions: (i) We demonstrate that
an automated semi-supervised learning approach is able to adapt from the
actor's face to the character's face, including the face context of the hair;
(ii) By building voice models for every character, we provide a bridge between
frontal faces (for which there is plenty of actor-level supervision) and
profile faces (for which there is very little or none); and (iii) by combining face
context and speaker identification, we are able to identify characters with
partially occluded faces and extreme facial poses. Results are presented on the
TV series 'Sherlock' and the feature film 'Casablanca'. We achieve
state-of-the-art performance on the Casablanca benchmark, surpassing previous
methods that used the stronger supervision available from transcripts.
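Contribution (iii) above amounts to combining per-character scores from two cue streams. A minimal sketch of such score fusion follows; the character names, score values, and the equal mixing weight `alpha` are all hypothetical illustrations, not values from the paper:

```python
# Hypothetical per-character confidence scores from a face-context model
# and a per-character voice model (values are made up for illustration).
face_scores = {"Sherlock": 0.35, "Watson": 0.30, "Moriarty": 0.20}
voice_scores = {"Sherlock": 0.55, "Watson": 0.25, "Moriarty": 0.10}

def combine(face_scores, voice_scores, alpha=0.5):
    """Weighted late fusion of the two cue streams. alpha is an assumed
    mixing weight; the paper does not state how the cues are weighted."""
    chars = face_scores.keys() & voice_scores.keys()
    return {c: alpha * face_scores[c] + (1 - alpha) * voice_scores[c]
            for c in chars}

combined = combine(face_scores, voice_scores)
best = max(combined, key=combined.get)
print(best)  # → Sherlock
```

The point of the fusion is robustness: when one cue is weak (an occluded face, or an extreme pose), the other can still dominate the combined score.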
CHORUS Deliverable 2.1: State of the Art on Multimedia Search Engines
Based on the information provided by European projects and national initiatives related to multimedia search, as well as domain experts who participated in the CHORUS Think-Tanks and workshops, this document reports on the state of the art in multimedia content search from a technical and socio-economic perspective.
The technical perspective includes an up-to-date view of content-based indexing and retrieval technologies, multimedia search in the context of mobile devices and peer-to-peer networks, and an overview of current evaluation and benchmark initiatives that measure the performance of multimedia search engines.
From a socio-economic perspective, we survey the impact and legal consequences of these technical advances and point out future directions of research.
AI-Generated Content (AIGC): A Survey
To address the challenges of digital intelligence in the digital economy,
artificial intelligence-generated content (AIGC) has emerged. AIGC uses
artificial intelligence to assist or replace manual content generation by
generating content based on user-inputted keywords or requirements. The
development of large model algorithms has significantly strengthened the
capabilities of AIGC, which makes AIGC products a promising generative tool and
adds convenience to our lives. As an upstream technology, AIGC has unlimited
potential to support different downstream applications. It is important to
analyze AIGC's current capabilities and shortcomings to understand how it can
be best utilized in future applications. Therefore, this paper provides an
extensive overview of AIGC, covering its definition, essential conditions,
cutting-edge capabilities, and advanced features. Moreover, it discusses the
benefits of large-scale pre-trained models and the industrial chain of AIGC.
Furthermore, the article explores the distinctions between auxiliary generation
and automatic generation within AIGC, providing examples of text generation.
The paper also examines the potential integration of AIGC with the Metaverse.
Lastly, the article highlights existing issues and suggests some future
directions for application. Comment: Preprint. 14 figures, 4 tables.
A Labeling Task Design for Supporting Algorithmic Needs: Facilitating Worker Diversity and Reducing AI Bias
Studies on supervised machine learning (ML) recommend involving workers from
various backgrounds in training dataset labeling to reduce algorithmic bias.
Moreover, sophisticated tasks for categorizing objects in images are necessary
to improve ML performance, further complicating micro-tasks. This study aims to
develop a task design incorporating the fair participation of people,
regardless of their specific backgrounds or the task's difficulty. By collaborating
with 75 labelers from diverse backgrounds for 3 months, we analyzed workers'
log-data and relevant narratives to identify the task's hurdles and helpers.
The findings revealed that workers' decision-making tendencies varied depending
on their backgrounds. We found that a supportive worker community and the
machine feedback perceived by workers helped people engage in the work more
easily, and thus ML bias could be expected to decrease. Based on these
findings, we suggest an extended human-in-the-loop approach that connects
labelers, machines, and communities rather than isolating individual workers. Comment: 45 pages, 4 figures.