Self-Supervised Vision-Based Detection of the Active Speaker as Support for Socially-Aware Language Acquisition
This paper presents a self-supervised method for visual detection of the
active speaker in a multi-person spoken interaction scenario. Active speaker
detection is a fundamental prerequisite for any artificial cognitive system
attempting to acquire language in social settings. The proposed method is
intended to complement the acoustic detection of the active speaker, thus
improving the system robustness in noisy conditions. The method can detect an
arbitrary number of possibly overlapping active speakers based exclusively on
visual information about their faces. Furthermore, the method does not rely on
external annotations, in keeping with the constraints of cognitive development.
Instead, the method uses information from the auditory modality to support learning in the
visual domain. This paper reports an extensive evaluation of the proposed
method using a large multi-person face-to-face interaction dataset. The results
show good performance in a speaker dependent setting. However, in a speaker
independent setting the proposed method yields a significantly lower
performance. We believe that the proposed method represents an essential
component of any artificial cognitive system or robotic platform engaging in
social interactions.
Comment: 10 pages, IEEE Transactions on Cognitive and Developmental Systems
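The cross-modal supervision described above can be sketched as follows. This is an illustrative simplification, not the paper's pipeline: `energy_vad` stands in for whatever acoustic detector provides the labels, and the single-visible-face assumption in `make_pseudo_labels` is a hypothetical way to transfer an audio label to a face without annotation.

```python
def energy_vad(frames, threshold):
    """Crude energy-based voice activity detector: a frame counts as
    'speech' when its mean squared amplitude exceeds the threshold.
    (Illustrative stand-in for the acoustic detector.)"""
    return [1 if sum(s * s for s in f) / len(f) > threshold else 0
            for f in frames]

def make_pseudo_labels(vad, face_crops):
    """Pair each face crop with the audio voice-activity decision for its
    frame. Assumes training clips contain a single visible face, so the
    audio label transfers to that face unambiguously -- a hypothetical
    simplification; the paper's pairing scheme may differ. The resulting
    (crop, label) pairs can then train a visual speaking/not-speaking
    classifier with no human annotation."""
    return [(crop, label) for crop, label in zip(face_crops, vad)]
```

The point of the sketch is the label flow: labels originate in the auditory modality and supervise a purely visual model, which at test time can run on faces alone, e.g. in noise where the acoustic detector fails.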
Recover Subjective Quality Scores from Noisy Measurements
Simple quality metrics such as PSNR are known to correlate poorly with
subjective quality when tested across a wide spectrum of video content or
quality regimes. Recently, efforts have been made to design objective quality
metrics trained on subjective data (e.g. VMAF), demonstrating better
correlation with video quality as perceived by humans. Clearly, the accuracy of
such a metric heavily depends on the quality of the subjective data that it is
trained on. In this paper, we propose a new approach to recover subjective
quality scores from noisy raw measurements, using maximum likelihood
estimation, by jointly estimating the subjective quality of impaired videos,
the bias and consistency of test subjects, and the ambiguity of video contents
all together. We also derive closed-form expressions for the confidence interval
of each estimate. Compared to previous methods, which only partially exploit the
subjective information, our approach exploits the information in
full, yielding tighter confidence intervals and better handling of outliers
without the need for z-scoring or subject rejection. It also handles missing
data more gracefully. Finally, as side information, it provides interesting
insights into the test subjects and video contents.
Comment: 16 pages; abridged version appeared in Data Compression Conference
(DCC) 201
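The joint estimation can be sketched as a coordinate-ascent maximum-likelihood procedure. This is a simplified illustration under stated assumptions, not the paper's exact algorithm: the content-ambiguity term, missing-data handling, and closed-form confidence intervals of the full model are omitted, and all names are hypothetical.

```python
def recover_scores(x, n_iter=50):
    """Jointly estimate per-video quality q[e], per-subject bias b[s] and
    per-subject inconsistency (residual variance) v[s] from a raw score
    matrix x[e][s], by alternating the per-block maximum-likelihood
    updates. Simplified sketch: no content-ambiguity term, no missing
    entries, no subject rejection or z-scoring needed."""
    E, S = len(x), len(x[0])
    q = [sum(row) / S for row in x]   # init: per-video mean opinion score
    b = [0.0] * S
    v = [1.0] * S
    for _ in range(n_iter):
        # Subject bias: mean residual of that subject across videos.
        for s in range(S):
            b[s] = sum(x[e][s] - q[e] for e in range(E)) / E
        # Subject inconsistency: variance of that subject's residuals.
        for s in range(S):
            v[s] = max(sum((x[e][s] - q[e] - b[s]) ** 2
                           for e in range(E)) / E, 1e-6)
        # Video quality: inverse-variance-weighted mean of debiased
        # scores, so inconsistent subjects contribute less.
        wsum = sum(1.0 / v[s] for s in range(S))
        for e in range(E):
            q[e] = sum((x[e][s] - b[s]) / v[s] for s in range(S)) / wsum
    return q, b, v
```

Note the model has a gauge freedom (a constant can shift between q and b); the mean-score initialization pins it down only softly, so comparisons of recovered qualities are more meaningful than their absolute values.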