17,132 research outputs found

    Self-Supervised Vision-Based Detection of the Active Speaker as Support for Socially-Aware Language Acquisition

    This paper presents a self-supervised method for visual detection of the active speaker in a multi-person spoken interaction scenario. Active speaker detection is a fundamental prerequisite for any artificial cognitive system attempting to acquire language in social settings. The proposed method is intended to complement acoustic detection of the active speaker, thus improving system robustness in noisy conditions. The method can detect an arbitrary number of possibly overlapping active speakers based exclusively on visual information about their faces. Furthermore, the method does not rely on external annotations, which keeps it consistent with the constraints of cognitive development. Instead, the method uses information from the auditory modality to support learning in the visual domain. This paper reports an extensive evaluation of the proposed method using a large multi-person face-to-face interaction dataset. The results show good performance in a speaker-dependent setting. However, in a speaker-independent setting the proposed method yields significantly lower performance. We believe that the proposed method represents an essential component of any artificial cognitive system or robotic platform engaging in social interactions. Comment: 10 pages, IEEE Transactions on Cognitive and Developmental System

    Recover Subjective Quality Scores from Noisy Measurements

    Simple quality metrics such as PSNR are known not to correlate well with subjective quality when tested across a wide spectrum of video content or quality regimes. Recently, efforts have been made to design objective quality metrics trained on subjective data (e.g. VMAF), demonstrating better correlation with video quality as perceived by humans. Clearly, the accuracy of such a metric depends heavily on the quality of the subjective data that it is trained on. In this paper, we propose a new approach to recover subjective quality scores from noisy raw measurements using maximum likelihood estimation, jointly estimating the subjective quality of impaired videos, the bias and consistency of test subjects, and the ambiguity of video contents. We also derive a closed-form expression for the confidence interval of each estimate. Compared to previous methods, which only partially exploit the subjective information, our approach exploits the information in full, yielding tighter confidence intervals and better handling of outliers without the need for z-scoring or subject rejection. It also handles missing data more gracefully. Finally, as side information, it provides interesting insights into the test subjects and video contents. Comment: 16 pages; abridged version appeared in Data Compression Conference (DCC) 201
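
    The joint estimation described in this abstract can be illustrated with a small sketch. It is not the authors' implementation: it assumes a simplified model in which each raw score is u[e, s] = x[e] + b[s] + v[s] * noise, with x the true quality of video e, b the bias of subject s, and v the inconsistency of subject s, and it leaves out the content-ambiguity term and the confidence intervals mentioned above. The function name `recover_scores` and all parameter names are hypothetical. The likelihood is maximized by alternating closed-form updates, and missing ratings are passed as NaN.

```python
import numpy as np

def recover_scores(u, n_iter=50, eps=1e-6):
    """Alternating maximum-likelihood updates for the assumed model
    u[e, s] = x[e] + b[s] + v[s] * N(0, 1). NaN marks a missing rating."""
    u = np.asarray(u, dtype=float)
    mask = ~np.isnan(u)                  # True where a rating exists
    u0 = np.where(mask, u, 0.0)          # zero-fill so sums ignore missing cells
    n_subjects = u.shape[1]
    x = u0.sum(axis=1) / mask.sum(axis=1)  # start from the plain mean opinion score
    b = np.zeros(n_subjects)
    v = np.ones(n_subjects)
    for _ in range(n_iter):
        # Quality: inverse-variance weighted mean of bias-corrected scores.
        w = mask / v**2
        x = (w * (u0 - b)).sum(axis=1) / w.sum(axis=1)
        # Bias: mean residual per subject, centered to remove the shift ambiguity.
        b = (mask * (u0 - x[:, None])).sum(axis=0) / mask.sum(axis=0)
        shift = b.mean()
        b = b - shift
        x = x + shift
        # Inconsistency: per-subject residual standard deviation,
        # floored to avoid a degenerate zero-variance solution.
        r = mask * (u0 - x[:, None] - b)
        v = np.sqrt((r**2).sum(axis=0) / mask.sum(axis=0))
        v = np.maximum(v, eps)
    return x, b, v

# Usage: rows are videos, columns are subjects, NaN is a missing rating.
ratings = np.array([[4.0, 5.0, 3.0],
                    [2.0, 3.0, np.nan],
                    [3.0, 4.0, 5.0],
                    [1.0, 2.0, 1.0]])
quality, bias, inconsistency = recover_scores(ratings)
```

    Because inconsistent subjects receive small weights rather than being discarded, a sketch of this kind needs no separate z-scoring or subject-rejection step, which is the property the abstract highlights.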