2,034 research outputs found

    Audio-Visual Speaker Verification via Joint Cross-Attention

    Full text link
    Speaker verification has been widely explored using speech signals, which has shown significant improvement using deep models. Recently, there has been a surge in exploring faces and voices as they can offer more complementary and comprehensive information than relying only on a single modality of speech signals. Though current methods in the literature on the fusion of faces and voices have shown improvement over that of individual face or voice modalities, the potential of audio-visual fusion is not fully explored for speaker verification. Most of the existing methods based on audio-visual fusion either rely on score-level fusion or simple feature concatenation. In this work, we have explored cross-modal joint attention to fully leverage the inter-modal complementary information and the intra-modal information for speaker verification. Specifically, we estimate the cross-attention weights based on the correlation between the joint feature presentation and that of the individual feature representations in order to effectively capture both intra-modal as well inter-modal relationships among the faces and voices. We have shown that efficiently leveraging the intra- and inter-modal relationships significantly improves the performance of audio-visual fusion for speaker verification. The performance of the proposed approach has been evaluated on the Voxceleb1 dataset. Results show that the proposed approach can significantly outperform the state-of-the-art methods of audio-visual fusion for speaker verification

    Multi-Level Liveness Verification for Face-Voice Biometric Authentication

    Get PDF
    In this paper we present the details of the multilevel liveness verification (MLLV) framework proposed for realizing a secure face-voice biometric authentication system that can thwart different types of audio and video replay attacks. The proposed MLLV framework based on novel feature extraction and multimodal fusion approaches, uncovers the static and dynamic relationship between voice and face information from speaking faces, and allows multiple levels of security. Experiments with three different speaking corpora VidTIMIT, UCBN and AVOZES shows a significant improvement in system performance in terms of DET curves and equal error rates(EER) for different types of replay and synthesis attacks

    Audio-coupled video content understanding of unconstrained video sequences

    Get PDF
    Unconstrained video understanding is a difficult task. The main aim of this thesis is to recognise the nature of objects, activities and environment in a given video clip using both audio and video information. Traditionally, audio and video information has not been applied together for solving such complex task, and for the first time we propose, develop, implement and test a new framework of multi-modal (audio and video) data analysis for context understanding and labelling of unconstrained videos. The framework relies on feature selection techniques and introduces a novel algorithm (PCFS) that is faster than the well-established SFFS algorithm. We use the framework for studying the benefits of combining audio and video information in a number of different problems. We begin by developing two independent content recognition modules. The first one is based on image sequence analysis alone, and uses a range of colour, shape, texture and statistical features from image regions with a trained classifier to recognise the identity of objects, activities and environment present. The second module uses audio information only, and recognises activities and environment. Both of these approaches are preceded by detailed pre-processing to ensure that correct video segments containing both audio and video content are present, and that the developed system can be made robust to changes in camera movement, illumination, random object behaviour etc. For both audio and video analysis, we use a hierarchical approach of multi-stage classification such that difficult classification tasks can be decomposed into simpler and smaller tasks. When combining both modalities, we compare fusion techniques at different levels of integration and propose a novel algorithm that combines advantages of both feature and decision-level fusion. The analysis is evaluated on a large amount of test data comprising unconstrained videos collected for this work. We finally, propose a decision correction algorithm which shows that further steps towards combining multi-modal classification information effectively with semantic knowledge generates the best possible results

    Learning weakly supervised multimodal phoneme embeddings

    Full text link
    Recent works have explored deep architectures for learning multimodal speech representation (e.g. audio and images, articulation and audio) in a supervised way. Here we investigate the role of combining different speech modalities, i.e. audio and visual information representing the lips movements, in a weakly supervised way using Siamese networks and lexical same-different side information. In particular, we ask whether one modality can benefit from the other to provide a richer representation for phone recognition in a weakly supervised setting. We introduce mono-task and multi-task methods for merging speech and visual modalities for phone recognition. The mono-task learning consists in applying a Siamese network on the concatenation of the two modalities, while the multi-task learning receives several different combinations of modalities at train time. We show that multi-task learning enhances discriminability for visual and multimodal inputs while minimally impacting auditory inputs. Furthermore, we present a qualitative analysis of the obtained phone embeddings, and show that cross-modal visual input can improve the discriminability of phonological features which are visually discernable (rounding, open/close, labial place of articulation), resulting in representations that are closer to abstract linguistic features than those based on audio only

    AVA-AVD: Audio-Visual Speaker Diarization in the Wild

    Full text link
    Audio-visual speaker diarization aims at detecting "who spoke when" using both auditory and visual signals. Existing audio-visual diarization datasets are mainly focused on indoor environments like meeting rooms or news studios, which are quite different from in-the-wild videos in many scenarios such as movies, documentaries, and audience sitcoms. To develop diarization methods for these challenging videos, we create the AVA Audio-Visual Diarization (AVA-AVD) dataset. Our experiments demonstrate that adding AVA-AVD into training set can produce significantly better diarization models for in-the-wild videos despite that the data is relatively small. Moreover, this benchmark is challenging due to the diverse scenes, complicated acoustic conditions, and completely off-screen speakers. As a first step towards addressing the challenges, we design the Audio-Visual Relation Network (AVR-Net) which introduces a simple yet effective modality mask to capture discriminative information based on face visibility. Experiments show that our method not only can outperform state-of-the-art methods but is more robust as varying the ratio of off-screen speakers. Our data and code has been made publicly available at https://github.com/showlab/AVA-AVD.Comment: ACMMM 202

    Biometric liveness checking using multimodal fuzzy fusion

    Get PDF

    Processing and Linking Audio Events in Large Multimedia Archives: The EU inEvent Project

    Get PDF
    In the inEvent EU project [1], we aim at structuring, retrieving, and sharing large archives of networked, and dynamically changing, multimedia recordings, mainly consisting of meetings, videoconferences, and lectures. More specifically, we are developing an integrated system that performs audiovisual processing of multimedia recordings, and labels them in terms of interconnected “hyper-events ” (a notion inspired from hyper-texts). Each hyper-event is composed of simpler facets, including audio-video recordings and metadata, which are then easier to search, retrieve and share. In the present paper, we mainly cover the audio processing aspects of the system, including speech recognition, speaker diarization and linking (across recordings), the use of these features for hyper-event indexing and recommendation, and the search portal. We present initial results for feature extraction from lecture recordings using the TED talks. Index Terms: Networked multimedia events; audio processing: speech recognition; speaker diarization and linking; multimedia indexing and searching; hyper-events. 1
    • 

    corecore