7 research outputs found

    LiRA: Learning Visual Speech Representations from Audio through Self-supervision

    The large amount of audiovisual content being shared online today has drawn substantial attention to the prospect of audiovisual self-supervised learning. Recent works have focused on each of these modalities separately, while others have attempted to model both simultaneously in a cross-modal fashion. However, comparatively little attention has been given to leveraging one modality as a training objective to learn from the other. In this work, we propose Learning visual speech Representations from Audio via self-supervision (LiRA). Specifically, we train a ResNet+Conformer model to predict acoustic features from unlabelled visual speech. We find that this pre-trained model can be leveraged towards word-level and sentence-level lip-reading through feature extraction and fine-tuning experiments. We show that our approach significantly outperforms other self-supervised methods on the Lip Reading in the Wild (LRW) dataset and achieves state-of-the-art performance on Lip Reading Sentences 2 (LRS2) using only a fraction of the total labelled data. Comment: Accepted for publication at Interspeech 202
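
    As a rough illustration of the pretext task described above, the sketch below regresses per-frame acoustic features from lip video with an L1 loss. The module shapes, the GRU standing in for the Conformer, and the 80-dimensional acoustic target are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of an audio-supervised visual pretext task in the spirit of LiRA:
# a visual encoder predicts per-frame acoustic features from lip videos, so the audio
# stream acts as the self-supervision signal. All sizes are illustrative assumptions.
import torch
import torch.nn as nn

class VisualToAudioPretext(nn.Module):
    def __init__(self, feat_dim=256, acoustic_dim=80):
        super().__init__()
        # Stand-in visual front-end (a 3D-conv + ResNet trunk in the paper).
        self.frontend = nn.Sequential(
            nn.Conv3d(1, 64, kernel_size=(5, 7, 7), stride=(1, 2, 2), padding=(2, 3, 3)),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d((None, 1, 1)),  # keep the time axis, pool space away
        )
        # Stand-in temporal encoder (a Conformer in the paper).
        self.temporal = nn.GRU(64, feat_dim, batch_first=True, bidirectional=True)
        # Regression head predicting acoustic features for each video frame.
        self.head = nn.Linear(2 * feat_dim, acoustic_dim)

    def forward(self, lip_video):
        # lip_video: (batch, 1, time, height, width) grayscale mouth crops
        x = self.frontend(lip_video).squeeze(-1).squeeze(-1)   # (batch, 64, time)
        x, _ = self.temporal(x.transpose(1, 2))                # (batch, time, 2*feat_dim)
        return self.head(x)                                    # (batch, time, acoustic_dim)

model = VisualToAudioPretext()
video = torch.randn(2, 1, 29, 88, 88)      # roughly one second of lip frames
target = torch.randn(2, 29, 80)            # acoustic features aligned to the frames
loss = nn.functional.l1_loss(model(video), target)
loss.backward()
```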

    SLNSpeech: solving extended speech separation problem by the help of sign language

    A speech separation task can be roughly divided into audio-only separation and audio-visual separation. To make speech separation technology applicable to the real-world scenarios of people with disabilities, this paper presents an extended speech separation problem, which refers specifically to sign language assisted speech separation. However, most existing datasets for speech separation contain only audio and/or visual modalities. To address the extended speech separation problem, we introduce a large-scale dataset named the Sign Language News Speech (SLNSpeech) dataset, in which the three modalities of audio, video, and sign language coexist. We then design a general deep learning network for self-supervised learning of the three modalities, in particular using sign language embeddings together with audio or audio-visual information to better solve the speech separation task. Specifically, we use a 3D residual convolutional network to extract sign language features and a pretrained VGGNet model to extract visual features. An improved U-Net with skip connections in the feature extraction stage is then applied to learn joint embeddings of the mixture spectrogram computed from the source audios, the sign language features, and the visual features. Experimental results show that, besides the visual modality, the sign language modality alone can also supervise the speech separation task. Moreover, we show the effectiveness of sign language assisted speech separation when the visual modality is disturbed. Source code will be released at http://cheertt.top/homepage/ Comment: 33 pages, 8 figures, 5 tables
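
    A minimal sketch of the conditioning idea follows: sign-language embeddings are fused with an encoded mixture spectrogram and a soft mask is decoded back to spectrogram resolution. Layer counts, feature sizes, and the fusion point are illustrative assumptions rather than the paper's U-Net.

```python
# Sketch of conditioning a spectrogram mask estimator on sign-language embeddings,
# in the spirit of the SLNSpeech setup. Depths and dimensions are assumptions.
import torch
import torch.nn as nn

class SignConditionedSeparator(nn.Module):
    def __init__(self, sign_dim=512, base=32):
        super().__init__()
        # Shallow stand-in for the U-Net encoder over the mixture spectrogram.
        self.enc = nn.Sequential(
            nn.Conv2d(1, base, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(base, 2 * base, 4, stride=2, padding=1), nn.ReLU(),
        )
        # Project sign-language embeddings (e.g. from a 3D ResNet) for fusion.
        self.sign_proj = nn.Linear(sign_dim, 2 * base)
        # Decoder predicting a soft mask at the original spectrogram resolution.
        self.dec = nn.Sequential(
            nn.ConvTranspose2d(4 * base, base, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(base, 1, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, mix_spec, sign_emb):
        # mix_spec: (batch, 1, freq, time); sign_emb: (batch, sign_dim)
        h = self.enc(mix_spec)                                    # (batch, 2*base, F/4, T/4)
        s = self.sign_proj(sign_emb)[:, :, None, None].expand_as(h)
        mask = self.dec(torch.cat([h, s], dim=1))                 # (batch, 1, freq, time)
        return mask * mix_spec                                    # masked mixture = estimate

sep = SignConditionedSeparator()
estimate = sep(torch.randn(2, 1, 256, 128), torch.randn(2, 512))  # (2, 1, 256, 128)
```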

    Don't miss the Mismatch: Investigating the Objective Function Mismatch for Unsupervised Representation Learning

    Finding general evaluation metrics for unsupervised representation learning techniques is a challenging open research question, which has recently become increasingly pressing due to the growing interest in unsupervised methods. Even though these methods promise beneficial representation characteristics, most approaches currently suffer from the objective function mismatch: performance on a desired target task can decrease when the unsupervised pretext task is trained for too long, especially when both tasks are ill-posed. In this work, we build upon the widely used linear evaluation protocol and define new general evaluation metrics to quantitatively capture the objective function mismatch and the more generic metrics mismatch. We discuss the usability and stability of our protocols on a variety of pretext and target tasks and study mismatches in a wide range of experiments. In doing so, we disclose dependencies of the objective function mismatch across several pretext and target tasks with respect to the pretext model's representation size, target model complexity, pretext and target augmentations, as well as pretext and target task types. Comment: 21 pages, 17 figures
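
    The sketch below illustrates the linear evaluation protocol the metrics build on: freeze the pretext encoder at successive checkpoints, fit a linear probe on the target task, and track target accuracy over pretext training time; a drop at later checkpoints would indicate the mismatch. The synthetic data and the toy frozen encoder are placeholders, not the paper's setup.

```python
# Linear evaluation over pretext checkpoints to surface the objective function mismatch.
# Data and "encoder" are synthetic stand-ins; only the protocol structure is illustrated.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(500, 64)), rng.integers(0, 10, size=500)
X_test, y_test = rng.normal(size=(200, 64)), rng.integers(0, 10, size=200)

def frozen_encoder(x, checkpoint):
    """Stand-in for features from a frozen pretext encoder at a given checkpoint."""
    proj_rng = np.random.default_rng(checkpoint)   # same projection for train and test
    return x @ proj_rng.normal(size=(64, 32))

target_accuracy = []
for ckpt in range(5):                              # pretext checkpoints over training time
    f_train = frozen_encoder(X_train, ckpt)
    f_test = frozen_encoder(X_test, ckpt)
    probe = LogisticRegression(max_iter=1000).fit(f_train, y_train)   # linear probe
    target_accuracy.append(probe.score(f_test, y_test))

# A decline at later checkpoints would signal the objective function mismatch.
print(target_accuracy)
```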

    Survey of deep representation learning for speech emotion recognition

    Traditionally, speech emotion recognition (SER) research has relied on manually handcrafted acoustic features using feature engineering. However, the design of handcrafted features for complex SER tasks requires significant manual effort, which impedes generalisability and slows the pace of innovation. This has motivated the adoption of representation learning techniques that can automatically learn an intermediate representation of the input signal without any manual feature engineering. Representation learning has led to improved SER performance and enabled rapid innovation. Its effectiveness has further increased with advances in deep learning (DL), which has facilitated deep representation learning, where hierarchical representations are automatically learned in a data-driven manner. This paper presents the first comprehensive survey on the important topic of deep representation learning for SER. We highlight various techniques and related challenges, and identify important areas for future research. Our survey bridges a gap in the literature, since existing surveys either focus on SER with hand-engineered features or on representation learning in the general setting without focusing on SER.
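
    To make the contrast concrete, the snippet below places a handcrafted feature pipeline (MFCCs via librosa) next to a toy learned encoder whose parameters would be optimised for the downstream task. Both the dummy waveform and the encoder are illustrative stand-ins, not methods from the survey.

```python
# Handcrafted acoustic features versus a learned representation, side by side.
import numpy as np
import librosa
import torch
import torch.nn as nn

# Handcrafted features: a fixed, engineered transform of the waveform.
waveform = np.random.randn(16000).astype(np.float32)               # 1 s of dummy audio
mfcc = librosa.feature.mfcc(y=waveform, sr=16000, n_mfcc=13)        # (13, frames)

# Learned representation: the encoder's parameters are trained for the target task.
encoder = nn.Sequential(
    nn.Conv1d(1, 32, kernel_size=400, stride=160), nn.ReLU(),       # frame-level filters
    nn.AdaptiveAvgPool1d(1), nn.Flatten(), nn.Linear(32, 64),
)
embedding = encoder(torch.from_numpy(waveform)[None, None, :])      # (1, 64)
```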

    Graph neural network for audio representation learning

    Learning audio representations is an important task with many potential applications. Whether it takes the shape of speech, music, or ambient sounds, audio is a common form of data that can convey rich information. Audio representation learning is also a fundamental ingredient of deep learning, yet learning a good representation remains challenging. Good audio representations can enable more accurate downstream tasks in both audio and video, such as emotion recognition. Such a representation should contain the information needed to understand the input sound and to form discriminative patterns. This typically necessitates a sizable volume of carefully annotated data, which requires considerable labour. In this thesis, we propose a set of models for audio representation learning. We address the learning of discriminative patterns by proposing a graph structure over the audio data and graph neural networks to process it; our work is the first to consider a graph structure for audio data. In contrast to existing methods that rely on approximation, our first model uses a manually constructed graph structure and a graph convolution layer with an exact graph convolution operation. In the second model, by integrating a graph inception network, we extend the manually created graph structure and learn it jointly with the primary objective. In the third model, we address the dearth of annotated data with a semi-supervised graph technique that represents audio samples as nodes in a graph and connects them according to label information within smaller subgraphs. Going beyond earlier works, we also investigate leveraging multimodal data to improve audio representation learning: our fourth model incorporates heterogeneous graph data to accommodate multimodal inputs, and we design a new graph architecture to handle such data.
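
    As a toy illustration of treating audio clips as graph nodes, the sketch below builds a cosine-similarity k-nearest-neighbour graph over clip embeddings and applies one mean-aggregation graph convolution. The graph construction and the single GCN-style layer are simplifying assumptions, not the architectures proposed in the thesis.

```python
# Audio clips as graph nodes: k-NN graph from cosine similarity, one graph convolution.
import torch

def knn_adjacency(feats, k=4):
    """Symmetric k-nearest-neighbour adjacency (with self-loops) from cosine similarity."""
    normed = torch.nn.functional.normalize(feats, dim=1)
    sim = normed @ normed.T                                # (num_clips, num_clips)
    idx = sim.topk(k + 1, dim=1).indices                   # top-k neighbours plus the node itself
    adj = torch.zeros_like(sim).scatter_(1, idx, 1.0)
    return ((adj + adj.T) > 0).float()                     # symmetrise

def graph_conv(feats, adj, weight):
    """One GCN-style layer: degree-normalised neighbour averaging, then a linear map."""
    deg = adj.sum(dim=1, keepdim=True)
    return torch.relu(((adj @ feats) / deg) @ weight)

clip_feats = torch.randn(16, 128)                          # e.g. per-clip audio embeddings
adj = knn_adjacency(clip_feats)
weight = torch.randn(128, 64) * 0.1
node_repr = graph_conv(clip_feats, adj, weight)            # (16, 64) graph-smoothed features
```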