267 research outputs found

    Leveraging audio-visual speech effectively via deep learning

    Get PDF
    The rising popularity of neural networks, combined with the recent proliferation of online audio-visual media, has led to a revolution in the way machines encode, recognize, and generate acoustic and visual speech. Despite the ubiquity of naturally paired audio-visual data, only a limited number of works have applied recent advances in deep learning to leverage the duality between audio and video within this domain. This thesis considers the use of neural networks to learn from large unlabelled datasets of audio-visual speech to enable new practical applications. We begin by training a visual speech encoder that predicts latent features extracted from the corresponding audio on a large unlabelled audio-visual corpus. We apply the trained visual encoder to improve performance on lip reading in real-world scenarios. Following this, we extend the idea of video learning from audio by training a model to synthesize raw speech directly from raw video, without the need for text transcriptions. Remarkably, we find that this framework is capable of reconstructing intelligible audio from videos of new, previously unseen speakers. We also experiment with a separate speech reconstruction framework, which leverages recent advances in sequence modeling and spectrogram inversion to improve the realism of the generated speech. We then apply our research in video-to-speech synthesis to advance the state of the art in audio-visual speech enhancement, by proposing a new vocoder-based model that performs particularly well under extremely noisy scenarios. Lastly, we aim to fully realize the potential of paired audio-visual data by proposing two novel frameworks that leverage acoustic and visual speech to train two encoders that learn from each other simultaneously. We leverage these pre-trained encoders for deepfake detection, speech recognition, and lip reading, and find that they consistently yield improvements over training from scratch.
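    A minimal sketch of the cross-modal pretraining step described above, assuming a PyTorch setup; the VisualEncoder module, the tensor shapes, and the L1 regression loss are illustrative placeholders, not the architecture used in the thesis.

```python
# Minimal sketch of cross-modal pretraining: a visual encoder is trained to
# predict features extracted from the corresponding audio track.
# Module names, shapes and the loss are illustrative, not the thesis architecture.
import torch
import torch.nn as nn

class VisualEncoder(nn.Module):
    """Toy 3D-CNN over lip-region video: (B, C, T, H, W) -> (B, T, D)."""
    def __init__(self, feat_dim=80):
        super().__init__()
        self.conv = nn.Conv3d(3, 32, kernel_size=(1, 5, 5), stride=(1, 2, 2), padding=(0, 2, 2))
        self.pool = nn.AdaptiveAvgPool3d((None, 1, 1))   # keep the time axis
        self.proj = nn.Linear(32, feat_dim)

    def forward(self, video):
        x = torch.relu(self.conv(video))                 # (B, 32, T, H', W')
        x = self.pool(x).squeeze(-1).squeeze(-1)         # (B, 32, T)
        return self.proj(x.transpose(1, 2))              # (B, T, feat_dim)

encoder = VisualEncoder()
optimizer = torch.optim.Adam(encoder.parameters(), lr=1e-4)

# One unlabelled audio-visual batch: video frames and audio-derived targets
# (e.g. log-Mel frames aligned to the video frame rate).
video = torch.randn(4, 3, 25, 48, 48)       # batch of 1-second, 25 fps clips
audio_feats = torch.randn(4, 25, 80)        # matching per-frame audio features

pred = encoder(video)
loss = nn.functional.l1_loss(pred, audio_feats)   # regress audio features from video
loss.backward()
optimizer.step()
```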

    Detection of activity and position of speakers by using deep neural networks and acoustic data augmentation

    Get PDF
    Speaker LOCalization (SLOC) has been the focus of numerous works in this research field, where SLOC is typically performed on pure speech data and therefore assumes an oracle Voice Activity Detection (VAD) stage. This ideal working condition is not met in real-world scenarios, where deployed VADs do commit errors. This work addresses the issue with an extensive analysis of the relationship between several data-driven VAD and SLOC models, and ultimately proposes a reliable framework for joint VAD and SLOC. The effectiveness of the approach is assessed in a multi-room scenario that is close to a real-world environment. To the best of the authors' knowledge, only one previous contribution proposes a single framework for VAD and SLOC in this scenario, and that solution does not rely on data-driven approaches. This work extends the authors' previous research on the VAD and SLOC tasks by proposing numerous advancements to the original neural network architectures. In detail, four different models based on convolutional neural networks (CNNs) are tested, in order to clearly highlight the advantages of the introduced novelties. In addition, two different CNN models are studied for SLOC. Training of the data-driven models is further improved through a dedicated data augmentation technique: the room impulse responses (RIRs) of two virtual rooms are generated from the room size, the reverberation time, and the microphone and source placement. Finally, the only other framework for simultaneous detection and localization in a multi-room scenario is used as a baseline for a fair comparison. The proposed method is more accurate than this baseline, and remarkable improvements are observed especially when the data augmentation technique is applied to both the VAD and SLOC tasks.
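    A simplified sketch of RIR-based augmentation in Python; the exponential-decay noise model for the impulse response, the 16 kHz sampling rate, and the T60 values are assumptions for illustration, not the generator used by the authors, who simulate RIRs from room size, reverberation time, and microphone/source placement.

```python
# Sketch of reverberation-based data augmentation for VAD/SLOC training:
# a synthetic room impulse response with a target reverberation time (T60)
# is convolved with clean speech. The noise-decay RIR model is a simplification
# of a proper geometric room simulation.
import numpy as np
from scipy.signal import fftconvolve

def synthetic_rir(t60, fs=16000, length_s=0.5, seed=0):
    """Exponentially decaying noise whose energy drops by 60 dB after t60 seconds."""
    rng = np.random.default_rng(seed)
    n = int(length_s * fs)
    t = np.arange(n) / fs
    decay = np.exp(-6.908 * t / t60)          # ln(10^3) ~= 6.908 -> amplitude 1e-3 at t60
    rir = rng.standard_normal(n) * decay
    return rir / np.max(np.abs(rir))

def reverberate(speech, t60, fs=16000):
    """Apply a simulated room to a clean speech signal."""
    rir = synthetic_rir(t60, fs)
    wet = fftconvolve(speech, rir)[: len(speech)]
    return wet / (np.max(np.abs(wet)) + 1e-9)

# Example: augment one clean utterance with two virtual-room conditions.
fs = 16000
clean = np.random.default_rng(1).standard_normal(fs)   # 1 s of placeholder audio
augmented = [reverberate(clean, t60, fs) for t60 in (0.3, 0.7)]
```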

    Deep Learning for Audio Segmentation and Intelligent Remixing

    Get PDF
    Audio segmentation divides an audio signal into homogeneous sections such as music and speech. It is useful as a preprocessing step to index, store, and modify audio recordings, radio broadcasts and TV programmes. Machine learning models for audio segmentation are generally trained on copyrighted material, which cannot be shared across research groups. Furthermore, annotating these datasets is a time-consuming and expensive task. In this thesis, we present a novel approach that artificially synthesises data resembling radio signals. We replicate the workflow of a radio DJ in mixing audio and investigate parameters like fade curves and audio ducking. Using this approach, we obtain state-of-the-art performance for music-speech detection on in-house and public datasets. After demonstrating the efficacy of training-set synthesis, we investigate how audio ducking of background music impacts the precision and recall of the machine learning algorithm. Interestingly, we observe that the minimum level of audio ducking preferred by the machine learning algorithm is similar to that preferred by human listeners. Furthermore, we observe that our proposed synthesis technique outperforms real-world data in some cases and serves as a promising alternative. This project also proposes a novel deep learning system called You Only Hear Once (YOHO), inspired by the YOLO algorithm popularly adopted in computer vision. YOHO casts the detection of acoustic boundaries as a regression problem instead of frame-based classification. The relative improvement in F-measure of YOHO over the state-of-the-art convolutional recurrent neural network ranged from 1% to 6% across multiple datasets. As YOHO predicts acoustic boundaries directly, inference and post-processing are 6 times faster than with frame-based classification. Furthermore, we investigate domain generalisation methods such as transfer learning and adversarial training, and demonstrate that these methods help our algorithm perform better in unseen domains. In addition to audio segmentation, another objective of this project is to explore real-time radio remixing. This is a step towards building a customised radio that integrates with the listener's schedule. The system would remix music from the user's personal playlist and play snippets of diary reminders at appropriate transition points. The intelligent remixing is governed by the underlying audio segmentation and other deep learning methods. We also explore how individuals can communicate with intelligent mixing systems through non-technical language, and demonstrate that word embeddings help in understanding representations of semantic descriptors.
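    The regression formulation behind YOHO can be illustrated by how segment annotations might be encoded as training targets: each coarse time bin predicts, per class, an event-presence flag plus the normalized start and end of the event within that bin. The bin duration, class list, and encoding details below are a hedged sketch, not the exact YOHO specification.

```python
# Sketch of a YOHO-style target encoding: boundary detection as regression.
# For each output time bin and each class, the network predicts
# (presence, normalized start, normalized end) instead of frame-wise labels.
# Bin duration and class list are illustrative choices.
import numpy as np

CLASSES = ["speech", "music"]
BIN_SECONDS = 0.8          # duration covered by one output bin (assumed)

def encode_targets(segments, clip_seconds):
    """segments: list of (label, start_s, end_s) tuples within one audio clip."""
    n_bins = int(np.ceil(clip_seconds / BIN_SECONDS))
    target = np.zeros((n_bins, len(CLASSES), 3), dtype=np.float32)
    for label, start, end in segments:
        c = CLASSES.index(label)
        for b in range(n_bins):
            bin_start, bin_end = b * BIN_SECONDS, (b + 1) * BIN_SECONDS
            overlap_start = max(start, bin_start)
            overlap_end = min(end, bin_end)
            if overlap_end > overlap_start:                  # event touches this bin
                target[b, c, 0] = 1.0                        # presence
                target[b, c, 1] = (overlap_start - bin_start) / BIN_SECONDS
                target[b, c, 2] = (overlap_end - bin_start) / BIN_SECONDS
    return target

# Example: an 8-second clip with overlapping speech and music.
y = encode_targets([("speech", 0.0, 3.2), ("music", 2.5, 8.0)], clip_seconds=8.0)
print(y.shape)   # (10, 2, 3): 10 bins, 2 classes, (presence, start, end)
```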

    Supervised Adversarial Contrastive Learning for Emotion Recognition in Conversations

    Full text link
    Extracting generalized and robust representations is a major challenge in emotion recognition in conversations (ERC). To address this, we propose a supervised adversarial contrastive learning (SACL) framework for learning class-spread structured representations. The framework applies contrast-aware adversarial training to generate worst-case samples and uses a joint class-spread contrastive learning objective on both original and adversarial samples. It can effectively utilize label-level feature consistency and retain fine-grained intra-class features. To avoid the negative impact of adversarial perturbations on context-dependent data, we design a contextual adversarial training strategy to learn more diverse features from context and enhance the model's context robustness. Under this framework, we develop a sequence-based method, SACL-LSTM, to learn label-consistent and context-robust emotional features for ERC. Experiments on three datasets demonstrate that SACL-LSTM achieves state-of-the-art performance on ERC, and extended experiments further confirm the effectiveness of the SACL framework. Comment: 16 pages, accepted by ACL 2023.
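    A rough sketch of combining adversarial sample generation with a supervised contrastive objective; the FGSM-style perturbation and the plain supervised contrastive loss below are generic stand-ins for the paper's contrast-aware adversarial training and joint class-spread objective, and the small encoder is a placeholder rather than SACL-LSTM.

```python
# Rough sketch of supervised contrastive learning on original + adversarial
# features. The FGSM-style perturbation and the loss are generic stand-ins,
# not the exact SACL / SACL-LSTM formulation.
import torch
import torch.nn.functional as F

def sup_con_loss(features, labels, temperature=0.1):
    """Supervised contrastive loss: same-label samples are positives."""
    z = F.normalize(features, dim=1)
    sim = z @ z.t() / temperature
    self_mask = torch.eye(len(z), dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(self_mask, -1e9)                 # exclude self-similarity
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    positives = ((labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask).float()
    pos_counts = positives.sum(1).clamp(min=1)
    return -(log_prob * positives).sum(1).div(pos_counts).mean()

encoder = torch.nn.Sequential(torch.nn.Linear(128, 64), torch.nn.ReLU(), torch.nn.Linear(64, 32))
x = torch.randn(16, 128)                 # utterance-level features (placeholder)
labels = torch.randint(0, 4, (16,))      # emotion labels (placeholder)

# 1) Adversarial samples: perturb inputs along the gradient of the loss.
x_adv = x.clone().requires_grad_(True)
loss_clean = sup_con_loss(encoder(x_adv), labels)
grad, = torch.autograd.grad(loss_clean, x_adv)
x_adv = (x + 0.01 * grad.sign()).detach()          # worst-case (FGSM-like) samples

# 2) Joint objective on original and adversarial views.
feats = encoder(torch.cat([x, x_adv]))
joint_labels = torch.cat([labels, labels])
loss = sup_con_loss(feats, joint_labels)
loss.backward()
```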

    Deep Spoken Keyword Spotting:An Overview

    Get PDF
    Spoken keyword spotting (KWS) deals with the identification of keywords in audio streams and has become a fast-growing technology thanks to the paradigm shift introduced by deep learning a few years ago. This has allowed deep KWS to be rapidly embedded in a myriad of small electronic devices for purposes such as the activation of voice assistants, and prospects suggest sustained growth in the social use of this technology. It is therefore not surprising that deep KWS has become a hot research topic among speech scientists, who continually seek to improve KWS performance and reduce computational complexity. This context motivates the present paper, in which we conduct a literature review of deep spoken KWS to assist practitioners and researchers interested in this technology. The overview is comprehensive in scope, covering a thorough analysis of deep KWS systems (including speech features, acoustic modeling and posterior handling), robustness methods, applications, datasets, evaluation metrics, the performance of deep KWS systems, and audio-visual KWS. The analysis performed in this paper allows us to identify a number of directions for future research, including directions adopted from automatic speech recognition research and directions that are unique to the problem of spoken KWS.
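    The canonical deep KWS pipeline analysed in the overview (speech features, acoustic model, posterior handling) can be sketched as follows; the log-Mel-style input patches, the tiny CNN, and the moving-average smoothing are illustrative choices, not a specific system from the survey.

```python
# Minimal deep KWS pipeline sketch: log-Mel-style features -> small CNN acoustic
# model -> posterior smoothing and thresholding. All hyperparameters are
# illustrative, not taken from any specific system in the survey.
import numpy as np
import torch
import torch.nn as nn

class KeywordCNN(nn.Module):
    """Tiny CNN over a log-Mel patch: (B, 1, mels, frames) -> keyword posteriors."""
    def __init__(self, n_keywords=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
            nn.Flatten(), nn.Linear(16, n_keywords + 1),   # +1 for "non-keyword"
        )

    def forward(self, x):
        return self.net(x).softmax(dim=-1)

def smooth_posteriors(posteriors, window=5):
    """Moving-average posterior handling over a sliding-window stream."""
    kernel = np.ones(window) / window
    return np.apply_along_axis(lambda p: np.convolve(p, kernel, mode="same"), 0, posteriors)

model = KeywordCNN()
patches = torch.randn(20, 1, 40, 10)             # stream of 40-mel x 10-frame patches
with torch.no_grad():
    post = model(patches).numpy()                # (20, 4) per-patch posteriors
smoothed = smooth_posteriors(post)
detections = smoothed[:, :-1].max(axis=1) > 0.5  # fire when any keyword exceeds threshold
```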

    Artificial Intelligence in the Creative Industries: A Review

    Full text link
    This paper reviews the current state of the art in Artificial Intelligence (AI) technologies and applications in the context of the creative industries. A brief background of AI, and specifically Machine Learning (ML) algorithms, is provided, including Convolutional Neural Networks (CNNs), Generative Adversarial Networks (GANs), Recurrent Neural Networks (RNNs) and Deep Reinforcement Learning (DRL). We categorise creative applications into five groups related to how AI technologies are used: i) content creation, ii) information analysis, iii) content enhancement and post-production workflows, iv) information extraction and enhancement, and v) data compression. We critically examine the successes and limitations of this rapidly advancing technology in each of these areas, and further differentiate between the use of AI as a creative tool and its potential as a creator in its own right. We foresee that, in the near future, machine learning-based AI will be adopted widely as a tool or collaborative assistant for creativity. In contrast, we observe that the successes of machine learning in domains with fewer constraints, where AI is the 'creator', remain modest. The potential of AI (or its developers) to win awards for its original creations in competition with human creatives is also limited, based on contemporary technologies. We therefore conclude that, in the context of the creative industries, maximum benefit from AI will be derived where its focus is human-centric: where it is designed to augment, rather than replace, human creativity.