135 research outputs found

    Speech-like Emotional Sound Generator by WaveNet

    Get PDF
    In this paper, we propose a new algorithm to generate Speech-like Emotional Sound (SES). Emotional information plays an important role in human communication, and speech is one of the most useful media to express emotions. Although, in general, speech conveys emotional information as well as linguistic information, we have undertaken the challenge to generate sounds that convey emotional information without linguistic information, which results in making conversations in human-machine interactions more natural in some situations by providing non-verbal emotional vocalizations. We call the generated sounds “speech-like”, because the sounds do not contain any linguistic information. For the purpose, we propose to employ WaveNet as a sound generator conditioned by only emotional IDs. The idea is quite different from WaveNet Vocoder that synthesizes speech using spectrum information as auxiliary features. The biggest advantage of the idea is to reduce the amount of emotional speech data for the training. The proposed algorithm consists of two steps. In the first step, WaveNet is trained to obtain phonetic features using a large speech database, and in the second step, WaveNet is re-trained using a small amount of emotional speech. Subjective listening evaluations showed that the SES could convey emotional information and was judged to sound like a human voice

    Reimagining Speech: A Scoping Review of Deep Learning-Powered Voice Conversion

    Full text link
    Research on deep learning-powered voice conversion (VC) in speech-to-speech scenarios is getting increasingly popular. Although many of the works in the field of voice conversion share a common global pipeline, there is a considerable diversity in the underlying structures, methods, and neural sub-blocks used across research efforts. Thus, obtaining a comprehensive understanding of the reasons behind the choice of the different methods in the voice conversion pipeline can be challenging, and the actual hurdles in the proposed solutions are often unclear. To shed light on these aspects, this paper presents a scoping review that explores the use of deep learning in speech analysis, synthesis, and disentangled speech representation learning within modern voice conversion systems. We screened 621 publications from more than 38 different venues between the years 2017 and 2023, followed by an in-depth review of a final database consisting of 123 eligible studies. Based on the review, we summarise the most frequently used approaches to voice conversion based on deep learning and highlight common pitfalls within the community. Lastly, we condense the knowledge gathered, identify main challenges and provide recommendations for future research directions

    An Overview of Affective Speech Synthesis and Conversion in the Deep Learning Era

    Get PDF
    Speech is the fundamental mode of human communication, and its synthesis has long been a core priority in human-computer interaction research. In recent years, machines have managed to master the art of generating speech that is understandable by humans. But the linguistic content of an utterance encompasses only a part of its meaning. Affect, or expressivity, has the capacity to turn speech into a medium capable of conveying intimate thoughts, feelings, and emotions -- aspects that are essential for engaging and naturalistic interpersonal communication. While the goal of imparting expressivity to synthesised utterances has so far remained elusive, following recent advances in text-to-speech synthesis, a paradigm shift is well under way in the fields of affective speech synthesis and conversion as well. Deep learning, as the technology which underlies most of the recent advances in artificial intelligence, is spearheading these efforts. In the present overview, we outline ongoing trends and summarise state-of-the-art approaches in an attempt to provide a comprehensive overview of this exciting field.Comment: Submitted to the Proceedings of IEE

    Generative models for music using transformer architectures

    Get PDF
    openThis thesis focus on growth and impact of Transformes architectures which are mainly used for Natural Language Processing tasks for Audio generation. We think that music, with its notes, chords, and volumes, is a language. You could think of symbolic representation of music as human language. A brief sound synthesis history which gives basic foundation for modern AI-generated music models is mentioned . The most recent in AI-generated audio is carefully studied and instances of AI-generated music is told in many contexts. Deep learning models and their applications to real-world issues are one of the key subjects that are covered. The main areas of interest include transformer-based audio generation, including the training procedure, encoding and decoding techniques, and post-processing stages. Transformers have several key advantages, including long-term consistency and the ability to create minute-long audio compositions. Numerous studies on the various representations of music have been explained, including how neural network and deep learning techniques can be used to apply symbolic melodies, musical arrangements, style transfer, and sound production. This thesis largely focuses on transformation models, but it also recognises the importance of numerous AI-based generative models, including GAN. Overall, this thesis enhances generative models for music composition and provides a complete understanding of transformer design. It shows the possibilities of AI-generated sound synthesis by emphasising the most current developments.This thesis focus on growth and impact of Transformes architectures which are mainly used for Natural Language Processing tasks for Audio generation. We think that music, with its notes, chords, and volumes, is a language. You could think of symbolic representation of music as human language. A brief sound synthesis history which gives basic foundation for modern AI-generated music models is mentioned . The most recent in AI-generated audio is carefully studied and instances of AI-generated music is told in many contexts. Deep learning models and their applications to real-world issues are one of the key subjects that are covered. The main areas of interest include transformer-based audio generation, including the training procedure, encoding and decoding techniques, and post-processing stages. Transformers have several key advantages, including long-term consistency and the ability to create minute-long audio compositions. Numerous studies on the various representations of music have been explained, including how neural network and deep learning techniques can be used to apply symbolic melodies, musical arrangements, style transfer, and sound production. This thesis largely focuses on transformation models, but it also recognises the importance of numerous AI-based generative models, including GAN. Overall, this thesis enhances generative models for music composition and provides a complete understanding of transformer design. It shows the possibilities of AI-generated sound synthesis by emphasising the most current developments

    Synthetic Speaking Children -- Why We Need Them and How to Make Them

    Full text link
    Contemporary Human Computer Interaction (HCI) research relies primarily on neural network models for machine vision and speech understanding of a system user. Such models require extensively annotated training datasets for optimal performance and when building interfaces for users from a vulnerable population such as young children, GDPR introduces significant complexities in data collection, management, and processing. Motivated by the training needs of an Edge AI smart toy platform this research explores the latest advances in generative neural technologies and provides a working proof of concept of a controllable data generation pipeline for speech driven facial training data at scale. In this context, we demonstrate how StyleGAN2 can be finetuned to create a gender balanced dataset of children's faces. This dataset includes a variety of controllable factors such as facial expressions, age variations, facial poses, and even speech-driven animations with realistic lip synchronization. By combining generative text to speech models for child voice synthesis and a 3D landmark based talking heads pipeline, we can generate highly realistic, entirely synthetic, talking child video clips. These video clips can provide valuable, and controllable, synthetic training data for neural network models, bridging the gap when real data is scarce or restricted due to privacy regulations.Comment: Presented at SpeD 2

    Foley Music: Learning to Generate Music from Videos

    Full text link
    In this paper, we introduce Foley Music, a system that can synthesize plausible music for a silent video clip about people playing musical instruments. We first identify two key intermediate representations for a successful video to music generator: body keypoints from videos and MIDI events from audio recordings. We then formulate music generation from videos as a motion-to-MIDI translation problem. We present a Graph-Transformer framework that can accurately predict MIDI event sequences in accordance with the body movements. The MIDI event can then be converted to realistic music using an off-the-shelf music synthesizer tool. We demonstrate the effectiveness of our models on videos containing a variety of music performances. Experimental results show that our model outperforms several existing systems in generating music that is pleasant to listen to. More importantly, the MIDI representations are fully interpretable and transparent, thus enabling us to perform music editing flexibly. We encourage the readers to watch the demo video with audio turned on to experience the results.Comment: ECCV 2020. Project page: http://foley-music.csail.mit.ed
    corecore