13,853 research outputs found
Analysis on Using Synthesized Singing Techniques in Assistive Interfaces for Visually Impaired to Study Music
Tactile and auditory senses are the basic types of methods that visually impaired people sense the world. Their interaction with assistive technologies also focuses mainly on tactile and auditory interfaces. This research paper discuss about the validity of using most appropriate singing synthesizing techniques as a mediator in assistive technologies specifically built to address their music learning needs engaged with music scores and lyrics. Music scores with notations and lyrics are considered as the main mediators in musical communication channel which lies between a composer and a performer. Visually impaired music lovers have less opportunity to access this main mediator since most of them are in visual format. If we consider a music score, the vocal performer’s melody is married to all the pleasant sound producible in the form of singing. Singing best fits for a format in temporal domain compared to a tactile format in spatial domain. Therefore, conversion of existing visual format to a singing output will be the most appropriate nonlossy transition as proved by the initial research on adaptive music score trainer for visually impaired [1]. In order to extend the paths of this initial research, this study seek on existing singing synthesizing techniques and researches on auditory interfaces
Make-A-Voice: Unified Voice Synthesis With Discrete Representation
Various applications of voice synthesis have been developed independently
despite the fact that they generate "voice" as output in common. In addition,
the majority of voice synthesis models currently rely on annotated audio data,
but it is crucial to scale them to self-supervised datasets in order to
effectively capture the wide range of acoustic variations present in human
voice, including speaker identity, emotion, and prosody. In this work, we
propose Make-A-Voice, a unified framework for synthesizing and manipulating
voice signals from discrete representations. Make-A-Voice leverages a
"coarse-to-fine" approach to model the human voice, which involves three
stages: 1) semantic stage: model high-level transformation between linguistic
content and self-supervised semantic tokens, 2) acoustic stage: introduce
varying control signals as acoustic conditions for semantic-to-acoustic
modeling, and 3) generation stage: synthesize high-fidelity waveforms from
acoustic tokens. Make-A-Voice offers notable benefits as a unified voice
synthesis framework: 1) Data scalability: the major backbone (i.e., acoustic
and generation stage) does not require any annotations, and thus the training
data could be scaled up. 2) Controllability and conditioning flexibility: we
investigate different conditioning mechanisms and effectively handle three
voice synthesis applications, including text-to-speech (TTS), voice conversion
(VC), and singing voice synthesis (SVS) by re-synthesizing the discrete voice
representations with prompt guidance. Experimental results demonstrate that
Make-A-Voice exhibits superior audio quality and style similarity compared with
competitive baseline models. Audio samples are available at
https://Make-A-Voice.github.i
Capture, Learning, and Synthesis of 3D Speaking Styles
Audio-driven 3D facial animation has been widely explored, but achieving
realistic, human-like performance is still unsolved. This is due to the lack of
available 3D datasets, models, and standard evaluation metrics. To address
this, we introduce a unique 4D face dataset with about 29 minutes of 4D scans
captured at 60 fps and synchronized audio from 12 speakers. We then train a
neural network on our dataset that factors identity from facial motion. The
learned model, VOCA (Voice Operated Character Animation) takes any speech
signal as input - even speech in languages other than English - and
realistically animates a wide range of adult faces. Conditioning on subject
labels during training allows the model to learn a variety of realistic
speaking styles. VOCA also provides animator controls to alter speaking style,
identity-dependent facial shape, and pose (i.e. head, jaw, and eyeball
rotations) during animation. To our knowledge, VOCA is the only realistic 3D
facial animation model that is readily applicable to unseen subjects without
retargeting. This makes VOCA suitable for tasks like in-game video, virtual
reality avatars, or any scenario in which the speaker, speech, or language is
not known in advance. We make the dataset and model available for research
purposes at http://voca.is.tue.mpg.de.Comment: To appear in CVPR 201
Speech Processing in Computer Vision Applications
Deep learning has been recently proven to be a viable asset in determining features in the field of Speech Analysis. Deep learning methods like Convolutional Neural Networks facilitate the expansion of specific feature information in waveforms, allowing networks to create more feature dense representations of data. Our work attempts to address the problem of re-creating a face given a speaker\u27s voice and speaker identification using deep learning methods. In this work, we first review the fundamental background in speech processing and its related applications. Then we introduce novel deep learning-based methods to speech feature analysis. Finally, we will present our deep learning approaches to speaker identification and speech to face synthesis. The presented method can convert a speaker audio sample to an image of their predicted face. This framework is composed of several chained together networks, each with an essential step in the conversion process. These include Audio embedding, encoding, and face generation networks, respectively. Our experiments show that certain features can map to the face and that with a speaker\u27s voice, DNNs can create their face and that a GUI could be used in conjunction to display a speaker recognition network\u27s data
17 ways to say yes:Toward nuanced tone of voice in AAC and speech technology
People with complex communication needs who use speech-generating devices have very little expressive control over their tone of voice. Despite its importance in human interaction, the issue of tone of voice remains all but absent from AAC research and development however. In this paper, we describe three interdisciplinary projects, past, present and future: The critical design collection Six Speaking Chairs has provoked deeper discussion and inspired a social model of tone of voice; the speculative concept Speech Hedge illustrates challenges and opportunities in designing more expressive user interfaces; the pilot project Tonetable could enable participatory research and seed a research network around tone of voice. We speculate that more radical interactions might expand frontiers of AAC and disrupt speech technology as a whole
SYNTHESIZING DYSARTHRIC SPEECH USING MULTI-SPEAKER TTS FOR DSYARTHRIC SPEECH RECOGNITION
Dysarthria is a motor speech disorder often characterized by reduced speech intelligibility through slow, uncoordinated control of speech production muscles. Automatic Speech recognition (ASR) systems may help dysarthric talkers communicate more effectively. However, robust dysarthria-specific ASR requires a significant amount of training speech is required, which is not readily available for dysarthric talkers.
In this dissertation, we investigate dysarthric speech augmentation and synthesis methods. To better understand differences in prosodic and acoustic characteristics of dysarthric spontaneous speech at varying severity levels, a comparative study between typical and dysarthric speech was conducted. These characteristics are important components for dysarthric speech modeling, synthesis, and augmentation. For augmentation, prosodic transformation and time-feature masking have been proposed. For dysarthric speech synthesis, this dissertation has introduced a modified neural multi-talker TTS by adding a dysarthria severity level coefficient and a pause insertion model to synthesize dysarthric speech for varying severity levels. In addition, we have extended this work by using a label propagation technique to create more meaningful control variables such as a continuous Respiration, Laryngeal and Tongue (RLT) parameter, even for datasets that only provide discrete dysarthria severity level information. This approach increases the controllability of the system, so we are able to generate more dysarthric speech with a broader range.
To evaluate their effectiveness for synthesis of training data, dysarthria-specific speech recognition was used. Results show that a DNN-HMM model trained on additional synthetic dysarthric speech achieves WER improvement of 12.2% compared to the baseline, and that the addition of the severity level and pause insertion controls decrease WER by 6.5%, showing the effectiveness of adding these parameters. Overall results on the TORGO database demonstrate that using dysarthric synthetic speech to increase the amount of dysarthric-patterned speech for training has a significant impact on the dysarthric ASR systems
Synthetic Speaking Children -- Why We Need Them and How to Make Them
Contemporary Human Computer Interaction (HCI) research relies primarily on
neural network models for machine vision and speech understanding of a system
user. Such models require extensively annotated training datasets for optimal
performance and when building interfaces for users from a vulnerable population
such as young children, GDPR introduces significant complexities in data
collection, management, and processing. Motivated by the training needs of an
Edge AI smart toy platform this research explores the latest advances in
generative neural technologies and provides a working proof of concept of a
controllable data generation pipeline for speech driven facial training data at
scale. In this context, we demonstrate how StyleGAN2 can be finetuned to create
a gender balanced dataset of children's faces. This dataset includes a variety
of controllable factors such as facial expressions, age variations, facial
poses, and even speech-driven animations with realistic lip synchronization. By
combining generative text to speech models for child voice synthesis and a 3D
landmark based talking heads pipeline, we can generate highly realistic,
entirely synthetic, talking child video clips. These video clips can provide
valuable, and controllable, synthetic training data for neural network models,
bridging the gap when real data is scarce or restricted due to privacy
regulations.Comment: Presented at SpeD 2
- …