Deep neural network techniques for monaural speech enhancement: state of the art analysis
Deep neural network (DNN) techniques have become pervasive in domains such
as natural language processing and computer vision, where they have achieved
great success in tasks such as machine translation and image generation. Owing
to this success, these data-driven techniques have also been applied in the
audio domain. More specifically, DNN models have been applied to speech
enhancement to achieve denoising, dereverberation, and multi-speaker
separation in the monaural setting. In this paper, we review the dominant DNN
techniques employed to achieve speech separation. The review covers the whole
speech-enhancement pipeline: feature extraction, how DNN-based models capture
both global and local features of speech, and model training (supervised and
unsupervised). We also review the use of pre-trained speech-enhancement models
to boost the enhancement process. The review is geared towards covering the
dominant trends in the application of DNNs to monaural speech enhancement.
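To make the reviewed pipeline concrete, below is a minimal sketch of the supervised masking approach such surveys cover: an STFT front end, a DNN that predicts a time-frequency mask, and an inverse STFT back to a waveform. All shapes, layer sizes, and the toy input are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch of mask-based monaural speech enhancement (assumed shapes).
import torch

n_fft, hop = 512, 128
noisy = torch.randn(16000)                      # stand-in: 1 s of noisy speech at 16 kHz
window = torch.hann_window(n_fft)

spec = torch.stft(noisy, n_fft, hop, window=window, return_complex=True)
mag = spec.abs()                                # (freq, frames) magnitude features

# Hypothetical mask estimator: any sequence model (LSTM, conformer, ...) fits here.
mask_net = torch.nn.Sequential(
    torch.nn.Linear(n_fft // 2 + 1, 256),
    torch.nn.ReLU(),
    torch.nn.Linear(256, n_fft // 2 + 1),
    torch.nn.Sigmoid(),                         # bounded mask in [0, 1]
)
mask = mask_net(mag.T).T                        # one mask value per TF bin

enhanced_spec = mask * spec                     # apply mask, keep the noisy phase
enhanced = torch.istft(enhanced_spec, n_fft, hop, window=window)
```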
Deep Learning for Audio Signal Processing
Given the recent surge in developments of deep learning, this article
provides a review of the state-of-the-art deep learning techniques for audio
signal processing. Speech, music, and environmental sound processing are
considered side-by-side, in order to point out similarities and differences
between the domains, highlighting general methods, problems, key references,
and potential for cross-fertilization between areas. The dominant feature
representations (in particular, log-mel spectra and raw waveform) and deep
learning models are reviewed, including convolutional neural networks, variants
of the long short-term memory architecture, as well as more audio-specific
neural network models. Subsequently, prominent deep learning application areas
are covered, i.e., audio recognition (automatic speech recognition, music
information retrieval, environmental sound detection, localization and
tracking) and synthesis and transformation (source separation, audio
enhancement, generative models for speech, sound, and music synthesis).
Finally, key issues and future questions regarding deep learning applied to
audio signal processing are identified.
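As an illustration of the dominant feature representations the article names, here is a short sketch computing a log-mel spectrogram from a raw waveform with librosa; the file name and parameter values are placeholders.

```python
# Raw waveform vs. log-mel spectrogram front end (placeholder file and params).
import librosa

waveform, sr = librosa.load("speech.wav", sr=16000)  # raw waveform input

mel = librosa.feature.melspectrogram(
    y=waveform, sr=sr, n_fft=512, hop_length=128, n_mels=64
)
log_mel = librosa.power_to_db(mel)                   # (n_mels, frames)

print(waveform.shape, log_mel.shape)
```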
CMGAN: Conformer-Based Metric-GAN for Monaural Speech Enhancement
Convolution-augmented transformers (Conformers) have recently been proposed for
various speech-domain applications, such as automatic speech recognition (ASR)
and speech separation, as they can capture both local and global dependencies.
In this paper, we propose a conformer-based metric generative adversarial
network (CMGAN) for speech enhancement (SE) in the time-frequency (TF) domain.
The generator encodes the magnitude and complex spectrogram information using
two-stage conformer blocks to model both time and frequency dependencies. The
decoder then decouples the estimation into a magnitude mask decoder branch to
filter out unwanted distortions and a complex refinement branch to further
improve the magnitude estimation and implicitly enhance the phase information.
Additionally, we include a metric discriminator to alleviate metric mismatch by
optimizing the generator with respect to a corresponding evaluation score.
Objective and subjective evaluations illustrate that CMGAN achieves superior
performance compared to state-of-the-art methods in three speech enhancement
tasks (denoising, dereverberation, and super-resolution). For instance, a
quantitative denoising analysis on the Voice Bank+DEMAND dataset indicates
that CMGAN outperforms various previous models by a clear margin, with a PESQ
of 3.41 and an SSNR of 11.10 dB.
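The decoupled output stage described in the abstract can be sketched as follows. This is a minimal illustration of the mask-plus-complex-refinement idea, not the authors' CMGAN implementation; the encoder and conformer blocks are abstracted into a placeholder tensor, and all shapes and layer choices are assumptions.

```python
# Sketch of a decoupled decoder: magnitude mask branch + complex refinement branch.
import torch

B, F, T = 1, 257, 100
noisy = torch.randn(B, F, T, dtype=torch.complex64)    # noisy TF spectrogram
features = torch.randn(B, 64, F, T)                    # stand-in for conformer output

mask_branch = torch.nn.Sequential(
    torch.nn.Conv2d(64, 1, kernel_size=1), torch.nn.Sigmoid()
)
refine_branch = torch.nn.Conv2d(64, 2, kernel_size=1)  # real and imaginary channels

mask = mask_branch(features).squeeze(1)                # (B, F, T) in [0, 1]
res = refine_branch(features)                          # (B, 2, F, T)
residual = torch.complex(res[:, 0], res[:, 1])         # complex refinement term

# Masking scales the magnitude while keeping the noisy phase; the additive
# residual refines the magnitude further and implicitly adjusts the phase.
enhanced = mask * noisy + residual
```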
A Review of Deep Learning Techniques for Speech Processing
The field of speech processing has undergone a transformative shift with the
advent of deep learning. The use of multiple processing layers has enabled the
creation of models capable of extracting intricate features from speech data.
This development has paved the way for unparalleled advancements in automatic
speech recognition, text-to-speech synthesis, and
emotion recognition, propelling the performance of these tasks to unprecedented
heights. The power of deep learning techniques has opened up new avenues for
research and innovation in the field of speech processing, with far-reaching
implications for a range of industries and applications. This review paper
provides a comprehensive overview of the key deep learning models and their
applications in speech-processing tasks. We begin by tracing the evolution of
speech processing research, from early approaches, such as MFCC and HMM, to
more recent advances in deep learning architectures, such as CNNs, RNNs,
transformers, conformers, and diffusion models. We categorize the approaches
and compare their strengths and weaknesses for solving speech-processing tasks.
Furthermore, we extensively cover various speech-processing tasks, datasets,
and benchmarks used in the literature and describe how different deep-learning
networks have been utilized to tackle these tasks. Additionally, we discuss the
challenges and future directions of deep learning in speech processing,
including the need for more parameter-efficient, interpretable models and the
potential of deep learning for multimodal speech processing. By examining the
field's evolution, comparing and contrasting different approaches, and
highlighting future directions and challenges, we hope to inspire further
research in this exciting and rapidly advancing field.
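As a pointer to the "early approaches" the review starts from, a short example extracting MFCCs with librosa, in contrast with the learned representations that followed; the file name and parameters are placeholders.

```python
# Hand-crafted MFCC features, the classic front end for HMM-era systems.
import librosa

y, sr = librosa.load("speech.wav", sr=16000)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)  # (13, frames)
```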
Complex Neural Networks for Audio
Audio is represented in two mathematically equivalent ways: the real-valued time domain (i.e., waveform) and the complex-valued frequency domain (i.e., spectrum). There are advantages to the frequency-domain representation; for example, the human auditory system is known to process sound in the frequency domain, and linear time-invariant systems, which are convolved with sources in the time domain, may instead be factorized in the frequency domain. Neural networks have become rather useful when applied to audio tasks such as machine listening and audio synthesis, which are related by their dependence on high-quality acoustic models. Such models should ideally encapsulate fine-scale temporal structure, such as that encoded in the phase of frequency-domain audio, yet there are no authoritative deep learning methods for complex-valued audio. This manuscript is dedicated to addressing that shortcoming.

Chapter 2 motivates complex networks by their affinity with complex-domain audio, while Chapter 3 contributes methods for building and optimizing complex networks. We show that the naive implementation of Adam optimization is incorrect for complex random variables, and that the selection of input and output representation has a significant impact on the performance of a complex network.

Experimental results with novel complex neural architectures are provided in the second half of this manuscript. Chapter 4 introduces a complex model for binaural audio source localization. We show that, like humans, the complex model can generalize to different anatomical filters, which is important in the context of machine listening. The complex model's performance is better than that of real-valued models as well as real- and complex-valued baselines. Chapter 5 proposes a two-stage method for speech enhancement. In the first stage, a complex-valued stochastic autoencoder projects complex vectors to a discrete space. In the second stage, long-term temporal dependencies are modeled in the discrete space. The autoencoder raises the performance ceiling for state-of-the-art speech enhancement, but the dynamic enhancement model does not outperform other baselines. We discuss areas for improvement and note that the complex Adam optimizer improves training convergence over the naive implementation.
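The optimizer point is easy to state in code: for a complex gradient g, a naive Adam second moment would accumulate g*g, which is itself complex, whereas the corrected variant accumulates the squared magnitude |g|^2. Below is a single hand-rolled update step illustrating the corrected accumulation, with made-up hyperparameters and bias correction omitted for brevity.

```python
# One manual Adam step on a complex parameter (toy values, no bias correction).
import torch

lr, b1, b2, eps = 1e-3, 0.9, 0.999, 1e-8
w = torch.randn(4, dtype=torch.complex64)   # complex parameter
g = torch.randn(4, dtype=torch.complex64)   # its gradient

m = torch.zeros_like(w)                     # first moment: stays complex
v = torch.zeros(4)                          # second moment: real by construction

m = b1 * m + (1 - b1) * g
v = b2 * v + (1 - b2) * g.abs() ** 2        # |g|^2, not the complex g*g
w = w - lr * m / (v.sqrt() + eps)
```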
Analysis, Disentanglement, and Conversion of Singing Voice Attributes
Voice conversion is a prominent area of research, typically described as the replacement of acoustic cues that relate to the perceived identity of the voice. Over almost a decade, deep learning has emerged as a transformative solution for this multifaceted task, offering various advancements to address different conditions and challenges in the field. One intriguing avenue for researchers in the field of Music Information Retrieval is singing voice conversion, a task that has only been subjected to neural network analysis and synthesis techniques over the last four years. The conversion of various singing voice attributes introduces new considerations, including working with limited datasets, adhering to musical context restrictions, and considering how expression in singing is manifested in such attributes. Voice conversion with respect to singing techniques, for example, has received little attention even though its impact on the music industry would be considerable. This thesis therefore delves into problems related to vocal perception, limited datasets, and attribute disentanglement in the pursuit of optimal performance for the conversion of attributes that are scarcely labelled, covered across three research chapters.

The first of these chapters describes the collection of perceptual pairwise dissimilarity ratings for singing techniques from participants. These were subsequently subjected to clustering algorithms and compared against existing ground-truth labels. The results confirm the viability of using existing singing technique-labelled datasets for singing technique conversion (STC) using supervised machine learning strategies. A dataset of dissimilarity ratings and timbral maps was generated, illustrating how register and gender conditions affect perception.

In response to these findings, an adapted version of an existing voice conversion system, in conjunction with an existing labelled dataset, was developed. This served as the first implementation of a model for zero-shot STC, although it exhibited varying levels of success. An alternative method of attribute conversion was therefore considered as a means towards performing satisfactorily realistic STC. By refining ‘voice identity’ conversion for singing, future research can be conducted in which this attribute, along with more deterministic attributes (such as pitch, loudness, and phonetics), can be disentangled from an input signal, exposing information related to unlabelled attributes. Final experiments in refining the task of voice identity conversion for the singing domain were conducted as a stepping stone towards unlabelled attribute conversion. By performing comparative analyses between different features, singing and speech domains, and alternative loss functions, the most suitable process for singing voice attribute conversion (SVAC) could be established.
In summary, this thesis documents a series of experiments that explore different aspects of the singing voice and conversion techniques in the pursuit of devising a convincing SVAC system.
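The analysis pattern of the first research chapter, clustering a precomputed pairwise dissimilarity matrix and scoring the clusters against existing ground-truth technique labels, can be sketched with scikit-learn; the matrix and labels below are random stand-ins, not data from the thesis.

```python
# Cluster a pairwise dissimilarity matrix and compare to ground-truth labels.
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)
d = rng.random((20, 20))
dissim = (d + d.T) / 2                      # symmetric dissimilarity matrix
np.fill_diagonal(dissim, 0.0)
truth = rng.integers(0, 4, size=20)         # stand-in technique labels

clusters = AgglomerativeClustering(
    n_clusters=4, metric="precomputed", linkage="average"
).fit_predict(dissim)

print(adjusted_rand_score(truth, clusters))  # agreement with ground truth
```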