Ultrasound based Silent Speech Interface using Deep Learning
Silent Speech Interface (SSI) is a technology able to synthesize speech in the absence of any acoustic signal. It can be useful in cases such as laryngectomy patients, noisy environments, or silent calls. This thesis explores the particular case of SSI using ultrasound images of the tongue as input signals. A 'direct synthesis' approach based on Deep Neural Networks and Mel-generalized cepstral coefficients is proposed. This document is an extension of Csapó et al., "DNN-based Ultrasound-to-Speech Conversion for a Silent Speech Interface". Several deep learning models, such as basic Feed-forward Neural Networks, Convolutional Neural Networks, and Recurrent Neural Networks, are presented and discussed. A denoising pre-processing step based on a Deep Convolutional Autoencoder has also been studied. A considerable number of experiments using a set of different deep learning architectures, together with an extensive hyperparameter optimization study, have been carried out. The experiments have been evaluated and rated using several objective and subjective quality measures. According to the experiments, an architecture based on a CNN and bidirectional LSTM layers has shown the best results in both objective and subjective terms.
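For readers who want a concrete picture of the winning architecture, the following is a minimal sketch of a CNN followed by a bidirectional LSTM that maps a sequence of ultrasound frames to Mel-generalized cepstral vectors. The frame resolution (64x128), the number of coefficients (25), and all layer sizes are placeholder assumptions for illustration, not the configuration used in the thesis.

```python
# Minimal sketch (assumed dimensions): ultrasound frame sequences of size
# 64x128 are mapped to 25 Mel-generalized cepstral coefficients per frame by
# a convolutional front-end followed by a bidirectional LSTM.
import torch
import torch.nn as nn

class UltrasoundToMGC(nn.Module):
    def __init__(self, n_mgc=25, hidden=128):
        super().__init__()
        # Per-frame convolutional feature extractor.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 4)),
        )
        # Bidirectional LSTM models temporal context across frames.
        self.lstm = nn.LSTM(32 * 4 * 4, hidden, batch_first=True,
                            bidirectional=True)
        self.head = nn.Linear(2 * hidden, n_mgc)

    def forward(self, x):                              # x: (batch, time, 1, H, W)
        b, t = x.shape[:2]
        feats = self.cnn(x.flatten(0, 1)).flatten(1)   # (batch*time, 512)
        out, _ = self.lstm(feats.view(b, t, -1))
        return self.head(out)                          # (batch, time, n_mgc)

model = UltrasoundToMGC()
dummy = torch.randn(2, 10, 1, 64, 128)   # 2 clips, 10 ultrasound frames each
print(model(dummy).shape)                # torch.Size([2, 10, 25])
```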
Diffusion-Based Audio Inpainting
Audio inpainting aims to reconstruct missing segments in corrupted
recordings. Previous methods produce plausible reconstructions when the gap
length is shorter than about 100 ms, but the quality decreases for longer
gaps. This paper explores recent advancements in deep learning and,
particularly, diffusion models, for the task of audio inpainting. The proposed
method uses an unconditionally trained generative model, which can be
conditioned in a zero-shot fashion for audio inpainting, offering high
flexibility to regenerate gaps of arbitrary length. An improved deep neural
network architecture based on the constant-Q transform, which allows the model
to exploit pitch-equivariant symmetries in audio, is also presented. The
performance of the proposed algorithm is evaluated through objective and
subjective metrics for the task of reconstructing short to mid-sized gaps. The
results of a formal listening test show that the proposed method delivers a
comparable performance against state-of-the-art for short gaps, while retaining
a good audio quality and outperforming the baselines for the longest gap
lengths tested, 150 ms and 200 ms. This work helps improve the restoration of
sound recordings having fairly long local disturbances or dropouts, which must
be reconstructed.
Comment: Submitted for publication to the Journal of Audio Engineering Society on January 30th, 202
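As a rough illustration of how an unconditionally trained diffusion model can be conditioned in a zero-shot fashion for inpainting, the sketch below overwrites the known samples with a re-noised copy of the observation at every reverse step, so that only the gap is generated. The denoiser, the noise schedule, and the sampler are simplified stand-ins, not the paper's actual model or sampling algorithm.

```python
# Illustrative zero-shot inpainting loop: the known samples are overwritten
# with a re-noised copy of the observation at each reverse diffusion step.
# `denoise` and the linear sigma schedule are simplified placeholders.
import numpy as np

def denoise(x, sigma):
    # Placeholder: a trained model would estimate the clean signal from x.
    return x / (1.0 + sigma**2)

def inpaint(y, mask, n_steps=50, sigma_max=1.0):
    """y: observed signal (zeros inside the gap); mask: 1 where y is known."""
    sigmas = np.linspace(sigma_max, 0.0, n_steps + 1)
    x = np.random.randn(*y.shape) * sigma_max
    for i in range(n_steps):
        s, s_next = sigmas[i], sigmas[i + 1]
        x0_hat = denoise(x, s)                      # model's clean estimate
        x = x0_hat + (s_next / s) * (x - x0_hat)    # move to lower noise level
        # Data consistency: keep known samples, re-noised to the current level.
        x = mask * (y + s_next * np.random.randn(*y.shape)) + (1 - mask) * x
    return x

y = np.random.randn(16000)
mask = np.ones_like(y)
mask[6000:7600] = 0.0                               # a 100 ms gap at 16 kHz
restored = inpaint(y * mask, mask)
```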
Zero-Shot Blind Audio Bandwidth Extension
Audio bandwidth extension involves the realistic reconstruction of
high-frequency spectra from bandlimited observations. In cases where the
lowpass degradation is unknown, such as in restoring historical audio
recordings, this becomes a blind problem. This paper introduces a novel method
called BABE (Blind Audio Bandwidth Extension) that addresses the blind problem
in a zero-shot setting, leveraging the generative priors of a pre-trained
unconditional diffusion model. During the inference process, BABE utilizes a
generalized version of diffusion posterior sampling, where the degradation
operator is unknown but parametrized and inferred iteratively. The performance
of the proposed method is evaluated using objective and subjective metrics, and
the results show that BABE surpasses state-of-the-art blind bandwidth extension
baselines and achieves competitive performance compared to non-blind
filter-informed methods when tested with synthetic data. Moreover, BABE
exhibits robust generalization capabilities when enhancing real historical
recordings, effectively reconstructing the missing high-frequency content while
maintaining coherence with the original recording. Subjective preference tests
confirm that BABE significantly improves the audio quality of historical music
recordings. Examples of historical recordings restored with the proposed method
are available on the companion webpage:
(http://research.spa.aalto.fi/publications/papers/ieee-taslp-babe/)
Comment: Submitted to IEEE/ACM Transactions on Audio, Speech and Language Processing
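The core blind mechanism can be pictured as follows: the degradation operator is parametrized (here, as an assumption, by a single soft lowpass cutoff) and its parameter is refined by gradient steps on a measurement-consistency loss, while the same loss also guides the signal estimate through the reverse diffusion. The denoiser, step sizes, and filter shape below are illustrative placeholders, not the BABE implementation.

```python
# Hedged sketch of blind posterior sampling: an assumed soft-lowpass operator
# with one free parameter (log cutoff) is refined by gradient descent on a
# measurement-consistency loss, while the same loss gradient also guides the
# signal estimate. `denoise` and the step sizes are placeholders.
import torch

def denoise(x, sigma):
    return x / (1.0 + sigma**2)              # stand-in for the diffusion prior

def lowpass(x, log_fc, sr=16000):
    X = torch.fft.rfft(x)
    f = torch.fft.rfftfreq(x.shape[-1], 1.0 / sr)
    return torch.fft.irfft(X * torch.sigmoid((log_fc.exp() - f) / 100.0),
                           n=x.shape[-1])

def blind_bwe(y, n_steps=50, sigma_max=1.0):
    sigmas = torch.linspace(sigma_max, 0.0, n_steps + 1)
    x = torch.randn_like(y) * sigma_max
    log_fc = torch.tensor(7.6, requires_grad=True)   # ~2 kHz initial cutoff
    opt = torch.optim.Adam([log_fc], lr=1e-2)
    for i in range(n_steps):
        s, s_next = sigmas[i], sigmas[i + 1]
        xg = x.detach().requires_grad_(True)
        x0_hat = denoise(xg, s)
        # Consistency: the filtered clean estimate should reproduce y.
        loss = torch.mean((lowpass(x0_hat, log_fc) - y) ** 2)
        grad_x, = torch.autograd.grad(loss, xg, retain_graph=True)
        opt.zero_grad(); loss.backward(); opt.step()   # refine the filter
        x0_hat = x0_hat.detach() - 10.0 * grad_x       # guide the estimate
        x = x0_hat + (s_next / s) * (x.detach() - x0_hat)
    return x, log_fc.exp().item()

y = lowpass(torch.randn(16000), torch.tensor(3000.0).log())  # synthetic input
x_hat, fc_estimate = blind_bwe(y)
```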
Solving Audio Inverse Problems with a Diffusion Model
This paper presents CQT-Diff, a data-driven generative audio model that can,
once trained, be used for solving various different audio inverse problems in a
problem-agnostic setting. CQT-Diff is a neural diffusion model with an
architecture that is carefully constructed to exploit pitch-equivariant
symmetries in music. This is achieved by preconditioning the model with an
invertible Constant-Q Transform (CQT), whose logarithmically-spaced frequency
axis represents pitch equivariance as translation equivariance. The proposed
method is evaluated with objective and subjective metrics in three different
and varied tasks: audio bandwidth extension, inpainting, and declipping. The
results show that CQT-Diff outperforms the compared baselines and ablations in
audio bandwidth extension and, without retraining, delivers competitive
performance against modern baselines in audio inpainting and declipping. This
work represents the first diffusion-based general framework for solving inverse
problems in audio processing.
Comment: Submitted to ICASSP 202
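The pitch-equivariance argument rests on the logarithmic spacing of the CQT frequency axis: shifting the representation by k bins multiplies every center frequency by the same factor, so a pitch shift becomes a translation. A small numerical check of this property, with an assumed 12 bins per octave, is given below.

```python
# Numerical check: on a log-spaced (CQT-like) frequency axis, a shift of k
# bins corresponds to multiplying every center frequency by 2**(k / B), i.e.
# a pitch shift becomes a translation. B = 12 bins per octave is an assumption.
import numpy as np

B = 12                                   # bins per octave (assumed)
f_min = 32.70                            # lowest center frequency (C1, Hz)
bins = np.arange(8 * B)                  # eight octaves of bins
f_center = f_min * 2.0 ** (bins / B)     # log-spaced center frequencies

k = 7                                    # translate by 7 bins
ratio = f_center[k:] / f_center[:-k]
print(np.allclose(ratio, 2.0 ** (k / B)))   # True: one uniform shift ratio
```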
A Diffusion-Based Generative Equalizer for Music Restoration
This paper presents a novel approach to audio restoration, focusing on the
enhancement of low-quality music recordings, and in particular historical ones.
Building upon a previous algorithm called BABE, or Blind Audio Bandwidth
Extension, we introduce BABE-2, which presents a series of significant
improvements. This research broadens the concept of bandwidth extension to
\emph{generative equalization}, a novel task that, to the best of our
knowledge, has not been explicitly addressed in previous studies. BABE-2 is
built around an optimization algorithm utilizing priors from diffusion models,
which are trained or fine-tuned using a curated set of high-quality music
tracks. The algorithm simultaneously performs two critical tasks: estimation of
the filter degradation magnitude response and hallucination of the restored
audio. The proposed method is objectively evaluated on historical piano
recordings, showing a marked enhancement over the prior version. The method
yields similarly impressive results in rejuvenating the works of renowned
vocalists Enrico Caruso and Nellie Melba. This research represents an
advancement in the practical restoration of historical music.
Comment: Submitted to DAFx24. Historical music restoration examples are available at: http://research.spa.aalto.fi/publications/papers/dafx-babe2
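To make the step from bandwidth extension to generative equalization concrete, the sketch below parametrizes the unknown degradation as a coarse magnitude response, per-band gains interpolated across FFT bins, rather than a single lowpass cutoff. The band centers and gain values are invented for illustration and are not those estimated by BABE-2.

```python
# Illustrative "generative EQ" degradation model: per-band gains (in dB) are
# interpolated across FFT bins and applied to the signal spectrum. Band
# centers and gains are invented for this example.
import numpy as np

def apply_band_gains(x, gains_db, band_centers_hz, sr=16000):
    X = np.fft.rfft(x)
    f = np.fft.rfftfreq(len(x), 1.0 / sr)
    gains = np.interp(f, band_centers_hz, gains_db)       # smooth response
    return np.fft.irfft(X * 10.0 ** (gains / 20.0), n=len(x))

x = np.random.randn(16000)
band_centers_hz = [0, 250, 1000, 4000, 8000]       # assumed band centers
gains_db = [-3.0, 0.0, -6.0, -24.0, -60.0]         # assumed degradation shape
degraded = apply_band_gains(x, gains_db, band_centers_hz)
```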
Neural modeling of magnetic tape recorders
The sound of magnetic recording media, such as open-reel and cassette tape recorders, is still sought after by today's sound practitioners due to the imperfections embedded in the physics of the magnetic recording process. This paper proposes a method for digitally emulating this character using neural networks. The signal chain of the proposed system consists of three main components: the hysteretic nonlinearity and filtering jointly produced by the magnetic recording process as well as the record and playback amplifiers, the fluctuating delay originating from the tape transport, and the combined additive noise component from various electromagnetic origins. In our approach, the hysteretic nonlinear block is modeled using a recurrent neural network, while the delay trajectories and the noise component are generated using separate diffusion models, which employ U-net deep convolutional neural networks. According to the conducted objective evaluation, the proposed architecture faithfully captures the character of the magnetic tape recorder. The results of this study can be used to construct virtual replicas of vintage sound recording devices with applications in music production and audio antiquing tasks.
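A minimal sketch of the three-block signal chain described above is given below: a small recurrent network stands in for the hysteretic nonlinearity and filtering, a time-varying fractional delay emulates the transport-induced wow and flutter, and a noise component is added. In this sketch the delay trajectory and the noise are simple hand-made random or periodic processes, whereas in the paper they are generated by diffusion models.

```python
# Three-block sketch: recurrent stand-in for the hysteretic nonlinearity,
# time-varying fractional delay for wow/flutter, and additive noise. The
# flutter trajectory and the hiss are hand-made here, not diffusion outputs.
import numpy as np
import torch
import torch.nn as nn

class HysteresisRNN(nn.Module):
    def __init__(self, hidden=16):
        super().__init__()
        self.rnn = nn.GRU(1, hidden, batch_first=True)
        self.out = nn.Linear(hidden, 1)

    def forward(self, x):                      # x: (batch, samples, 1)
        h, _ = self.rnn(x)
        return self.out(h)

def fractional_delay(x, delay):                # linear-interpolation read-out
    idx = np.clip(np.arange(len(x)) - delay, 0, len(x) - 1)
    lo = np.floor(idx).astype(int)
    hi = np.minimum(lo + 1, len(x) - 1)
    frac = idx - lo
    return (1 - frac) * x[lo] + frac * x[hi]

sr, n = 44100, 44100
x = np.sin(2 * np.pi * 440 * np.arange(n) / sr).astype(np.float32)

with torch.no_grad():                          # untrained net, shapes only
    y = HysteresisRNN()(torch.from_numpy(x)[None, :, None])[0, :, 0].numpy()

flutter = 5 + 2 * np.sin(2 * np.pi * 1.5 * np.arange(n) / sr)  # delay (samples)
y = fractional_delay(y, flutter)
y = y + 1e-3 * np.random.randn(n)              # additive "tape hiss"
```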
Noise morphing for audio time stretching
This letter introduces a method to enhance the quality of audio time stretching by precisely decomposing a sound into sines, transients, and noise, and by improving the processing of the noise component. While there are established methods for time-stretching sines and transients with high quality, the manipulation of the noise or residual component has lacked robust solutions in prior research. The proposed method combines sound decomposition with previous techniques for audio spectral resynthesis. The time-stretched noise component is obtained by morphing its time-interpolated spectral magnitude with a white-noise excitation signal. The method stands out for its simplicity, efficiency, and audio quality. The results of a subjective experiment affirm the superiority of this approach over current state-of-the-art methods across all evaluated stretch factors. The proposed technique excels in particular in extreme stretching scenarios, representing a substantial improvement in performance. The method holds promise for a wide range of applications in slow-motion media content, such as music or sports video production.
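The noise-morphing step can be sketched as follows: the STFT magnitude of the noise component is time-interpolated to the stretched length and imposed on a white-noise excitation. The STFT settings and stretch factor below are arbitrary choices for illustration; the decomposition into sines, transients, and noise is assumed to have been done already.

```python
# Sketch of the noise-morphing step: time-interpolate the noise STFT magnitude
# to the stretched length and impose it on a white-noise excitation spectrum.
# STFT settings and the stretch factor are arbitrary illustrative choices.
import numpy as np
from scipy.signal import stft, istft

def stretch_noise(noise, factor, sr=44100, nperseg=1024):
    _, _, N = stft(noise, fs=sr, nperseg=nperseg)
    mag = np.abs(N)                                       # (freq, frames)
    t_in = np.arange(mag.shape[1])
    t_out = np.linspace(0, mag.shape[1] - 1, int(mag.shape[1] * factor))
    # Interpolate each bin's magnitude envelope to the stretched time axis.
    mag_s = np.stack([np.interp(t_out, t_in, m) for m in mag])

    excitation = np.random.randn(int(len(noise) * factor))
    _, _, W = stft(excitation, fs=sr, nperseg=nperseg)
    n_frames = min(W.shape[1], mag_s.shape[1])
    W, mag_s = W[:, :n_frames], mag_s[:, :n_frames]
    # Whiten the excitation spectrum, then impose the stretched magnitude.
    _, y = istft(W / (np.abs(W) + 1e-12) * mag_s, fs=sr, nperseg=nperseg)
    return y

noise = np.random.randn(44100)
stretched = stretch_noise(noise, factor=2.0)
```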
BUDDy: Single-Channel Blind Unsupervised Dereverberation with Diffusion Models
In this paper, we present an unsupervised single-channel method for joint
blind dereverberation and room impulse response estimation, based on posterior
sampling with diffusion models. We parameterize the reverberation operator
using a filter with exponential decay for each frequency subband, and
iteratively estimate the corresponding parameters as the speech utterance gets
refined along the reverse diffusion trajectory. A measurement consistency
criterion enforces the fidelity of the generated speech with the reverberant
measurement, while an unconditional diffusion model implements a strong prior
for clean speech generation. Without any knowledge of the room impulse response
nor any coupled reverberant-anechoic data, we can successfully perform
dereverberation in various acoustic scenarios. Our method significantly
outperforms previous blind unsupervised baselines, and we demonstrate its
increased robustness to unseen acoustic conditions in comparison to blind
supervised methods. Audio samples and code are available online.
Comment: Submitted to IWAENC 202
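A simplified picture of the reverberation operator is a per-subband convolution along STFT time frames with an exponentially decaying kernel whose decay rate is a free parameter. The sketch below uses real-valued kernels, a fixed number of taps, and hand-picked decay values; these are simplifications for illustration, not the exact BUDDy parameterization.

```python
# Simplified subband reverberation operator: each STFT band is convolved along
# time with an exponentially decaying kernel. Tap count, decay values, and the
# real-valued kernels are illustrative simplifications.
import numpy as np
from scipy.signal import stft, istft

def reverberate(x, decays, n_taps=32, sr=16000, nperseg=512):
    _, _, X = stft(x, fs=sr, nperseg=nperseg)            # (freq, frames)
    taps = np.arange(n_taps)
    # One exponentially decaying kernel per frequency subband.
    kernels = np.exp(-taps[None, :] / decays[:, None])   # (freq, n_taps)
    Y = np.stack([np.convolve(X[k], kernels[k])[:X.shape[1]]
                  for k in range(X.shape[0])])
    _, y = istft(Y, fs=sr, nperseg=nperseg)
    return y

x = np.random.randn(16000)
n_bands = 512 // 2 + 1
decays = np.linspace(8.0, 2.0, n_bands)    # longer decay at low frequencies
wet = reverberate(x, decays)
```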
Development and evaluation of virtual bass systems for audio enhancement in small loudspeakers
Due to volume limitations, small loudspeakers cannot properly reproduce the low frequencies that make up the bass and percussive components in music. A virtual bass system tricks the human auditory system into creating the impression of bass, even though the bass frequencies are highly attenuated or not even physically reproduced. These methods exploit the missing-fundamental effect, psychoacoustically extending the low-frequency bandwidth of the signal by adding higher harmonics. It is still a challenge for a virtual bass system to induce the desired effect without deteriorating the audio quality. In the literature, virtual bass systems are implemented either in the time domain, with nonlinear processing, or in the frequency domain, using a phase vocoder algorithm. Hybrid systems separate the original signal into transient and steady-state sounds and process them separately, combining the strengths of both time- and frequency-domain techniques. This thesis proposes a novel hybrid method based on the fuzzy separation of transients, tones, and noisy components. It introduces an improved phase-vocoder-based methodology, which aims to preserve the original timbre of the signal as much as possible. A listening test shows that the proposed method outperforms selected previous algorithms. Moreover, another subjective experiment indicates that, by processing just three harmonics, the algorithm can effectively enhance the bass perception of small loudspeakers without significantly altering the audio quality.
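The time-domain (nonlinear-processing) branch that such systems build on can be sketched as follows: isolate the bass band, generate upper harmonics with a nonlinearity, band-limit them to the range the small loudspeaker can reproduce, and mix them with the highpassed input. The cutoff frequencies, gain, and the full-wave rectifier below are illustrative choices, not the parameters of the proposed hybrid method.

```python
# Time-domain virtual bass sketch: lowpass-isolated bass drives a nonlinearity
# whose harmonics are bandpassed into the reproducible range and mixed with
# the highpassed input. Cutoffs, gain, and the rectifier are illustrative.
import numpy as np
from scipy.signal import butter, sosfilt

def virtual_bass(x, sr=44100, f_cut=150.0, f_hi=1200.0, gain=0.5):
    lp = butter(4, f_cut, btype='lowpass', fs=sr, output='sos')
    hp = butter(4, f_cut, btype='highpass', fs=sr, output='sos')
    bp = butter(4, [f_cut, f_hi], btype='bandpass', fs=sr, output='sos')

    bass = sosfilt(lp, x)                     # band the speaker cannot play
    harmonics = np.abs(bass)                  # full-wave rectifier: even harmonics
    harmonics = sosfilt(bp, harmonics)        # keep only reproducible harmonics
    return sosfilt(hp, x) + gain * harmonics  # psychoacoustic bass substitute

x = np.random.randn(44100)
out = virtual_bass(x)
```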