17 research outputs found

    Ultrasound based Silent Speech Interface using Deep Learning

    Silent Speech Interface (SSI) is a technology able to synthesize speech in the absence of any acoustic signal. It can be useful for laryngectomy patients, in noisy environments, or for silent calls. This thesis explores the particular case of SSI using ultrasound images of the tongue as input signals. A 'direct synthesis' approach based on Deep Neural Networks and Mel-generalized cepstral coefficients is proposed. This document is an extension of Csapó et al., "DNN-based Ultrasound-to-Speech Conversion for a Silent Speech Interface". Several deep learning models, such as basic feed-forward neural networks, convolutional neural networks, and recurrent neural networks, are presented and discussed. A denoising pre-processing step based on a deep convolutional autoencoder has also been studied. A considerable number of experiments using a set of different deep learning architectures, together with an extensive hyperparameter optimization study, have been carried out. The experiments have been evaluated and rated using several objective and subjective quality measures. According to the experiments, an architecture based on a CNN and bidirectional LSTM layers has shown the best results in both objective and subjective terms.

    Diffusion-Based Audio Inpainting

    Audio inpainting aims to reconstruct missing segments in corrupted recordings. Previous methods produce plausible reconstructions when the gap length is shorter than about 100 ms, but the quality decreases for longer gaps. This paper explores recent advancements in deep learning and, particularly, diffusion models, for the task of audio inpainting. The proposed method uses an unconditionally trained generative model, which can be conditioned in a zero-shot fashion for audio inpainting, offering high flexibility to regenerate gaps of arbitrary length. An improved deep neural network architecture based on the constant-Q transform, which allows the model to exploit pitch-equivariant symmetries in audio, is also presented. The performance of the proposed algorithm is evaluated through objective and subjective metrics for the task of reconstructing short to mid-sized gaps. The results of a formal listening test show that the proposed method delivers performance comparable to the state of the art for short gaps, while retaining good audio quality and outperforming the baselines for the longest gap lengths tested, 150 ms and 200 ms. This work helps improve the restoration of sound recordings that have fairly long local disturbances or dropouts, which must be reconstructed.
    Comment: Submitted for publication to the Journal of Audio Engineering Society on January 30th, 202

    Zero-Shot Blind Audio Bandwidth Extension

    Audio bandwidth extension involves the realistic reconstruction of high-frequency spectra from bandlimited observations. In cases where the lowpass degradation is unknown, such as in restoring historical audio recordings, this becomes a blind problem. This paper introduces a novel method called BABE (Blind Audio Bandwidth Extension) that addresses the blind problem in a zero-shot setting, leveraging the generative priors of a pre-trained unconditional diffusion model. During the inference process, BABE utilizes a generalized version of diffusion posterior sampling, where the degradation operator is unknown but parametrized and inferred iteratively. The performance of the proposed method is evaluated using objective and subjective metrics, and the results show that BABE surpasses state-of-the-art blind bandwidth extension baselines and achieves competitive performance compared to non-blind filter-informed methods when tested with synthetic data. Moreover, BABE exhibits robust generalization capabilities when enhancing real historical recordings, effectively reconstructing the missing high-frequency content while maintaining coherence with the original recording. Subjective preference tests confirm that BABE significantly improves the audio quality of historical music recordings. Examples of historical recordings restored with the proposed method are available on the companion webpage: http://research.spa.aalto.fi/publications/papers/ieee-taslp-babe/
    Comment: Submitted to IEEE/ACM Transactions on Audio, Speech and Language Processing

    Solving Audio Inverse Problems with a Diffusion Model

    This paper presents CQT-Diff, a data-driven generative audio model that can, once trained, be used for solving various audio inverse problems in a problem-agnostic setting. CQT-Diff is a neural diffusion model with an architecture that is carefully constructed to exploit pitch-equivariant symmetries in music. This is achieved by preconditioning the model with an invertible Constant-Q Transform (CQT), whose logarithmically-spaced frequency axis represents pitch equivariance as translation equivariance. The proposed method is evaluated with objective and subjective metrics in three varied tasks: audio bandwidth extension, inpainting, and declipping. The results show that CQT-Diff outperforms the compared baselines and ablations in audio bandwidth extension and, without retraining, delivers competitive performance against modern baselines in audio inpainting and declipping. This work represents the first diffusion-based general framework for solving inverse problems in audio processing.
    Comment: Submitted to ICASSP 202
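    The statement that a logarithmically spaced frequency axis turns pitch shifts into translations can be checked with a few lines of arithmetic. The sketch below is only an illustration of that property, not code from the paper; the helper names `cqt_bin` and `harmonic_bins`, the lowest frequency `f_min`, and the resolution of 12 bins per octave are assumptions chosen for the example.

```python
import math

def cqt_bin(freq_hz, f_min=32.70, bins_per_octave=12):
    """Fractional bin index of a frequency on a log-spaced (constant-Q) axis."""
    return bins_per_octave * math.log2(freq_hz / f_min)

def harmonic_bins(f0_hz, n_harmonics=4):
    """Bin positions of the first few harmonics of a tone with fundamental f0_hz."""
    return [cqt_bin(k * f0_hz) for k in range(1, n_harmonics + 1)]

# A harmonic tone and the same tone transposed up by three semitones.
original = harmonic_bins(220.0)
transposed = harmonic_bins(220.0 * 2 ** (3 / 12))

# Every harmonic moves by the same number of bins, so the whole pattern
# is translated along the frequency axis without changing shape.
offsets = [b - a for a, b in zip(original, transposed)]
print(offsets)
```

    On a linear frequency axis the same transposition would stretch the spacing between harmonics, which is why a translation-equivariant architecture such as a convolutional network benefits from the log-spaced representation.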

    A Diffusion-Based Generative Equalizer for Music Restoration

    This paper presents a novel approach to audio restoration, focusing on the enhancement of low-quality music recordings, and in particular historical ones. Building upon a previous algorithm called BABE, or Blind Audio Bandwidth Extension, we introduce BABE-2, which presents a series of significant improvements. This research broadens the concept of bandwidth extension to generative equalization, a novel task that, to the best of our knowledge, has not been explicitly addressed in previous studies. BABE-2 is built around an optimization algorithm utilizing priors from diffusion models, which are trained or fine-tuned using a curated set of high-quality music tracks. The algorithm simultaneously performs two critical tasks: estimation of the filter degradation magnitude response and hallucination of the restored audio. The proposed method is objectively evaluated on historical piano recordings, showing a marked enhancement over the prior version. The method yields similarly impressive results in rejuvenating the works of renowned vocalists Enrico Caruso and Nellie Melba. This research represents an advancement in the practical restoration of historical music.
    Comment: Submitted to DAFx24. Historical music restoration examples are available at: http://research.spa.aalto.fi/publications/papers/dafx-babe2

    Neural modeling of magnetic tape recorders

    The sound of magnetic recording media, such as open-reel and cassette tape recorders, is still sought after by today's sound practitioners due to the imperfections embedded in the physics of the magnetic recording process. This paper proposes a method for digitally emulating this character using neural networks. The signal chain of the proposed system consists of three main components: the hysteretic nonlinearity and filtering jointly produced by the magnetic recording process as well as the record and playback amplifiers, the fluctuating delay originating from the tape transport, and the combined additive noise component from various electromagnetic origins. In our approach, the hysteretic nonlinear block is modeled using a recurrent neural network, while the delay trajectories and the noise component are generated using separate diffusion models, which employ U-net deep convolutional neural networks. According to the conducted objective evaluation, the proposed architecture faithfully captures the character of the magnetic tape recorder. The results of this study can be used to construct virtual replicas of vintage sound recording devices with applications in music production and audio antiquing tasks

    Noise morphing for audio time stretching

    This letter introduces an innovative method to enhance the quality of audio time stretching by precisely decomposing a sound into sines, transients, and noise and by improving the processing of the latter component. While there are established methods for time-stretching sines and transients with high quality, the manipulation of noise or residual components has lacked robust solutions in prior research. The proposed method combines sound decomposition with previous techniques for audio spectral resynthesis. The time-stretched noise component is achieved by morphing its time-interpolated spectral magnitude with a white-noise excitation signal. This method stands out for its simplicity, efficiency, and audio quality. The results of a subjective experiment affirm the superiority of this approach over current state-of-the-art methods across all evaluated stretch factors. The proposed technique notably excels in extreme stretching scenarios, signifying a substantial elevation in performance. The proposed method holds promise for a wide range of applications in slow-motion media content, such as music or sports video production
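    The noise-morphing step described above (a time-interpolated spectral magnitude imposed on a white-noise excitation) can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: it skips the sines/transients/noise decomposition and treats the whole input as the noise component, and the function names, the Hann-window STFT, the linear interpolation, and the frame sizes are all choices made for this example.

```python
import numpy as np

def stft(x, n_fft=512, hop=128):
    """Hann-windowed STFT, returned as an array of shape (frames, bins)."""
    win = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop:i * hop + n_fft] * win for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)

def istft(X, n_fft=512, hop=128):
    """Weighted overlap-add inverse of stft()."""
    win = np.hanning(n_fft)
    out = np.zeros(n_fft + hop * (X.shape[0] - 1))
    norm = np.zeros_like(out)
    frames = np.fft.irfft(X, n=n_fft, axis=1)
    for i in range(X.shape[0]):
        out[i * hop:i * hop + n_fft] += frames[i] * win
        norm[i * hop:i * hop + n_fft] += win ** 2
    return out / np.maximum(norm, 1e-8)

def stretch_noise(noise, factor, n_fft=512, hop=128, seed=0):
    """Stretch a noise signal by imposing its interpolated magnitude on white noise."""
    mag = np.abs(stft(noise, n_fft, hop))
    n_in = mag.shape[0]
    n_out = int(round(n_in * factor))
    # Linearly interpolate the magnitude frames along time.
    pos = np.linspace(0, n_in - 1, n_out)
    lo = np.floor(pos).astype(int)
    hi = np.minimum(lo + 1, n_in - 1)
    frac = (pos - lo)[:, None]
    mag_out = (1 - frac) * mag[lo] + frac * mag[hi]
    # White-noise excitation with exactly n_out frames, scaled to unit RMS.
    rng = np.random.default_rng(seed)
    white = rng.standard_normal(n_fft + hop * (n_out - 1))
    W = stft(white, n_fft, hop)
    W /= np.sqrt(np.mean(np.abs(W) ** 2))
    return istft(mag_out * W, n_fft, hop)

# Usage: stretch one second of synthetic noise (8 kHz) to about twice its length.
noise = np.random.default_rng(1).standard_normal(8000)
stretched = stretch_noise(noise, 2.0)
```

    Because the phase comes from fresh white noise rather than from the stretched input, the result avoids the tonal artifacts that phase-vocoder stretching tends to introduce in noisy material.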

    BUDDy: Single-Channel Blind Unsupervised Dereverberation with Diffusion Models

    In this paper, we present an unsupervised single-channel method for joint blind dereverberation and room impulse response estimation, based on posterior sampling with diffusion models. We parameterize the reverberation operator using a filter with exponential decay for each frequency subband, and iteratively estimate the corresponding parameters as the speech utterance gets refined along the reverse diffusion trajectory. A measurement consistency criterion enforces the fidelity of the generated speech with the reverberant measurement, while an unconditional diffusion model implements a strong prior for clean speech generation. Without any knowledge of the room impulse response or any coupled reverberant-anechoic data, we can successfully perform dereverberation in various acoustic scenarios. Our method significantly outperforms previous blind unsupervised baselines, and we demonstrate its increased robustness to unseen acoustic conditions in comparison to blind supervised methods. Audio samples and code are available online.
    Comment: Submitted to IWAENC 202
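    To make the idea of a per-subband filter with exponential decay concrete, here is a minimal sketch. It is an assumption-laden illustration, not the BUDDy code: the names `decay_envelope` and `apply_subband_filter`, the per-band `weights` and `decay_rates` parameters, and the use of a real-valued envelope without phase are all hypothetical simplifications.

```python
import numpy as np

def decay_envelope(weights, decay_rates, n_taps):
    """Envelope A[b, n] = weights[b] * exp(-decay_rates[b] * n) for each subband b."""
    n = np.arange(n_taps)
    return weights[:, None] * np.exp(-decay_rates[:, None] * n[None, :])

def apply_subband_filter(spec, filt):
    """Convolve each subband of spec (bands, frames) along time with its own filter."""
    bands, frames = spec.shape
    out_dtype = np.result_type(spec.dtype, filt.dtype)
    out = np.zeros((bands, frames + filt.shape[1] - 1), dtype=out_dtype)
    for b in range(bands):
        out[b] = np.convolve(spec[b], filt[b])
    return out

# Usage: two subbands, the second decaying faster than the first.
weights = np.array([1.0, 0.5])
decay_rates = np.array([0.1, 0.3])
filt = decay_envelope(weights, decay_rates, n_taps=4)

# An impulse in every subband simply reproduces the filter taps.
impulse = np.zeros((2, 3))
impulse[:, 0] = 1.0
reverberant = apply_subband_filter(impulse, filt)
```

    In a blind setting, `weights` and `decay_rates` would be the unknowns that are re-estimated at each step while the clean-speech estimate is refined.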

    Development and evaluation of virtual bass systems for audio enhancement in small loudspeakers

    Due to their limited volume, small loudspeakers cannot appropriately reproduce the lower frequencies that make up the bass and percussive components in music. A virtual bass system tricks the human auditory system into creating the impression of bass perception, even though the bass frequencies are highly attenuated, or not even physically reproduced. These methods make use of the missing fundamental effect, psychoacoustically extending the low-frequency bandwidth of the signal by adding higher harmonics. It is still a challenge for a virtual bass system to induce the desired effect without deteriorating the audio quality. In the literature, virtual bass systems are implemented either in the time domain, with non-linear processing, or in the frequency domain, by using a phase vocoder algorithm. Hybrid systems separate the original signal into transient and steady-state sounds and process them separately, combining the strengths of both time and frequency domain techniques. This thesis proposes a novel hybrid method based on the fuzzy separation of transients, tones, and noisy components. It introduces an improved phase-vocoder-based methodology, which aims to preserve the original timbre of the signal as much as possible. A listening test shows that the proposed method outperforms selected previous algorithms. Moreover, another subjective experiment indicates that, by processing just three harmonics, the algorithm can effectively enhance the bass perception of small loudspeakers without significantly altering the audio quality.
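    The time-domain, non-linear route mentioned above can be illustrated with a simple polynomial waveshaper that adds the second and third harmonics to a low tone. This is a generic textbook-style sketch, not the hybrid method proposed in the thesis; the function name `add_harmonics`, the coefficients, the 50 Hz test tone, and the 8 kHz sample rate are assumptions for the example.

```python
import numpy as np

fs = 8000                       # sample rate in Hz (illustrative)
t = np.arange(fs) / fs          # exactly one second, so FFT bins fall on whole Hz
f0 = 50.0                       # bass tone a small loudspeaker would struggle with
bass = np.sin(2 * np.pi * f0 * t)

def add_harmonics(x, a2=0.5, a3=0.5):
    """Polynomial waveshaper: the square term creates 2*f0, the cube term 3*f0."""
    return x + a2 * x ** 2 + a3 * x ** 3

shaped = add_harmonics(bass)

# Amplitude spectrum, scaled so a unit-amplitude sine reads as 1.0.
spectrum = np.abs(np.fft.rfft(shaped)) / (len(shaped) / 2)
print(spectrum[100], spectrum[150])   # components at 100 Hz and 150 Hz
```

    A practical system would then high-pass the result so that only the harmonics the loudspeaker can reproduce remain, leaving the listener to infer the missing fundamental.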
