Comparison of effects on subjective intelligibility and quality of speech in babble for two algorithms: A deep recurrent neural network and spectral subtraction.
The effects on speech intelligibility and sound quality of two noise-reduction algorithms were compared: a deep recurrent neural network (RNN) and spectral subtraction (SS). The RNN was trained using sentences spoken by a large number of talkers with a variety of accents, presented in babble. Different talkers were used for testing. Participants with mild-to-moderate hearing loss were tested. Stimuli were given frequency-dependent linear amplification to compensate for the individual hearing losses. A paired-comparison procedure was used to compare all possible combinations of three conditions. The conditions were: speech in babble with no processing (NP) or processed using the RNN or SS. In each trial, the same sentence was played twice using two different conditions. The participants indicated which one was better and by how much in terms of speech intelligibility and (in separate blocks) sound quality. Processing using the RNN was significantly preferred over NP and over SS processing for both subjective intelligibility and sound quality, although the magnitude of the preferences was small. SS processing was not significantly preferred over NP for either subjective intelligibility or sound quality. Objective computational measures of speech intelligibility predicted better intelligibility for RNN than for SS or NP
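For readers unfamiliar with the spectral subtraction (SS) baseline named above, the following is a minimal sketch of the general idea, not the configuration used in the study. It assumes the signal has already been split into frames of magnitude spectra. The function names and the `alpha` (over-subtraction) and `floor` (spectral floor) parameters are illustrative assumptions.

```python
from typing import List


def estimate_noise(noise_frames: List[List[float]]) -> List[float]:
    """Average the magnitude spectra of noise-only frames, bin by bin."""
    n_bins = len(noise_frames[0])
    return [
        sum(frame[k] for frame in noise_frames) / len(noise_frames)
        for k in range(n_bins)
    ]


def spectral_subtraction(
    noisy_mag: List[List[float]],
    noise_mag: List[float],
    alpha: float = 1.0,   # hypothetical over-subtraction factor
    floor: float = 0.01,  # hypothetical spectral floor (fraction of the noisy bin)
) -> List[List[float]]:
    """Subtract a noise magnitude estimate from every frame (frames x bins).

    Bins that would fall below floor * noisy magnitude are clamped to that
    value instead of becoming zero or negative.
    """
    enhanced = []
    for frame in noisy_mag:
        if len(frame) != len(noise_mag):
            raise ValueError("frame and noise estimate must have the same number of bins")
        enhanced.append(
            [max(y - alpha * n, floor * y) for y, n in zip(frame, noise_mag)]
        )
    return enhanced
```

In a complete system the enhanced magnitudes would be recombined with the noisy phase and converted back to a waveform; that step is omitted here.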
Wind Noise Reduction with a Diffusion-based Stochastic Regeneration Model
In this paper we present a method for single-channel wind noise reduction
using our previously proposed diffusion-based stochastic regeneration model
combining predictive and generative modelling. We introduce a non-additive
speech in noise model to account for the non-linear deformation of the membrane
caused by the wind flow and possible clipping. We show that our stochastic
regeneration model outperforms other neural-network-based wind noise reduction
methods as well as purely predictive and generative models, on a dataset using
simulated and real-recorded wind noise. We further show that the proposed
method generalizes well by testing on an unseen dataset with real-recorded wind
noise. Audio samples, data generation scripts and code for the proposed methods
can be found online (https://uhh.de/inf-sp-storm-wind).
Comment: Submitted to VDE 15th ITG conference on Speech Communication
Single-Microphone Speech Enhancement and Separation Using Deep Learning
The cocktail party problem comprises the challenging task of understanding a
speech signal in a complex acoustic environment, where multiple speakers and
background noise signals simultaneously interfere with the speech signal of
interest. A signal processing algorithm that can effectively increase the
speech intelligibility and quality of speech signals in such complicated
acoustic situations is highly desirable, especially for applications involving
mobile communication devices and hearing assistive devices. Due to the
re-emergence of machine learning techniques, today known as deep learning, the
challenges involved with such algorithms might be overcome. In this PhD thesis,
we study and develop deep learning-based techniques for two sub-disciplines of
the cocktail party problem: single-microphone speech enhancement and
single-microphone multi-talker speech separation. Specifically, we conduct
an in-depth empirical analysis of the generalization capability of modern deep
learning-based single-microphone speech enhancement algorithms. We show that
the performance of such algorithms is closely linked to the training data, and good
generalizability can be achieved with carefully designed training data.
Furthermore, we propose uPIT, a deep learning-based algorithm for
single-microphone speech separation and we report state-of-the-art results on a
speaker-independent multi-talker speech separation task. Additionally, we show
that uPIT works well for joint speech separation and enhancement without
explicit prior knowledge about the noise type or number of speakers. Finally,
we show that deep learning-based speech enhancement algorithms designed to
minimize the classical short-time spectral amplitude mean squared error lead
to enhanced speech signals which are essentially optimal in terms of STOI, a
state-of-the-art speech intelligibility estimator.
Comment: PhD Thesis. 233 pages
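The abstract does not describe how uPIT works internally. As a rough illustration of utterance-level permutation invariant training, the sketch below computes a mean squared error for every assignment of estimated sources to reference sources over the whole utterance and keeps the smallest. The function names and the flat list-of-floats representation are assumptions made for illustration; the thesis applies the idea to the outputs of a deep network.

```python
from itertools import permutations
from typing import List, Sequence, Tuple


def mse(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def upit_loss(
    estimates: List[Sequence[float]],
    targets: List[Sequence[float]],
) -> Tuple[float, Tuple[int, ...]]:
    """Return the lowest utterance-level loss and the permutation achieving it.

    perm[i] is the index of the target assigned to estimates[i]. One
    permutation is chosen for the entire utterance, not per frame.
    """
    n_sources = len(targets)
    best_loss = float("inf")
    best_perm: Tuple[int, ...] = ()
    for perm in permutations(range(n_sources)):
        loss = sum(mse(estimates[i], targets[p]) for i, p in enumerate(perm)) / n_sources
        if loss < best_loss:
            best_loss, best_perm = loss, perm
    return best_loss, best_perm
```

The exhaustive search grows factorially with the number of sources, which is acceptable for the two- or three-talker mixtures typically considered.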
Deep Learning applied to Visual Speech Recognition
Visual Speech Recognition (VSR) or Automatic Lip-Reading (ALR), the artificial process used to infer visemes, words, or sentences from video inputs, is efficient yet far from being a day-to-day tool. With the evolution of deep learning models and the proliferation of databases (DB), vocabularies increase in quality and quantity. Large DBs feed end-to-end deep learning (DL) models that extract speech based solely on visual recognition of the speaker’s lip movements. However, producing a large DB requires substantial resources that are unavailable to the majority of ALR researchers, which hinders larger-scale progress.
This dissertation contributes to the development of ALR by diversifying the training data on which DL depends. This includes producing a new DB, in the Portuguese language, capable of state-of-the-art (SOTA) performance. As DL only shows SOTA performance if trained on a large DB, the resources for which are beyond the scope of this dissertation, a knowledge-leveraging method emerges as a necessary subsequent objective.
A large DB and a SOTA model are selected and used as templates, from which a smaller DB (LusaPt) is created, comprising 100 phrases by 10 speakers, uttering 50 typical Portuguese digits and words, recorded and processed with day-to-day equipment. After being pre-trained on the SOTA DB, the new model is fine-tuned on the new DB. To validate LusaPt, the performance of the new model is compared with that of the SOTA model.
Results reveal that, if the same video is repeatedly submitted to the same model, the same prediction is obtained. Tests also show a clear increase in the word recognition rate (WRR), from 0% when inferring with the SOTA model with no further training on the new DB to over 95% when inferring with the new model.
Besides showing a “powerful belief” of the SOTA model in its predictions, this work also validates the new DB and its creation methodology. It reinforces that the transfer learning process is efficient in learning a new language and, therefore, new words. Another contribution is to demonstrate that, with day-to-day equipment and limited human resources, it is possible to enrich the DB corpora and, ultimately, to positively impact the performance and future of Automatic Lip-Reading.
Recurrent neural networks for multi-microphone speech separation
This thesis takes the classical signal processing problem of separating the speech of a target speaker from a real-world audio recording containing noise, background interference (from competing speech or other non-speech sources), and reverberation, and seeks data-driven solutions based on supervised learning methods, particularly recurrent neural networks (RNNs). Such speech separation methods can make automatic speech recognition (ASR) systems more robust and have been an active area of research for the past two decades. We focus in particular on applications where multi-channel recordings are available.
Stand-alone beamformers cannot simultaneously suppress diffuse noise and protect the desired signal from distortion. Post-filters complement the beamformers in obtaining the minimum mean squared error (MMSE) estimate of the desired signal. Time-frequency (TF) masking, a method with roots in computational auditory scene analysis (CASA), is a suitable candidate for post-filtering, but the challenge lies in estimating the TF masks. The thesis proposes using RNNs, in particular the bi-directional long short-term memory (BLSTM) architecture, as a post-filter that estimates TF masks for a delay-and-sum beamformer (DSB) from magnitude spectral and phase-based features.
Data from the CHiME-3 challenge, recorded in 4 challenging realistic environments, is used. Two different TF masks, Wiener filter and log-ratio, are identified as suitable targets for learning. The separated speech is evaluated using objective speech intelligibility measures: short-time objective intelligibility (STOI) and frequency-weighted segmental SNR (fwSNR). The word error rates (WERs) reported by the previous state-of-the-art ASR back-end, when fed with the test data of the CHiME-3 challenge, are interpreted against the objective scores to understand how the two are related. Overall, the RNNs bring a consistent improvement in the objective scores compared with feed-forward neural networks and a baseline MVDR beamformer.
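To make the notion of a TF mask as a learning target more concrete, here is a minimal sketch of a Wiener-style mask computed from known speech and noise power spectra and then applied to a mixture magnitude spectrogram. It assumes the common textbook form S / (S + N) per time-frequency bin; the exact mask definitions used in the thesis (including the log-ratio mask) are not given in the abstract, and the function names and `eps` parameter are illustrative.

```python
from typing import List

Spectrogram = List[List[float]]  # frames x frequency bins


def wiener_mask(speech_pow: Spectrogram, noise_pow: Spectrogram, eps: float = 1e-12) -> Spectrogram:
    """Per-bin ratio of speech power to total power, in the range [0, 1].

    eps is a small assumed constant that avoids division by zero in silent bins.
    """
    return [
        [s / (s + n + eps) for s, n in zip(s_frame, n_frame)]
        for s_frame, n_frame in zip(speech_pow, noise_pow)
    ]


def apply_mask(mask: Spectrogram, mixture_mag: Spectrogram) -> Spectrogram:
    """Scale each mixture bin by its mask value to attenuate noise-dominated bins."""
    return [
        [m * x for m, x in zip(m_frame, x_frame)]
        for m_frame, x_frame in zip(mask, mixture_mag)
    ]
```

During training, a mask like this would be computed from the separately available speech and noise signals and used as the regression target; at test time the network's estimated mask takes its place.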