Microphone Array Speech Enhancement Via Beamforming Based Deep Learning Network
In general, in-car speech enhancement is an application of microphone array speech enhancement to a particular acoustic environment. Speech enhancement inside moving cars remains an active research topic, and researchers continue to develop modules that improve speech quality and intelligibility in cars. Passenger dialogue inside the car, the sound of other equipment, and a wide range of interference effects are the major challenges for speech separation in the in-car environment. To address these challenges, a novel Beamforming based Deep Learning Network (Bf-DLN) has been proposed for speech enhancement. First, the captured microphone array signals are pre-processed with an adaptive beamforming technique, the Linearly Constrained Minimum Variance (LCMV) beamformer. The proposed method then transforms the pre-processed data into images via a time-frequency representation: the smoothed pseudo-Wigner-Ville distribution (SPWVD) converts the time-domain speech inputs into images. A convolutional deep belief network (CDBN) extracts the most pertinent features from these transformed images, and an Enhanced Elephant Herding Algorithm (EEHA) selects the desired source while eliminating the interfering source. Experimental results demonstrate the effectiveness of the proposed strategy in removing background noise from the original speech signal, and the proposed strategy outperforms existing methods in terms of PESQ, STOI, SSNRI, and SNR. The proposed Bf-DLN achieves a maximum PESQ of 1.98, whereas existing models such as the two-stage Bi-LSTM, DNN-C, and GCN reach 1.82, 1.75, and 1.68, respectively; the PESQ of the proposed method is 1.75%, 3.15%, and 4.22% better than that of the existing GCN, DNN-C, and Bi-LSTM techniques. The efficacy of the proposed method is further validated by these experiments.
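The front end named in the abstract is the classic LCMV beamformer. As a rough illustration of that step only (the function name, array size, toy covariance, and constraint set below are illustrative assumptions, not the paper's configuration), here is a minimal NumPy sketch of the textbook closed-form LCMV solution:

```python
import numpy as np

def lcmv_weights(R, C, f):
    """Closed-form LCMV weights: minimise w^H R w subject to C^H w = f,
    which gives w = R^{-1} C (C^H R^{-1} C)^{-1} f."""
    Rinv_C = np.linalg.solve(R, C)            # R^{-1} C without an explicit inverse
    return Rinv_C @ np.linalg.solve(C.conj().T @ Rinv_C, f)

# Toy example: 4-microphone array, one distortionless constraint.
M = 4
rng = np.random.default_rng(0)
A = rng.standard_normal((M, M)) + 1j * rng.standard_normal((M, M))
R = A @ A.conj().T + M * np.eye(M)            # Hermitian, positive-definite covariance
d = np.exp(-2j * np.pi * 0.25 * np.arange(M)) # steering vector toward the target
w = lcmv_weights(R, d[:, None], np.array([1.0 + 0j]))
print(np.allclose(d.conj() @ w, 1.0))         # constraint holds -> True
```

With a single distortionless constraint toward the target, as in this toy example, LCMV reduces to the familiar MVDR beamformer.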
Deep Complex U-Net with Conformer for Audio-Visual Speech Enhancement
Recent studies have increasingly acknowledged the advantages of incorporating
visual data into speech enhancement (SE) systems. In this paper, we introduce a
novel audio-visual SE approach, termed DCUC-Net (deep complex U-Net with
conformer network). The proposed DCUC-Net leverages complex domain features and
a stack of conformer blocks. The encoder and decoder of DCUC-Net are designed
using a complex U-Net-based framework. The audio and visual signals are
processed using a complex encoder and a ResNet-18 model, respectively. These
processed signals are then fused using the conformer blocks and transformed
into enhanced speech waveforms via a complex decoder. The conformer blocks
consist of a combination of self-attention mechanisms and convolutional
operations, enabling DCUC-Net to effectively capture both global and local
audio-visual dependencies. Our experimental results demonstrate the
effectiveness of DCUC-Net, as it outperforms the baseline model from the
COG-MHEAR AVSE Challenge 2023 by a notable margin of 0.14 in terms of PESQ.
Additionally, the proposed DCUC-Net performs comparably to a state-of-the-art
model and outperforms all other compared models on the Taiwan Mandarin speech
with video (TMSV) dataset.
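As a picture of the conformer blocks described above (self-attention for global dependencies plus a depthwise-convolution module for local ones), here is a minimal PyTorch sketch; the feature dimension, normalisation choices, kernel size, and half-step FFN weighting follow generic conformer conventions and are assumptions, not the actual DCUC-Net configuration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvModule(nn.Module):
    """Conformer conv module: pointwise GLU -> depthwise conv -> pointwise."""
    def __init__(self, dim, kernel_size=15):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.pw1 = nn.Conv1d(dim, 2 * dim, 1)
        self.dw = nn.Conv1d(dim, dim, kernel_size,
                            padding=kernel_size // 2, groups=dim)
        self.pw2 = nn.Conv1d(dim, dim, 1)

    def forward(self, x):                     # x: (batch, frames, dim)
        y = self.norm(x).transpose(1, 2)      # Conv1d expects (batch, dim, frames)
        y = F.glu(self.pw1(y), dim=1)         # gated pointwise expansion
        y = self.pw2(F.silu(self.dw(y)))      # local context via depthwise conv
        return y.transpose(1, 2)

class ConformerBlock(nn.Module):
    """Half-step FFN -> self-attention -> conv module -> half-step FFN."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        def ffn():
            return nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, 4 * dim),
                                 nn.SiLU(), nn.Linear(4 * dim, dim))
        self.ffn1, self.ffn2 = ffn(), ffn()
        self.attn_norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.conv = ConvModule(dim)
        self.out_norm = nn.LayerNorm(dim)

    def forward(self, x):
        x = x + 0.5 * self.ffn1(x)
        q = self.attn_norm(x)
        x = x + self.attn(q, q, q, need_weights=False)[0]  # global dependencies
        x = x + self.conv(x)                               # local dependencies
        x = x + 0.5 * self.ffn2(x)
        return self.out_norm(x)

fused = torch.randn(2, 100, 256)        # e.g. fused audio-visual features
print(ConformerBlock()(fused).shape)    # torch.Size([2, 100, 256])
```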
Audio-Visual Speech Separation in Noisy Environments with a Lightweight Iterative Model
We propose Audio-Visual Lightweight ITerative model (AVLIT), an effective and
lightweight neural network that uses Progressive Learning (PL) to perform
audio-visual speech separation in noisy environments. To this end, we adopt the
Asynchronous Fully Recurrent Convolutional Neural Network (A-FRCNN), which has
shown successful results in audio-only speech separation. Our architecture
consists of an audio branch and a video branch, with iterative A-FRCNN blocks
sharing weights for each modality. We evaluated our model in a controlled
environment using the NTCD-TIMIT dataset and in-the-wild using a synthetic
dataset that combines LRS3 and WHAM!. The experiments demonstrate the
superiority of our model in both settings with respect to various audio-only
and audio-visual baselines. Furthermore, the reduced footprint of our model
makes it suitable for low-resource applications.
Comment: Accepted by Interspeech 202
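The footprint claim rests on weight sharing: one block per modality is reused for every refinement pass, so iterating does not grow the parameter count. A toy PyTorch sketch of that pattern (GRUs stand in for the A-FRCNN blocks, and the fusion rule and feature shapes are assumptions):

```python
import torch
import torch.nn as nn

class IterativeAVNet(nn.Module):
    """One audio block and one video block, reused with the same parameters
    on every refinement pass (stand-ins for the A-FRCNN blocks)."""
    def __init__(self, dim=128, num_iters=4):
        super().__init__()
        self.audio_block = nn.GRU(dim, dim, batch_first=True)
        self.video_block = nn.GRU(dim, dim, batch_first=True)
        self.fuse = nn.Linear(2 * dim, dim)
        self.num_iters = num_iters

    def forward(self, audio, video):
        # Assumes the video features were resampled to the audio frame rate.
        a, v = audio, video
        for _ in range(self.num_iters):            # weights shared across passes
            a, _ = self.audio_block(a)
            v, _ = self.video_block(v)
            a = self.fuse(torch.cat([a, v], dim=-1))  # refine audio with visual cues
        return a

model = IterativeAVNet()
a = torch.randn(2, 200, 128)                       # (batch, frames, features)
v = torch.randn(2, 200, 128)
print(model(a, v).shape)                           # torch.Size([2, 200, 128])
```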
Cross-Attention is all you need: Real-Time Streaming Transformers for Personalised Speech Enhancement
Personalised speech enhancement (PSE), which extracts only the speech of a
target user and removes everything else from a recorded audio clip, can
potentially improve users' experiences of audio AI modules deployed in the
wild. To support a large variety of downstream audio tasks, such as real-time
ASR and audio-call enhancement, a PSE solution should operate in a streaming
mode, i.e., input audio cleaning should happen in real-time with a small
latency and real-time factor. Personalisation is typically achieved by
extracting a target speaker's voice profile from an enrolment audio, in the
form of a static embedding vector, and then using it to condition the output of
a PSE model. However, a fixed target speaker embedding may not be optimal under
all conditions. In this work, we present a streaming Transformer-based PSE
model and propose a novel cross-attention approach that gives adaptive target
speaker representations. We present extensive experiments and show that our
proposed cross-attention approach outperforms competitive baselines
consistently, even when our model is only approximately half the size.
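One way to picture the proposed adaptive conditioning: let each mixture frame attend over the enrolment frames with cross-attention, so the target-speaker representation varies per frame instead of being a single static embedding. A minimal PyTorch sketch with assumed shapes (not the paper's exact architecture):

```python
import torch
import torch.nn as nn

dim, heads = 256, 4
cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

mix = torch.randn(2, 200, dim)     # mixture features: (batch, frames, dim)
enrol = torch.randn(2, 50, dim)    # features of the target user's enrolment audio

# Each mixture frame queries the enrolment frames, yielding a time-varying
# speaker representation rather than one fixed embedding vector.
adaptive_spk, _ = cross_attn(query=mix, key=enrol, value=enrol,
                             need_weights=False)
conditioned = mix + adaptive_spk   # condition the enhancement stream on it
print(conditioned.shape)           # torch.Size([2, 200, 256])
```

A real streaming deployment would additionally restrict the mixture-side processing to be causal; the sketch above only illustrates the adaptive-conditioning idea.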
A Hybrid Continuity Loss to Reduce Over-Suppression for Time-domain Target Speaker Extraction
A speaker extraction algorithm extracts the target speech from a speech mixture
containing interference speech and background noise. The extraction process
sometimes over-suppresses the extracted target speech, which not only creates
artifacts during listening but also harms the performance of downstream
automatic speech recognition algorithms. We propose a hybrid continuity loss
function for time-domain speaker extraction algorithms to address the
over-suppression problem. On top of the waveform-level loss used for superior
signal quality, i.e., SI-SDR, we introduce a multi-resolution delta spectrum
loss in the frequency-domain, to ensure the continuity of an extracted speech
signal, thus alleviating the over-suppression. We examine the hybrid continuity
loss function using a time-domain audio-visual speaker extraction algorithm on
the LRS2-BBC dataset. Experimental results show that the proposed loss
function reduces the over-suppression and improves the word error rate of
speech recognition on both clean and noisy two-speakers mixtures, without
harming the reconstructed speech quality.
Comment: Submitted to Interspeech 202
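A plausible reading of the loss described above: negative SI-SDR on the waveform plus an L1 penalty on frame-to-frame magnitude-spectrum deltas at several STFT resolutions. The sketch below follows that reading; the FFT sizes, hop lengths, and trade-off weight alpha are assumptions, not the paper's settings:

```python
import torch

def si_sdr(est, ref, eps=1e-8):
    """Scale-invariant SDR in dB; est, ref: (batch, samples)."""
    scale = (est * ref).sum(-1, keepdim=True) / \
            ((ref * ref).sum(-1, keepdim=True) + eps)
    target = scale * ref                      # projection of est onto ref
    noise = est - target
    return 10 * torch.log10((target.pow(2).sum(-1) + eps) /
                            (noise.pow(2).sum(-1) + eps))

def delta_spectrum_loss(est, ref, fft_sizes=(512, 1024, 2048)):
    """L1 distance between temporal deltas of magnitude spectra,
    averaged over several STFT resolutions."""
    loss = 0.0
    for n_fft in fft_sizes:
        win = torch.hann_window(n_fft, device=est.device)
        S_est = torch.stft(est, n_fft, n_fft // 4, window=win,
                           return_complex=True).abs()
        S_ref = torch.stft(ref, n_fft, n_fft // 4, window=win,
                           return_complex=True).abs()
        d_est = S_est[..., 1:] - S_est[..., :-1]   # frame-to-frame change
        d_ref = S_ref[..., 1:] - S_ref[..., :-1]
        loss = loss + (d_est - d_ref).abs().mean()
    return loss / len(fft_sizes)

def hybrid_loss(est, ref, alpha=0.1):     # alpha is an assumed trade-off weight
    return -si_sdr(est, ref).mean() + alpha * delta_spectrum_loss(est, ref)

est, ref = torch.randn(2, 16000), torch.randn(2, 16000)
print(hybrid_loss(est, ref))
```

Penalising the spectral deltas discourages the extractor from abruptly zeroing out frames of the target speech, which is exactly the over-suppression symptom the abstract describes.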