FaceFilter: Audio-visual speech separation using still images
The objective of this paper is to separate a target speaker's speech from a
mixture of two speakers using a deep audio-visual speech separation network.
Unlike previous works that used lip movements in video clips or pre-enrolled
speaker information as an auxiliary conditional feature, we use a single face
image of the target speaker. In this task, the conditional feature is obtained
from facial appearance via a cross-modal biometric task in which audio and
visual identity representations share a latent space. Identities learnt from
face images force the network to isolate the matched speaker and extract that
speaker's voice from the mixed speech. This resolves the permutation problem
caused by swapped channel outputs, which frequently occurs in speech separation
tasks. The proposed method is far more practical than video-based speech
separation, since user profile images are readily available on many platforms.
Also, unlike speaker-aware separation methods, it is applicable to separation
with unseen speakers who have never been enrolled before. We show strong
qualitative and quantitative results on challenging real-world examples.
Comment: Under submission as a conference paper. Video examples: https://youtu.be/ku9xoLh62
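As an illustration of the conditioning idea described above, the sketch below shows a mask-based separator driven by an identity embedding computed from a single still image. All module names, sizes, and the toy face encoder are assumptions for readability, not the FaceFilter architecture itself.

```python
import torch
import torch.nn as nn

class ConditionedSeparator(nn.Module):
    """Toy mask-based separator conditioned on a face-derived identity vector."""
    def __init__(self, n_freq=257, emb_dim=128, hidden=256):
        super().__init__()
        # Stand-in for a cross-modal biometric encoder mapping an image to an
        # identity embedding shared with the audio identity space (assumption).
        self.face_encoder = nn.Sequential(
            nn.Flatten(), nn.Linear(3 * 112 * 112, emb_dim), nn.ReLU()
        )
        self.rnn = nn.LSTM(n_freq + emb_dim, hidden, batch_first=True)
        self.mask_head = nn.Sequential(nn.Linear(hidden, n_freq), nn.Sigmoid())

    def forward(self, mixture_spec, face_image):
        # mixture_spec: (batch, time, n_freq) magnitude spectrogram of the mixture
        # face_image:   (batch, 3, 112, 112) single still image of the target speaker
        ident = self.face_encoder(face_image)                         # (batch, emb_dim)
        ident = ident.unsqueeze(1).expand(-1, mixture_spec.size(1), -1)
        feats, _ = self.rnn(torch.cat([mixture_spec, ident], dim=-1))
        mask = self.mask_head(feats)                                  # (batch, time, n_freq)
        return mask * mixture_spec                                    # estimated target speech
```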
Cross-Attention is all you need: Real-Time Streaming Transformers for Personalised Speech Enhancement
Personalised speech enhancement (PSE), which extracts only the speech of a
target user and removes everything else from a recorded audio clip, can
potentially improve users' experiences of audio AI modules deployed in the
wild. To support a large variety of downstream audio tasks, such as real-time
ASR and audio-call enhancement, a PSE solution should operate in a streaming
mode, i.e., input audio cleaning should happen in real-time with a small
latency and real-time factor. Personalisation is typically achieved by
extracting a target speaker's voice profile from an enrolment audio, in the
form of a static embedding vector, and then using it to condition the output of
a PSE model. However, a fixed target speaker embedding may not be optimal under
all conditions. In this work, we present a streaming Transformer-based PSE
model and propose a novel cross-attention approach that gives adaptive target
speaker representations. We present extensive experiments and show that our
proposed cross-attention approach outperforms competitive baselines
consistently, even when our model is only approximately half the size of the baselines.
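The core idea, an adaptive speaker representation obtained by letting the noisy input attend over enrolment features instead of relying on one static embedding, can be sketched as follows. Dimensions and module names are illustrative assumptions, not the paper's exact streaming architecture.

```python
import torch
import torch.nn as nn

class CrossAttentionConditioner(nn.Module):
    """One cross-attention layer: noisy frames query the enrolment frames."""
    def __init__(self, d_model=256, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, mix_frames, enrol_frames):
        # mix_frames:   (batch, T_mix, d_model) features of the audio being cleaned
        # enrol_frames: (batch, T_enr, d_model) features of the enrolment clip
        # Each input frame attends over the enrolment, so the target-speaker
        # representation adapts to the current acoustic conditions.
        spk, _ = self.attn(query=mix_frames, key=enrol_frames, value=enrol_frames)
        return self.norm(mix_frames + spk)  # conditioned features for the PSE model
```

In a streaming deployment the enrolment features can be precomputed once, so only the mixture branch needs to run causally frame by frame.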
Deep Learning for Audio Signal Processing
Given the recent surge in developments of deep learning, this article
provides a review of the state-of-the-art deep learning techniques for audio
signal processing. Speech, music, and environmental sound processing are
considered side-by-side, in order to point out similarities and differences
between the domains, highlighting general methods, problems, key references,
and potential for cross-fertilization between areas. The dominant feature
representations (in particular, log-mel spectra and raw waveform) and deep
learning models are reviewed, including convolutional neural networks, variants
of the long short-term memory architecture, as well as more audio-specific
neural network models. Subsequently, prominent deep learning application areas
are covered, i.e. audio recognition (automatic speech recognition, music
information retrieval, environmental sound detection, localization and
tracking) and synthesis and transformation (source separation, audio
enhancement, generative models for speech, sound, and music synthesis).
Finally, key issues and future questions regarding deep learning applied to
audio signal processing are identified.
Comment: 15 pages, 2 pdf figures
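As a concrete example of the log-mel representation the review identifies as dominant, a minimal extraction pipeline might look like the following; the parameter values are common choices, not prescribed by the article.

```python
import torch
import torchaudio

# 80-band log-mel spectrogram from a 16 kHz waveform (typical settings).
mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=16000, n_fft=400, hop_length=160, n_mels=80
)
waveform = torch.randn(1, 16000)            # one second of dummy audio
log_mel = torch.log(mel(waveform) + 1e-6)   # shape: (1, 80, n_frames)
```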
Hierarchical speaker representation for target speaker extraction
Target speaker extraction aims to isolate a specific speaker's voice from a
composite of multiple sound sources, guided by an enrollment utterance, also
called an anchor. Current methods predominantly derive speaker embeddings from
the anchor and integrate them into the separation network to separate the voice
of the target speaker. However, the representation of the speaker embedding is
too simplistic, often being merely a 1×1024 vector, and such dense information
is difficult for the separation network to exploit effectively. To address this
limitation, we introduce a methodology called Hierarchical Representation (HR)
that fuses anchor information into the separation network at both granular and
overarching levels across five layers, enhancing the precision of target
extraction. HR amplifies the efficacy of the anchor to improve target speaker
isolation. On the Libri-2talker dataset, HR substantially outperforms
state-of-the-art time-frequency domain techniques. Further demonstrating HR's
capabilities, we achieved first place in the ICASSP 2023 Deep Noise Suppression
Challenge. The proposed HR methodology shows great promise for advancing target
speaker extraction through enhanced anchor utilization.
Comment: Accepted to ICASSP 202
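A minimal sketch of the multi-level fusion idea, assuming a FiLM-style injection of the anchor embedding into each of five blocks; this illustrates the concept only and is not the authors' released implementation.

```python
import torch
import torch.nn as nn

class FusedBlock(nn.Module):
    """One separator block with FiLM-style injection of the anchor embedding."""
    def __init__(self, d_feat=256, d_spk=1024):
        super().__init__()
        self.scale = nn.Linear(d_spk, d_feat)
        self.shift = nn.Linear(d_spk, d_feat)
        self.layer = nn.GRU(d_feat, d_feat, batch_first=True)

    def forward(self, x, spk):
        # x: (batch, time, d_feat) mixture features; spk: (batch, d_spk) anchor embedding
        x = x * self.scale(spk).unsqueeze(1) + self.shift(spk).unsqueeze(1)
        out, _ = self.layer(x)
        return out

class HierarchicalSeparator(nn.Module):
    """Anchor information is fused at every block, not only at the input."""
    def __init__(self, n_blocks=5, d_feat=256, d_spk=1024):
        super().__init__()
        self.blocks = nn.ModuleList([FusedBlock(d_feat, d_spk) for _ in range(n_blocks)])

    def forward(self, mix_feats, anchor_emb):
        for block in self.blocks:
            mix_feats = block(mix_feats, anchor_emb)
        return mix_feats
```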
Probing Self-supervised Learning Models with Target Speech Extraction
Large-scale pre-trained self-supervised learning (SSL) models have shown
remarkable advancements in speech-related tasks. However, the utilization of
these models in complex multi-talker scenarios, such as extracting a target
speaker in a mixture, is yet to be fully evaluated. In this paper, we introduce
target speech extraction (TSE) as a novel downstream task to evaluate the
feature extraction capabilities of pre-trained SSL models. TSE uniquely
requires both speaker identification and speech separation, distinguishing it
from other tasks in the Speech processing Universal PERformance Benchmark
(SUPERB) evaluation. Specifically, we propose a TSE downstream model composed
of two lightweight task-oriented modules based on the same frozen SSL model.
One module functions as a speaker encoder to obtain target speaker information
from an enrollment speech, while the other estimates the target speaker's mask
to extract its speech from the mixture. Experimental results on the Libri2mix
dataset reveal the relevance of the TSE downstream task for probing SSL models,
as its performance cannot be simply deduced from other related tasks such as
speaker verification and separation.
Comment: Accepted to ICASSP 2024, Self-supervision in Audio, Speech, and Beyond (SASB) workshop
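A rough sketch of the probing setup, assuming a generic frozen SSL encoder that maps waveforms to frame-level features; the two lightweight heads below are placeholders rather than the paper's exact downstream modules, and SSL and spectrogram frame rates are assumed to match for simplicity.

```python
import torch
import torch.nn as nn

class TSEProbe(nn.Module):
    """Two lightweight heads on top of a frozen SSL encoder (hypothetical setup)."""
    def __init__(self, ssl_model, ssl_dim=768, n_freq=257):
        super().__init__()
        self.ssl = ssl_model                  # frozen pre-trained encoder: wav -> (B, T, ssl_dim)
        for p in self.ssl.parameters():
            p.requires_grad_(False)
        self.spk_head = nn.Linear(ssl_dim, ssl_dim)                        # speaker encoder head
        self.mask_head = nn.Sequential(nn.Linear(2 * ssl_dim, n_freq), nn.Sigmoid())

    def forward(self, mixture_wav, enrollment_wav, mixture_spec):
        with torch.no_grad():
            mix_feat = self.ssl(mixture_wav)        # (batch, T, ssl_dim)
            enr_feat = self.ssl(enrollment_wav)     # (batch, T_enr, ssl_dim)
        spk = self.spk_head(enr_feat.mean(dim=1))   # target-speaker vector from enrollment
        spk = spk.unsqueeze(1).expand(-1, mix_feat.size(1), -1)
        mask = self.mask_head(torch.cat([mix_feat, spk], dim=-1))
        return mask * mixture_spec                  # extracted target spectrogram
```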