111 research outputs found
Speech Separation based on Contrastive Learning and Deep Modularization
Current state-of-the-art monaural speech separation tools rely on supervised
learning. As a result, they must deal with the permutation problem, and they
are affected by a mismatch between the number of speakers used in training and
in inference. Moreover, their performance relies heavily on the presence of
high-quality labelled data. These problems can be effectively addressed by
employing a fully unsupervised technique for speech separation. In this paper,
we use contrastive learning to establish the representations of frames, then use
the learned representations in the downstream deep modularization task.
Concretely, we demonstrate experimentally that in speech separation, different
frames of a speaker can be viewed as augmentations of a given hidden standard
frame of that speaker. The frames of a speaker share enough prosodic
information, which is key in speech separation. Based on this, we implement a
self-supervised learning scheme that learns to minimize the distance between
frames belonging to a given speaker. The learned representations are used in a
downstream deep modularization task to cluster frames based on speaker
identity. Evaluation of the developed technique on WSJ0-2mix and WSJ0-3mix
shows that the technique attains SI-SNRi and SDRi of 20.8 and 21.0 respectively
in WSJ0-2mix. In WSJ0-3mix, it attains SI-SNRi and SDRi of 20.7 and 20.7
respectively. Its greatest strength is that its performance does not degrade
significantly as the number of speakers increases.
Comment: arXiv admin note: substantial text overlap with arXiv:2212.0036
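The SI-SNRi figures quoted in this abstract are improvements in scale-invariant signal-to-noise ratio over the unprocessed mixture. The following is a minimal sketch of the standard SI-SNR definition, not the authors' evaluation code; the function names `si_snr` and `si_snr_improvement` and the synthetic signals are assumptions made for illustration.

```python
import numpy as np


def si_snr(estimate, target, eps=1e-8):
    """Scale-invariant SNR in dB between a 1-D estimate and a 1-D target."""
    # Remove the mean so that a DC offset does not affect the score.
    estimate = estimate - estimate.mean()
    target = target - target.mean()
    # Project the estimate onto the target to obtain the "signal" part.
    scale = np.dot(estimate, target) / (np.dot(target, target) + eps)
    projection = scale * target
    # Everything that is not explained by the target counts as noise.
    noise = estimate - projection
    ratio = (np.dot(projection, projection) + eps) / (np.dot(noise, noise) + eps)
    return 10.0 * np.log10(ratio)


def si_snr_improvement(estimate, mixture, target):
    """SI-SNRi: gain of the separated estimate over the raw mixture."""
    return si_snr(estimate, target) - si_snr(mixture, target)


# Synthetic example (hypothetical data, not WSJ0-2mix).
rng = np.random.default_rng(0)
target = rng.standard_normal(16000)
interference = rng.standard_normal(16000)
mixture = target + interference
estimate = target + 0.1 * interference  # interference attenuated tenfold
print(f"SI-SNRi: {si_snr_improvement(estimate, mixture, target):.2f} dB")
```

Because the estimate is projected onto the target before the ratio is taken, rescaling the estimate leaves the score essentially unchanged, which is the reason this metric is preferred over plain SNR for separation outputs.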
Audio-Visual Learning for Scene Understanding
Multimodal deep learning aims at combining the complementary information of different modalities. Among all modalities, audio and video are the predominant ones that humans use to explore the world. In this thesis, we decided to focus our study on audio-visual deep learning to mimic with our networks how humans perceive the world.
Our research includes images, audio signals and acoustic images. The latter provide spatial audio information and are obtained from a planar array of microphones by combining their raw audio signals with the beamforming algorithm. They better mimic the human auditory system, which cannot be replicated using a single microphone, since one microphone alone cannot provide spatial sound cues.
However, as microphone arrays are not widespread, we also study how to handle the missing spatialized audio modality at test time.
As a solution, we propose to distill acoustic image content into audio features during training in order to handle its absence at test time. This is done for supervised audio classification using the generalized distillation framework, which we also extend to self-supervised learning.
Next, we devise a method for reconstructing acoustic images given a single microphone and an RGB frame. Therefore, when only a standard video is available, we are able to synthesize spatial audio, which is useful for many audio-visual tasks, including sound localization.
Lastly, as another example of restoring one modality from the available ones, we inpaint degraded images using audio features, so that the reconstructed region is not only visually plausible but also semantically consistent with the related sound. This also includes cross-modal generation in the limit case of a completely missing or hidden visual modality: our method handles it naturally, being able to generate images from sound.
In summary, we show how audio can help visual learning and vice versa, by transferring knowledge between the two modalities at training time in order to distill, reconstruct, or restore the missing modality at test time.
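The distillation step mentioned in this abstract can be illustrated with the usual generalized distillation objective, in which a student is trained on a weighted mix of hard labels and temperature-softened teacher outputs. This is a minimal sketch under stated assumptions, not the thesis implementation: here the teacher is assumed to be a model with access to acoustic images and the student an audio-only model, and the names `distillation_loss`, `imitation` and `temperature` are hypothetical.

```python
import numpy as np


def softmax(logits, temperature=1.0):
    """Row-wise softmax with an optional temperature."""
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)  # subtract the max for stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def cross_entropy(target_probs, predicted_probs, eps=1e-12):
    """Mean cross-entropy between target and predicted distributions."""
    per_example = -np.sum(target_probs * np.log(predicted_probs + eps), axis=-1)
    return float(np.mean(per_example))


def distillation_loss(student_logits, teacher_logits, labels,
                      imitation=0.5, temperature=2.0):
    """(1 - imitation) * CE(hard labels) + imitation * CE(soft teacher labels)."""
    num_classes = student_logits.shape[-1]
    hard_labels = np.eye(num_classes)[labels]               # one-hot targets
    soft_labels = softmax(teacher_logits, temperature)      # softened teacher
    student_probs = softmax(student_logits)
    return ((1.0 - imitation) * cross_entropy(hard_labels, student_probs)
            + imitation * cross_entropy(soft_labels, student_probs))


# Hypothetical logits for two clips and three sound classes.
labels = np.array([0, 1])
teacher_logits = np.array([[4.0, 1.0, 0.0], [0.5, 3.0, 0.0]])
student_logits = np.array([[2.0, 0.5, 0.0], [0.0, 1.5, 0.5]])
print(distillation_loss(student_logits, teacher_logits, labels))
```

Setting `imitation` to 0 recovers ordinary supervised training, while values closer to 1 make the student rely more on what the teacher learned from the privileged modality.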
Self-Supervised Learning for Personalized Speech Enhancement
Speech enhancement systems can show improved performance by adapting the
model towards a single test-time speaker. In this personalization context, the
test-time user might only provide a small amount of noise-free speech data,
likely insufficient for traditional fully-supervised learning. One way to
overcome the lack of personal data is to transfer the model parameters from a
speaker-agnostic model to initialize the personalized model, and then to
finetune the model using the small amount of personal speech data. This
baseline marginally adapts over the scarce clean speech data. Alternatively, we
propose self-supervised methods that are designed specifically to learn
personalized and discriminative features from abundant in-the-wild noisy, but
still personal speech recordings. Our experiment shows that the proposed
self-supervised learning methods initialize personalized speech enhancement
models better than the baseline fully-supervised methods, yielding superior
speech enhancement performance. The proposed methods also result in a more
robust feature set under real-world conditions: compressed model sizes and
scarcity of labeled data.
Comment: 10 pages, 5 figures, under revie
Single-Microphone Speech Enhancement and Separation Using Deep Learning
The cocktail party problem comprises the challenging task of understanding a
speech signal in a complex acoustic environment, where multiple speakers and
background noise signals simultaneously interfere with the speech signal of
interest. A signal processing algorithm that can effectively increase the
speech intelligibility and quality of speech signals in such complicated
acoustic situations is highly desirable, especially for applications involving
mobile communication devices and hearing assistive devices. Due to the
re-emergence of machine learning techniques, today known as deep learning, the
challenges involved with such algorithms might be overcome. In this PhD thesis,
we study and develop deep learning-based techniques for two sub-disciplines of
the cocktail party problem: single-microphone speech enhancement and
single-microphone multi-talker speech separation. Specifically, we conduct an
in-depth empirical analysis of the generalizability of modern deep
learning-based single-microphone speech enhancement algorithms. We show that
performance of such algorithms is closely linked to the training data, and good
generalizability can be achieved with carefully designed training data.
Furthermore, we propose uPIT, a deep learning-based algorithm for
single-microphone speech separation and we report state-of-the-art results on a
speaker-independent multi-talker speech separation task. Additionally, we show
that uPIT works well for joint speech separation and enhancement without
explicit prior knowledge about the noise type or number of speakers. Finally,
we show that deep learning-based speech enhancement algorithms designed to
minimize the classical short-time spectral amplitude mean squared error lead
to enhanced speech signals which are essentially optimal in terms of STOI, a
state-of-the-art speech intelligibility estimator.
Comment: PhD Thesis. 233 page
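The uPIT algorithm named in this abstract (utterance-level permutation invariant training) resolves the ambiguity of which network output corresponds to which speaker by evaluating the loss for every output-to-speaker assignment over the whole utterance and keeping the smallest. The following is a minimal sketch of that idea with a plain mean squared error on spectral features; it is not the thesis code, and the function name `upit_mse` and the array shapes are assumptions.

```python
from itertools import permutations

import numpy as np


def upit_mse(estimates, targets):
    """Utterance-level permutation invariant MSE.

    estimates, targets: arrays of shape (num_speakers, num_frames, num_bins).
    Returns the lowest MSE over all assignments and the winning permutation,
    where permutation[i] is the index of the output matched to target i.
    """
    num_speakers = targets.shape[0]
    best_loss, best_perm = np.inf, None
    for perm in permutations(range(num_speakers)):
        # One assignment is used for the entire utterance, not per frame.
        loss = np.mean((estimates[list(perm)] - targets) ** 2)
        if loss < best_loss:
            best_loss, best_perm = loss, perm
    return best_loss, best_perm


# Hypothetical example: outputs are the targets in shuffled order.
rng = np.random.default_rng(0)
targets = rng.standard_normal((3, 50, 8))
estimates = targets[[2, 0, 1]]
loss, perm = upit_mse(estimates, targets)
print(loss, perm)
```

Choosing a single permutation per utterance, rather than per frame, is what keeps each output stream tied to one speaker throughout the utterance. The search is factorial in the number of speakers, which is acceptable for the two- and three-talker cases typically studied.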