Single channel speech music separation using nonnegative matrix factorization and spectral masks
A single channel speech-music separation algorithm based on nonnegative matrix factorization (NMF) with spectral masks is proposed in this work. The proposed algorithm uses training data of speech and music signals with nonnegative matrix factorization, followed by masking, to separate the mixed signal. In the training stage, NMF uses the training data to train a set of basis vectors for each source. These bases are trained using NMF in the magnitude spectrum domain. After observing the mixed signal, NMF is used to decompose its magnitude spectra into a linear combination of the trained bases for both sources. The decomposition results are used to build a mask, which quantifies the contribution of each source to the mixed signal. Experimental results show that using masks after NMF improves the separation even when NMF is run for fewer iterations, which yields a faster separation process.
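The two-stage pipeline the abstract describes (train per-source bases, then decompose the mixture over the frozen, concatenated bases and build a soft mask) can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the spectrogram sizes, number of basis vectors, and random data standing in for real magnitude spectrograms are all assumptions for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

def nmf(V, k, n_iter=100):
    """Factor a nonnegative matrix V ~= B @ W with multiplicative updates."""
    B = rng.random((V.shape[0], k)) + 1e-9
    W = rng.random((k, V.shape[1])) + 1e-9
    for _ in range(n_iter):
        W *= (B.T @ V) / (B.T @ B @ W + 1e-9)
        B *= (V @ W.T) / (B @ W @ W.T + 1e-9)
    return B, W

# Training stage: learn basis vectors for each source from magnitude
# spectrograms (here random placeholders, freq bins x frames).
V_speech_train = rng.random((64, 200))
V_music_train = rng.random((64, 200))
B_speech, _ = nmf(V_speech_train, k=20)
B_music, _ = nmf(V_music_train, k=20)

# Separation stage: decompose the mixture over the fixed, concatenated
# bases; only the weights W are updated, the bases stay frozen.
V_mix = rng.random((64, 100))
B = np.concatenate([B_speech, B_music], axis=1)
W = rng.random((B.shape[1], V_mix.shape[1])) + 1e-9
for _ in range(100):
    W *= (B.T @ V_mix) / (B.T @ B @ W + 1e-9)

# Build a soft spectral mask from each source's reconstruction and
# apply it to the mixture spectrogram.
S_speech = B_speech @ W[:20]
S_music = B_music @ W[20:]
mask = S_speech / (S_speech + S_music + 1e-9)
est_speech = mask * V_mix
est_music = (1.0 - mask) * V_mix
```

Note that because the mask and its complement sum to one, the two estimated spectrograms add back up to the mixture spectrogram exactly, which is what makes masking more robust than using the raw NMF reconstructions when the factorization is run for few iterations.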
Single channel speech music separation using nonnegative matrix factorization with sliding windows and spectral masks
A single channel speech-music separation algorithm based on nonnegative matrix factorization (NMF) with sliding windows and spectral masks is proposed in this work. We train a set of basis vectors for each source signal using NMF in the magnitude spectral domain. Rather than forming each column of the matrices to be decomposed by NMF from a single spectral frame, we build each column from multiple consecutive spectral frames stacked together. After observing the mixed signal, NMF is used to decompose its magnitude spectra into a weighted linear combination of the trained basis vectors for both sources. An initial spectrogram estimate for each source is found, and a spectral mask is built from these initial estimates. This mask is used to weight the mixed signal spectrogram to find the contribution of each source signal to the mixed signal. The method is shown to perform better than the conventional NMF approach.
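The frame-stacking step is the only change from the conventional setup: each column of the matrix handed to NMF holds several consecutive spectral frames, so the learned bases capture short-term temporal structure. A minimal sketch of the stacking, with a tiny illustrative matrix (the function name and window length are assumptions):

```python
import numpy as np

def stack_frames(V, n_frames):
    """Stack n_frames consecutive spectral columns of V into each output
    column, sliding the window by one frame at a time."""
    n_bins, n_cols = V.shape
    n_windows = n_cols - n_frames + 1
    out = np.empty((n_bins * n_frames, n_windows))
    for t in range(n_windows):
        # Column-major ravel keeps each frame's bins contiguous.
        out[:, t] = V[:, t:t + n_frames].ravel(order="F")
    return out

V = np.arange(12.0).reshape(3, 4)   # 3 frequency bins x 4 frames
V_stacked = stack_frames(V, 2)      # 6 rows x 3 sliding windows
```

NMF is then applied to `V_stacked` instead of `V`; the learned bases live in the stacked domain, and the initial source estimates are averaged back over the overlapping windows before the mask is built.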
Audio-visual speech recognition with background music using single-channel source separation
In this paper, we consider audio-visual speech recognition with background music. The proposed algorithm is an integration of audio-visual speech recognition and single channel source separation (SCSS). We apply the proposed algorithm to recognize spoken speech that is mixed with music signals. First, the SCSS algorithm based on nonnegative matrix factorization (NMF) and spectral masks is used to separate the audio speech signal from the background music in the magnitude spectral domain. After the speech audio is separated from the music, regular audio-visual speech recognition (AVSR) is employed using multi-stream hidden Markov models. By employing the two approaches together, we aim to improve recognition accuracy both by processing the audio signal with SCSS and by supporting the recognition task with visual information. Experimental results show that combining audio-visual speech recognition with source separation gives remarkable improvements in the accuracy of the speech recognition system.
Deep Remix: Remixing Musical Mixtures Using a Convolutional Deep Neural Network
Audio source separation is a difficult machine learning problem and
performance is measured by comparing extracted signals with the component
source signals. However, if separation is motivated by the ultimate goal of
re-mixing then complete separation is not necessary and hence separation
difficulty and separation quality are dependent on the nature of the re-mix.
Here, we use a convolutional deep neural network (DNN), trained to estimate
'ideal' binary masks for separating voice from music, to perform re-mixing of
the vocal balance by operating directly on the individual magnitude components
of the musical mixture spectrogram. Our results demonstrate that small changes
in vocal gain may be applied with very little distortion to the ultimate
re-mix. Our method may be useful for re-mixing existing mixes.
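The remixing operation the abstract describes reduces to scaling the masked vocal components of the mixture spectrogram while leaving the accompaniment components untouched. A minimal NumPy sketch, where a random binary mask stands in for the one a trained convolutional DNN would estimate (the mask, spectrogram shapes, and `remix` helper are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical mixture magnitude spectrogram and binary voice mask;
# in the paper the mask is estimated by a convolutional DNN.
mix_mag = rng.random((64, 50))
voice_mask = rng.random((64, 50)) > 0.5

def remix(mix_mag, voice_mask, vocal_gain):
    """Re-mix by scaling the masked vocal components of the mixture
    spectrogram and keeping the remaining components unchanged."""
    vocal = mix_mag * voice_mask
    accompaniment = mix_mag * ~voice_mask
    return vocal_gain * vocal + accompaniment

boosted = remix(mix_mag, voice_mask, vocal_gain=1.5)  # raise vocal balance
```

With a gain of 1.0 the mixture passes through unchanged, which matches the paper's observation that small gain changes introduce little distortion: no separation error is incurred on the components left at unit gain.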
Self-Supervised Audio-Visual Co-Segmentation
Segmenting objects in images and separating sound sources in audio are
challenging tasks, in part because traditional approaches require large amounts
of labeled data. In this paper we develop a neural network model for visual
object segmentation and sound source separation that learns from natural videos
through self-supervision. The model is an extension of recently proposed work
that maps image pixels to sounds. Here, we introduce a learning approach to
disentangle concepts in the neural networks, and assign semantic categories to
network feature channels to enable independent image segmentation and sound
source separation after audio-visual training on videos. Our evaluations show
that the disentangled model outperforms several baselines in semantic
segmentation and sound source separation. Comment: Accepted to ICASSP 201