Phonetic Feedback for Speech Enhancement With and Without Parallel Speech Data
While deep learning systems have gained significant ground in speech
enhancement research, they have yet to exploit the full potential of deep
learning to provide high-level feedback. In particular,
phonetic feedback is rare in speech enhancement research even though it
includes valuable top-down information. We use the technique of mimic loss to
provide phonetic feedback to an off-the-shelf enhancement system, and find
gains in objective intelligibility scores on CHiME-4 data. This technique uses
a frozen acoustic model trained on clean speech to provide valuable feedback to
the enhancement model, even in the case where no parallel speech data is
available. Our work is one of the first to show intelligibility improvement for
neural enhancement systems without parallel speech data, and we show phonetic
feedback can improve a state-of-the-art neural enhancement system trained with
parallel speech data.
Comment: 4 pages + 1 page for references, accepted to ICASSP 2020
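A minimal sketch of how mimic loss can attach phonetic feedback to an enhancement model, assuming a PyTorch setup with hypothetical enhancer and acoustic_model modules (the paper's exact loss weighting and targets may differ):

import torch
import torch.nn.functional as F

def mimic_loss(enhancer, acoustic_model, noisy_spec, clean_spec):
    # The acoustic model is trained on clean speech and kept frozen, e.g.
    # acoustic_model.requires_grad_(False) before training starts.
    enhanced = enhancer(noisy_spec)
    with torch.no_grad():  # no gradient needed for the clean reference
        clean_post = acoustic_model(clean_spec)
    enhanced_post = acoustic_model(enhanced)  # gradients flow back to the enhancer
    # Spectral reconstruction term plus the phonetic "mimic" term.
    return F.l1_loss(enhanced, clean_spec) + F.mse_loss(enhanced_post, clean_post)

In the setting without parallel speech data, the spectral reconstruction term is unavailable, and the phonetic feedback from the frozen acoustic model is what drives training.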
Semi-Supervised Monaural Singing Voice Separation With a Masking Network Trained on Synthetic Mixtures
We study the problem of semi-supervised singing voice separation, in which
the training data contains a set of samples of mixed music (singing and
instrumental) and an unmatched set of instrumental music. Our solution employs
a single mapping function g, which, applied to a mixed sample, recovers the
underlying instrumental music, and, applied to an instrumental sample, returns
the same sample. The network g is trained using purely instrumental samples, as
well as on synthetic mixed samples that are created by mixing reconstructed
singing voices with random instrumental samples. Our results indicate that we
are on a par with or better than fully supervised methods, which are also
provided with training samples of unmixed singing voices, and are better than
other recent semi-supervised methods.
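A minimal sketch of the two training terms described above, assuming PyTorch and a hypothetical network g operating on batched signals (the paper's full objective may include additional terms):

import torch
import torch.nn.functional as F

def separation_losses(g, mixed, instrumental):
    # Identity term: applied to instrumental music, g should act as the identity.
    identity = F.l1_loss(g(instrumental), instrumental)
    # Recover the singing voice as the residual of the current separation;
    # the detach is one way to stop gradients through the voice estimate.
    voice = (mixed - g(mixed)).detach()
    # Synthetic mixtures: remix reconstructed voices with random instrumentals
    # and require g to return the instrumental used in the mix.
    perm = torch.randperm(instrumental.size(0))
    target = instrumental[perm]
    synthetic = F.l1_loss(g(voice + target), target)
    return identity + synthetic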
Adversarial Representation Learning for Private Speech Generation
As more and more data is collected in various settings across organizations,
companies, and countries, the demand for user privacy has increased.
Developing privacy-preserving methods for data analytics is thus an
important area of research. In this work we present a model based on generative
adversarial networks (GANs) that learns to obfuscate specific sensitive
attributes in speech data. We train a model that learns to hide sensitive
information in the data, while preserving the meaning in the utterance. The
model is trained in two steps: first to filter sensitive information in the
spectrogram domain, and then to generate new and private information
independent of the information that was filtered out. The model is based on a U-Net CNN that takes
mel-spectrograms as input. A MelGAN is used to invert the spectrograms back to
raw audio waveforms. We show that it is possible to hide sensitive information
such as gender by generating new data, trained adversarially to maintain
utility and realism.
Comment: Submitted to the ICML 2020 Workshop on Self-supervision in Audio and
Speech (SAS)
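A minimal sketch of the two-step objective, assuming a PyTorch setup with hypothetical modules filter_net (the U-Net filter), gen_net (the generator), and disc (a discriminator outputting the probability of the sensitive attribute); the discriminator would be trained in alternation to predict the true attribute, and the paper's exact losses may differ:

import torch
import torch.nn.functional as F

def privacy_step(filter_net, gen_net, disc, mel):
    # Step 1: filter the sensitive attribute (e.g. gender) from the
    # mel-spectrogram while staying close to the input for utility.
    filtered = filter_net(mel)
    pred = disc(filtered)  # assumed to be a probability in (0, 1)
    # Adversarial term pushes the attribute prediction toward chance (0.5).
    filter_loss = (
        F.binary_cross_entropy(pred, torch.full_like(pred, 0.5))
        + F.l1_loss(filtered, mel)
    )
    # Step 2: synthesize new attribute information from independent noise,
    # so the output carries an attribute unrelated to the original speaker.
    z = torch.randn(mel.size(0), 128, device=mel.device)
    private_mel = gen_net(filtered.detach(), z)
    # private_mel would then be inverted to a waveform with a MelGAN vocoder.
    return filter_loss, private_mel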