8 research outputs found
Collapsed speech segment detection and suppression for WaveNet vocoder
In this paper, we propose a technique to alleviate the quality degradation
caused by collapsed speech segments sometimes generated by the WaveNet vocoder.
The effectiveness of the WaveNet vocoder for generating natural speech from
acoustic features has been demonstrated in recent work. However, it sometimes
generates very noisy speech with collapsed speech segments when only a limited
amount of training data is available or significant acoustic mismatches exist
between the training and testing data. Such limitations of the corpus and of
the model's ability can easily occur in some speech generation
applications, such as voice conversion and speech enhancement. To address this
problem, we propose a technique to automatically detect collapsed speech
segments. Moreover, to refine the detected segments, we also propose a waveform
generation technique for WaveNet using a linear predictive coding constraint.
Verification and subjective tests are conducted to investigate the
effectiveness of the proposed techniques. The verification results indicate
that the detection technique can detect most collapsed segments. The subjective
evaluations of voice conversion demonstrate that the generation technique
significantly improves the speech quality while maintaining the same speaker
similarity.
Comment: 5 pages, 6 figures. Proc. Interspeech, 201
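The abstract above does not say how the linear predictive coding (LPC) constraint is applied during WaveNet generation. As background only, the following is a minimal sketch of linear prediction itself, using the standard autocorrelation method with the Levinson-Durbin recursion. The function names and the toy frame are illustrative assumptions, not taken from the paper.

```python
def autocorrelation(frame, max_lag):
    """Autocorrelation r[0..max_lag] of one frame of samples."""
    n = len(frame)
    return [
        sum(frame[t] * frame[t - lag] for t in range(lag, n))
        for lag in range(max_lag + 1)
    ]


def lpc_coefficients(frame, order):
    """Levinson-Durbin recursion (autocorrelation method).

    Returns coefficients a[0..order-1] such that x[t] is approximated by
    sum(a[k] * x[t - 1 - k] for k in range(order)).
    """
    r = autocorrelation(frame, order)
    a = [0.0] * (order + 1)  # a[0] is unused; a[1..order] are the predictor taps
    err = r[0]               # prediction error energy, updated at every order
    for i in range(1, order + 1):
        if err <= 0.0:
            break  # silent or perfectly predictable frame: stop early
        k = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / err
        prev = a[:]
        a[i] = k
        for j in range(1, i):
            a[j] = prev[j] - k * prev[i - j]
        err *= 1.0 - k * k
    return a[1:]


def predict_next(history, coeffs):
    """Linear prediction of the next sample from the most recent samples."""
    return sum(c * history[-1 - k] for k, c in enumerate(coeffs))


# Toy frame (an assumption for illustration): a decaying exponential,
# which a first-order predictor models almost exactly.
frame = [0.9 ** t for t in range(200)]
coeffs = lpc_coefficients(frame, order=1)
```

For this toy frame the single coefficient comes out very close to 0.9. A constraint built on such a predictor could, for example, keep each generated sample near the LPC prediction, but how the paper actually does this is not stated in the abstract.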
Refined WaveNet Vocoder for Variational Autoencoder Based Voice Conversion
This paper presents a refinement framework for WaveNet vocoders in
variational autoencoder (VAE) based voice conversion (VC), which reduces the
quality distortion caused by the mismatch between the training data and testing
data. Conventional WaveNet vocoders are trained with natural acoustic features
but conditioned on the converted features in the conversion stage for VC, and
such a mismatch often causes significant quality and similarity degradation. In
this work, we take advantage of the particular structure of VAEs to refine
WaveNet vocoders with the self-reconstructed features generated by the VAE, which
have characteristics similar to those of the converted features while sharing the
same temporal structure as the target natural features. We analyze these
features and show that the self-reconstructed features are similar to the
converted features. Objective and subjective experimental results demonstrate
the effectiveness of our proposed framework.
Comment: 5 pages, 7 figures, 1 table. Accepted to EUSIPCO 201
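The distinction between converted and self-reconstructed features can be illustrated with a toy data-flow sketch. Everything here is an assumption made for illustration: `encode` and `decode` are hypothetical stand-ins for a trained VAE, and the scalar `speaker_offset` stands in for whatever speaker conditioning the real decoder uses. The only point shown is that self-reconstructed features keep the frame count of the target utterance, so they stay aligned with the target waveform, whereas converted features follow the source utterance.

```python
import random


def encode(features, rng):
    """Hypothetical encoder: one latent vector per input frame."""
    return [[0.5 * x + rng.gauss(0.0, 0.01) for x in frame] for frame in features]


def decode(latents, speaker_offset):
    """Hypothetical decoder conditioned on a (toy) speaker code."""
    return [[2.0 * z + speaker_offset for z in frame] for frame in latents]


rng = random.Random(0)
TARGET_SPEAKER = 1.0  # toy speaker code, an assumption

# Toy acoustic features: lists of frames with 3 dimensions each.
source_feats = [[0.1 * t, 0.2, 0.3] for t in range(40)]  # source utterance
target_feats = [[0.2 * t, 0.1, 0.4] for t in range(55)]  # target utterance

# Conversion: source content decoded with the target speaker code.
converted = decode(encode(source_feats, rng), TARGET_SPEAKER)

# Self-reconstruction: target content decoded with the target speaker code.
# These frames line up one-to-one with the target's natural features, so they
# can be paired with the target waveform when refining the vocoder.
self_reconstructed = decode(encode(target_feats, rng), TARGET_SPEAKER)
```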
Risky Action Recognition in Lane Change Video Clips using Deep Spatiotemporal Networks with Segmentation Mask Transfer
Advanced driver assistance and automated driving systems rely on risk
estimation modules to predict and avoid dangerous situations. Current methods
use expensive sensor setups and complex processing pipelines, limiting their
availability and robustness. To address these issues, we introduce a novel deep
learning based action recognition framework for classifying dangerous lane
change behavior in short video clips captured by a monocular camera. We
designed a deep spatiotemporal classification network that uses the pre-trained
state-of-the-art instance segmentation network Mask R-CNN as its spatial
feature extractor for this task. The Long Short-Term Memory (LSTM) and
shallower final classification layers of the proposed method were trained on a
semi-naturalistic lane change dataset with annotated risk labels. A
comprehensive comparison of state-of-the-art feature extractors was carried out
to find the best network layout and training strategy. The best result, with a
0.937 AUC score, was obtained with the proposed network. Our code and trained
models are available open-source.
Comment: 8 pages, 3 figures, 1 table. The code is open-sourc
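The abstract reports an AUC score without defining it. Assuming it means the usual area under the ROC curve for binary risky/safe labels, the metric can be sketched as the probability that a randomly chosen risky clip receives a higher score than a randomly chosen safe clip. The function name, labels, and scores below are illustrative assumptions, not data from the paper.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    labels: 1 for the positive class (e.g. a risky lane change), 0 otherwise.
    scores: the classifier's predicted score for the positive class.
    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example: four clips, two of them risky.
auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])  # 3 of 4 pairs ranked correctly: 0.75
```

The pairwise form is quadratic in the number of clips, which is fine for a sketch; a sort-based rank computation gives the same value for larger evaluation sets.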