Simulating dysarthric speech for training data augmentation in clinical speech applications
Training machine learning algorithms for speech applications requires large,
labeled training data sets. This is problematic for clinical applications where
obtaining such data is prohibitively expensive because of privacy concerns or
lack of access. As a result, clinical speech applications are typically
developed using small data sets with only tens of speakers. In this paper, we
propose a method for simulating training data for clinical applications by
transforming healthy speech to dysarthric speech using adversarial training. We
evaluate the efficacy of our approach using both objective and subjective
criteria. We present the transformed samples to five experienced
speech-language pathologists (SLPs) and ask them to identify the samples as
healthy or dysarthric. The results reveal that the SLPs identify the
transformed speech as dysarthric 65% of the time. In a pilot classification
experiment, we show that by using the simulated speech samples to balance an
existing dataset, the classification accuracy improves by about 10% after data
augmentation.
Comment: Will appear in Proc. of ICASSP 201
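The balancing step in the pilot classification experiment can be illustrated with a minimal sketch. It assumes the data is a list of (sample, label) pairs and that simulated samples are kept in a separate pool per label. The function name `balance_with_simulated` and the data layout are hypothetical and are not taken from the paper, which does not describe its balancing procedure in this abstract.

```python
import random
from collections import Counter


def balance_with_simulated(real, simulated, seed=0):
    """Top up under-represented classes with simulated samples.

    real:      list of (sample, label) pairs from the existing dataset
    simulated: dict mapping label -> list of simulated samples (assumed layout)
    Returns a new list; the real data comes first and is left unchanged.
    """
    counts = Counter(label for _, label in real)
    target = max(counts.values())  # size of the largest class
    rng = random.Random(seed)      # fixed seed keeps the draw reproducible
    balanced = list(real)
    for label, n in counts.items():
        deficit = target - n
        pool = simulated.get(label, [])
        # Draw without replacement; a small pool may leave the class short.
        extra = rng.sample(pool, min(deficit, len(pool)))
        balanced.extend((sample, label) for sample in extra)
    return balanced
```

Only classes that fall short of the largest class receive simulated samples, so the real recordings are never discarded or duplicated.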
A critical analysis of self-supervision, or what we can learn from a single image
We look critically at popular self-supervision techniques for learning deep
convolutional neural networks without manual labels. We show that three
different and representative methods, BiGAN, RotNet and DeepCluster, can learn
the first few layers of a convolutional network from a single image as well as
using millions of images and manual labels, provided that strong data
augmentation is used. However, for deeper layers the gap with manual
supervision cannot be closed even if millions of unlabelled images are used for
training. We conclude that: (1) the weights of the early layers of deep
networks contain limited information about the statistics of natural images,
that (2) such low-level statistics can be learned through self-supervision just
as well as through strong supervision, and that (3) the low-level statistics
can be captured via synthetic transformations instead of using a large image
dataset.
Comment: Accepted paper at the International Conference on Learning
Representations (ICLR) 202
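The idea of drawing many training examples from a single image through augmentation can be sketched as follows. This is only an illustration under stated assumptions: the image is a 2D list of pixel values, and the only transformations are random crops and horizontal flips. The abstract does not list the augmentations the authors actually use, and the function names here are hypothetical.

```python
import random


def random_crop_flip(image, crop_h, crop_w, rng):
    """Return one randomly cropped, randomly mirrored patch of a 2D image."""
    h, w = len(image), len(image[0])
    if crop_h > h or crop_w > w:
        raise ValueError("crop size exceeds image size")
    top = rng.randint(0, h - crop_h)    # randint is inclusive on both ends
    left = rng.randint(0, w - crop_w)
    patch = [row[left:left + crop_w] for row in image[top:top + crop_h]]
    if rng.random() < 0.5:
        patch = [row[::-1] for row in patch]  # horizontal flip
    return patch


def augment_single_image(image, n, crop_h, crop_w, seed=0):
    """Generate n augmented patches from one source image."""
    rng = random.Random(seed)
    return [random_crop_flip(image, crop_h, crop_w, rng) for _ in range(n)]
```

Calling `augment_single_image(image, 1000, 32, 32)` on a sufficiently large image would yield a thousand patches that all share the low-level statistics of the one source image.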
FuSSI-Net: Fusion of Spatio-temporal Skeletons for Intention Prediction Network
Pedestrian intention recognition is important for developing robust and safe
autonomous driving (AD) and advanced driver assistance systems (ADAS)
functionalities for urban driving. In this work, we develop an end-to-end
pedestrian intention framework that performs well in both day- and night-time
scenarios. Our framework relies on object detection bounding boxes combined
with skeletal features of human pose. We study early, late, and combined (early
and late) fusion mechanisms to exploit the skeletal features, reduce false
positives, and improve intention prediction performance. The early
fusion mechanism achieves an AP of 0.89 and precision/recall of 0.79/0.89 for
pedestrian intention classification. Furthermore, we propose three new metrics
to properly evaluate pedestrian intention systems. Under these new
evaluation metrics for intention prediction, the proposed end-to-end
network offers accurate pedestrian intention prediction up to half a second ahead of the
actual risky maneuver.
Comment: 5 pages, 6 figures, 5 tables, IEEE Asilomar SS
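The difference between early and late fusion can be shown with a minimal sketch. It assumes each branch (bounding-box features and skeletal features) is scored by a simple logistic model. The actual FuSSI-Net architecture is not described in this abstract, so the functions, weights, and the mixing parameter `alpha` below are hypothetical stand-ins.

```python
import math


def linear_score(features, weights, bias=0.0):
    """Logistic score in [0, 1] from a weighted sum of features."""
    if len(features) != len(weights):
        raise ValueError("features and weights must have equal length")
    z = sum(f * w for f, w in zip(features, weights)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def early_fusion(box_feats, skel_feats, weights, bias=0.0):
    """Concatenate both feature vectors, then score them with one model."""
    fused = list(box_feats) + list(skel_feats)
    return linear_score(fused, weights, bias)


def late_fusion(box_feats, skel_feats, box_weights, skel_weights, alpha=0.5):
    """Score each branch separately, then mix the two scores."""
    p_box = linear_score(box_feats, box_weights)
    p_skel = linear_score(skel_feats, skel_weights)
    return alpha * p_box + (1.0 - alpha) * p_skel
```

In early fusion a single model sees both feature sets and can learn interactions between them. In late fusion the branches stay independent until their scores are combined.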