2,861 research outputs found
Adversarial Speaker Adaptation
We propose a novel adversarial speaker adaptation (ASA) scheme, in which
adversarial learning is applied to regularize the distribution of deep hidden
features in a speaker-dependent (SD) deep neural network (DNN) acoustic model
to be close to that of a fixed speaker-independent (SI) DNN acoustic model
during adaptation. An additional discriminator network is introduced to
distinguish the deep features generated by the SD model from those produced by
the SI model. In ASA, with a fixed SI model as the reference, an SD model is
jointly optimized with the discriminator network to minimize the senone
classification loss, and simultaneously to mini-maximize the SI/SD
discrimination loss on the adaptation data. With ASA, a senone-discriminative
deep feature is learned in the SD model with a similar distribution to that of
the SI model. With such a regularized and adapted deep feature, the SD model
can perform improved automatic speech recognition on the target speaker's
speech. Evaluated on the Microsoft short message dictation dataset, ASA
achieves 14.4% and 7.9% relative word error rate improvements for supervised
and unsupervised adaptation, respectively, over an SI model trained from 2600
hours data, with 200 adaptation utterances per speaker.Comment: 5 pages, 2 figures, ICASSP 201
Generative Model with Coordinate Metric Learning for Object Recognition Based on 3D Models
Given large amount of real photos for training, Convolutional neural network
shows excellent performance on object recognition tasks. However, the process
of collecting data is so tedious and the background are also limited which
makes it hard to establish a perfect database. In this paper, our generative
model trained with synthetic images rendered from 3D models reduces the
workload of data collection and limitation of conditions. Our structure is
composed of two sub-networks: semantic foreground object reconstruction network
based on Bayesian inference and classification network based on multi-triplet
cost function for avoiding over-fitting problem on monotone surface and fully
utilizing pose information by establishing sphere-like distribution of
descriptors in each category which is helpful for recognition on regular photos
according to poses, lighting condition, background and category information of
rendered images. Firstly, our conjugate structure called generative model with
metric learning utilizing additional foreground object channels generated from
Bayesian rendering as the joint of two sub-networks. Multi-triplet cost
function based on poses for object recognition are used for metric learning
which makes it possible training a category classifier purely based on
synthetic data. Secondly, we design a coordinate training strategy with the
help of adaptive noises acting as corruption on input images to help both
sub-networks benefit from each other and avoid inharmonious parameter tuning
due to different convergence speed of two sub-networks. Our structure achieves
the state of the art accuracy of over 50\% on ShapeNet database with data
migration obstacle from synthetic images to real photos. This pipeline makes it
applicable to do recognition on real images only based on 3D models.Comment: 14 page
- …