Attention-Based End-to-End Speech Recognition on Voice Search
Recently, there has been a growing interest in end-to-end speech recognition
that directly transcribes speech to text without any predefined alignments. In
this paper, we explore the use of an attention-based encoder-decoder model for
Mandarin speech recognition on a voice search task. Previous attempts have
shown that applying attention-based encoder-decoder models to Mandarin speech
recognition was quite difficult due to the logographic orthography of Mandarin,
the large vocabulary and the conditional dependency of the attention model. In
this paper, we use character embedding to deal with the large vocabulary.
Several tricks are used for effective model training, including L2
regularization, Gaussian weight noise and frame skipping. We compare two
attention mechanisms and use attention smoothing to cover long context in the
attention model. Taken together, these tricks allow us to finally achieve a
character error rate (CER) of 3.58% and a sentence error rate (SER) of 7.43% on
the MiTV voice search dataset. Combined with a trigram language model,
CER and SER reach 2.81% and 5.77%, respectively.
Empirical Evaluation of Speaker Adaptation on DNN based Acoustic Model
Speaker adaptation aims to estimate a speaker specific acoustic model from a
speaker independent one to minimize the mismatch between the training and
testing conditions arising from speaker variability. A variety of neural
network adaptation methods have been proposed since deep learning models
became mainstream. However, an experimental comparison of the different
methods is still lacking, especially now that DNN-based acoustic models have
advanced greatly. In this paper, we aim to close this gap by providing an
empirical evaluation of three typical speaker adaptation methods: LIN, LHUC and
KLD. Adaptation experiments, with different sizes of adaptation data, are
conducted on a strong TDNN-LSTM acoustic model. More challengingly, the
source and target models considered here are a standard Mandarin speaker model
and an accented Mandarin speaker model. We compare the performance of the
different methods and their combinations. Speaker adaptation performance is
also examined by the speaker's degree of accent.
Comment: Interspeech 201
UKnow: A Unified Knowledge Protocol for Common-Sense Reasoning and Vision-Language Pre-training
This work presents a unified knowledge protocol, called UKnow, which
facilitates knowledge-based studies from the perspective of data. Particularly
focusing on visual and linguistic modalities, we categorize data knowledge into
five unit types, namely, in-image, in-text, cross-image, cross-text, and
image-text. Following this protocol, we collect, from public international
news, a large-scale multimodal knowledge graph dataset that consists of
1,388,568 nodes (with 571,791 vision-related ones) and 3,673,817 triplets. The
dataset is also annotated with rich event tags, including 96 coarse labels and
9,185 fine labels, expanding its potential usage. To further verify that UKnow
can serve as a standard protocol, we set up an efficient pipeline to help
reorganize existing datasets under the UKnow format. Finally, we benchmark the
performance of some widely-used baselines on the tasks of common-sense
reasoning and vision-language pre-training. Results on both our new dataset and
the reformatted public datasets demonstrate the effectiveness of UKnow in
knowledge organization and method evaluation. Code, dataset, conversion tool,
and baseline models will be made public.
Investigating Generative Adversarial Networks based Speech Dereverberation for Robust Speech Recognition
We investigate the use of generative adversarial networks (GANs) in speech
dereverberation for robust speech recognition. GANs have recently been studied
for speech enhancement to remove additive noise, but no work has yet
examined their ability in speech dereverberation, and the advantages of
using GANs have not been fully established. In this paper, we provide an
in-depth investigation of the use of a GAN-based dereverberation front-end in
ASR. First, we study the effectiveness of different dereverberation networks (the generator
in the GAN) and find that LSTM leads to a significant improvement compared with
feed-forward DNN and CNN on our dataset. Second, adding residual
connections to the deep LSTMs further boosts performance. Finally, we
find that, for the success of GAN, it is important to update the generator and
the discriminator using the same mini-batch data during training. Moreover,
using the reverberant spectrogram as a condition for the discriminator, as suggested in
previous studies, may degrade the performance. In summary, our GAN-based
dereverberation front-end achieves 14%-19% relative CER reduction as compared
to the baseline DNN dereverberation network when tested on a strong
multi-condition training acoustic model.
Comment: Interspeech 201
Highly Efficient Blue-Emitting CsPbBr3 Perovskite Nanocrystals through Neodymium Doping.
Colloidal CsPbX3 (X = Br, Cl, and I) perovskite nanocrystals exhibit tunable bandgaps over the entire visible spectrum and high photoluminescence quantum yields in the green and red regions. However, the lack of highly efficient blue-emitting perovskite nanocrystals limits their development for optoelectronic applications. Herein, neodymium(III) (Nd3+) doped CsPbBr3 nanocrystals are prepared through the ligand-assisted reprecipitation method at room temperature with tunable photoemission from green to deep blue. A blue-emitting nanocrystal with a central wavelength at 459 nm, an exceptionally high photoluminescence quantum yield of 90%, and a spectral width of 19 nm is achieved. First-principles calculations reveal that the increase in photoluminescence quantum yield upon doping is driven by an enhancement of the exciton binding energy due to increased electron and hole effective masses and an increase in oscillator strength due to shortening of the Pb-Br bond. Putting these results together, an all-perovskite white light-emitting diode is successfully fabricated, demonstrating that B-site composition engineering is a reliable strategy to further exploit the perovskite family for wider optoelectronic applications.
Strong structural and electronic coupling in metavalent PbS moire superlattices
Moire superlattices are twisted bilayer materials, in which the tunable
interlayer quantum confinement offers access to new physics and novel device
functionalities. Previously, moire superlattices were built exclusively using
materials with weak van der Waals interactions and synthesizing moire
superlattices with strong interlayer chemical bonding was considered to be
impractical. Here, using lead sulfide (PbS) as an example, we report a strategy
for synthesizing moire superlattices coupled by strong chemical bonding. We
use water-soluble ligands as a removable template to obtain free-standing
ultra-thin PbS nanosheets and assemble them into direct-contact bilayers with
various twist angles. Atomic-resolution imaging shows the moire periodic
structural reconstruction at the superlattice interface, due to the strong
metavalent coupling. Electron energy loss spectroscopy and theoretical
calculations collectively reveal the twist-angle-dependent electronic
structure, especially the emergent separation of flat bands at small twist
angles. The localized states of flat bands are similar to well-arranged quantum
dots, promising applications in devices. This study opens a new door to the
exploration of deep energy modulations within moire superlattices as an
alternative to van der Waals twistronics.