Attention-Based End-to-End Speech Recognition on Voice Search

Authors: Shan Changhao, Wang Yujun, Xie Lei, Zhang Junbo
Publication date: 13/02/2018

Recently, there has been growing interest in end-to-end speech recognition that directly transcribes speech to text without any predefined alignments. In this paper, we explore the use of an attention-based encoder-decoder model for Mandarin speech recognition on a voice search task. Previous attempts have shown that applying attention-based encoder-decoder models to Mandarin speech recognition is quite difficult because of the logographic orthography of Mandarin, the large vocabulary, and the conditional dependency of the attention model. We use character embedding to deal with the large vocabulary. Several tricks are used for effective model training, including L2 regularization, Gaussian weight noise, and frame skipping. We compare two attention mechanisms and use attention smoothing to cover long context in the attention model. Taken together, these tricks allow us to achieve a character error rate (CER) of 3.58% and a sentence error rate (SER) of 7.43% on the MiTV voice search dataset. Combined with a trigram language model, CER and SER reach 2.81% and 5.77%, respectively.
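The two metrics reported in this abstract can be illustrated with a minimal sketch. It assumes the usual definitions: CER is the total character-level edit distance divided by the total number of reference characters, and SER is the fraction of utterances whose hypothesis differs from the reference. The function names `edit_distance`, `cer`, and `ser` are hypothetical and are not taken from the paper, whose scoring details are not stated in the abstract.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(refs, hyps):
    """Character error rate: total edit distance / total reference characters."""
    errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    chars = sum(len(r) for r in refs)
    return errors / chars


def ser(refs, hyps):
    """Sentence error rate: fraction of utterances with at least one error."""
    wrong = sum(1 for r, h in zip(refs, hyps) if r != h)
    return wrong / len(refs)
```

Because Mandarin transcripts are scored per character, passing each transcript as a plain Python string is enough for this sketch; no word segmentation is assumed.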

arXiv.org e-Print Archive

Crossref

Empirical Evaluation of Speaker Adaptation on DNN based Acoustic Model

Authors: Wang Ke, Wang Yujun, Xie Lei, Zhang Junbo
Publication venue: 'International Speech Communication Association'
Publication date: 25/10/2018

Speaker adaptation aims to estimate a speaker-specific acoustic model from a speaker-independent one, minimizing the mismatch between training and testing conditions that arises from speaker variability. A variety of neural network adaptation methods have been proposed since deep learning models became mainstream. However, an experimental comparison between the different methods is still lacking, especially now that DNN-based acoustic models have advanced greatly. In this paper, we aim to close this gap by providing an empirical evaluation of three typical speaker adaptation methods: LIN, LHUC and KLD. Adaptation experiments, with different amounts of adaptation data, are conducted on a strong TDNN-LSTM acoustic model. More challengingly, the source and target models we are concerned with are a standard Mandarin speaker model and an accented Mandarin speaker model. We compare the performance of the different methods and their combinations. Speaker adaptation performance is also examined by the speaker's degree of accent. Comment: Interspeech 201

arXiv.org e-Print Archive

Crossref

UKnow: A Unified Knowledge Protocol for Common-Sense Reasoning and Vision-Language Pre-training

Authors: Feng Yutong, Gong Biao, Lv Yiliang, Shen Yujun, Xie Xiaoying, Zhao Deli
Publication date: 14/02/2023

This work presents a unified knowledge protocol, called UKnow, which facilitates knowledge-based studies from the perspective of data. Focusing on the visual and linguistic modalities, we categorize data knowledge into five unit types: in-image, in-text, cross-image, cross-text, and image-text. Following this protocol, we collect, from public international news, a large-scale multimodal knowledge graph dataset that consists of 1,388,568 nodes (571,791 of them vision-related) and 3,673,817 triplets. The dataset is also annotated with rich event tags, including 96 coarse labels and 9,185 fine labels, expanding its potential usage. To further verify that UKnow can serve as a standard protocol, we set up an efficient pipeline to help reorganize existing datasets into the UKnow format. Finally, we benchmark the performance of some widely used baselines on the tasks of common-sense reasoning and vision-language pre-training. Results on both our new dataset and the reformatted public datasets demonstrate the effectiveness of UKnow in knowledge organization and method evaluation. Code, dataset, conversion tool, and baseline models will be made public.

arXiv.org e-Print Archive

Investigating Generative Adversarial Networks based Speech Dereverberation for Robust Speech Recognition

Authors: Sun Sining, Wang Ke, Wang Yujun, Xiang Fei, Xie Lei, Zhang Junbo
Publication venue: 'International Speech Communication Association'
Publication date: 25/10/2018

We investigate the use of generative adversarial networks (GANs) in speech dereverberation for robust speech recognition. GANs have recently been studied for speech enhancement to remove additive noise, but no work has yet examined their ability in speech dereverberation, and the advantages of using GANs have not been fully established. In this paper, we provide an in-depth investigation of a GAN-based dereverberation front-end for ASR. First, we study the effectiveness of different dereverberation networks (the generator in the GAN) and find that an LSTM leads to a significant improvement compared with a feed-forward DNN and a CNN on our dataset. Second, adding residual connections to the deep LSTMs boosts performance further. Finally, we find that, for the GAN to succeed, it is important to update the generator and the discriminator using the same mini-batch data during training. Moreover, using the reverberant spectrogram as a condition to the discriminator, as suggested in previous studies, may degrade performance. In summary, our GAN-based dereverberation front-end achieves a 14%-19% relative CER reduction compared to the baseline DNN dereverberation network when tested on a strong multi-condition training acoustic model. Comment: Interspeech 201

arXiv.org e-Print Archive

Crossref

Highly Efficient Blue-Emitting CsPbBr3 Perovskite Nanocrystals through Neodymium Doping.

Authors: Bravić Ivona, Dong Yurong, Liang Rongqing, Monserrat Bartomeu, Ou Qiongrong, Peng Bo, Xie Yujun, Yu Yan, Zhang Shuyu
Publication venue: Adv Sci (Weinh)
Publication date: 01/10/2020

Colloidal CsPbX3 (X = Br, Cl, and I) perovskite nanocrystals exhibit tunable bandgaps over the entire visible spectrum and high photoluminescence quantum yields in the green and red regions. However, the lack of highly efficient blue-emitting perovskite nanocrystals limits their development for optoelectronic applications. Herein, neodymium(III) (Nd3+) doped CsPbBr3 nanocrystals are prepared through the ligand-assisted reprecipitation method at room temperature, with tunable photoemission from green to deep blue. A blue-emitting nanocrystal with a central wavelength of 459 nm, an exceptionally high photoluminescence quantum yield of 90%, and a spectral width of 19 nm is achieved. First-principles calculations reveal that the increase in photoluminescence quantum yield upon doping is driven by an enhancement of the exciton binding energy due to increased electron and hole effective masses, and by an increase in oscillator strength due to shortening of the Pb-Br bond. Putting these results together, an all-perovskite white light-emitting diode is successfully fabricated, demonstrating that B-site composition engineering is a reliable strategy to further exploit the perovskite family for wider optoelectronic applications.

Apollo (Cambridge)

Strong structural and electronic coupling in metavalent PbS moire superlattices

Authors: Betzler Sophia, Bustillo Karen C., Ercius Peter, Ophus Colin, Song Zhigang, Wan Jiawei, Wang Lin-Wang, Wang Yu, Xie Yujun, Zheng Haimei
Publication date: 22/07/2022

Moire superlattices are twisted bilayer materials in which the tunable interlayer quantum confinement offers access to new physics and novel device functionalities. Previously, moire superlattices were built exclusively from materials with weak van der Waals interactions, and synthesizing moire superlattices with strong interlayer chemical bonding was considered impractical. Here, using lead sulfide (PbS) as an example, we report a strategy for synthesizing moire superlattices coupled by strong chemical bonding. We use water-soluble ligands as a removable template to obtain free-standing ultra-thin PbS nanosheets and assemble them into direct-contact bilayers with various twist angles. Atomic-resolution imaging shows the moire periodic structural reconstruction at the superlattice interface, due to the strong metavalent coupling. Electron energy loss spectroscopy and theoretical calculations collectively reveal the twist-angle-dependent electronic structure, especially the emergent separation of flat bands at small twist angles. The localized states of the flat bands are similar to well-arranged quantum dots, promising applications in devices. This study opens a new door to the exploration of deep energy modulations within moire superlattices as an alternative to van der Waals twistronics.

arXiv.org e-Print Archive