255 research outputs found

    Attention-Based End-to-End Speech Recognition on Voice Search

    Recently, there has been growing interest in end-to-end speech recognition that directly transcribes speech to text without any predefined alignments. In this paper, we explore the use of an attention-based encoder-decoder model for Mandarin speech recognition on a voice search task. Previous attempts have shown that applying an attention-based encoder-decoder to Mandarin speech recognition is quite difficult due to the logographic orthography of Mandarin, the large vocabulary and the conditional dependency of the attention model. We use character embedding to deal with the large vocabulary. Several tricks are used for effective model training, including L2 regularization, Gaussian weight noise and frame skipping. We compare two attention mechanisms and use attention smoothing to cover long context in the attention model. Taken together, these tricks allow us to achieve a character error rate (CER) of 3.58% and a sentence error rate (SER) of 7.43% on the MiTV voice search dataset. Combined with a trigram language model, the CER and SER reach 2.81% and 5.77%, respectively.

    Empirical Evaluation of Speaker Adaptation on DNN based Acoustic Model

    Speaker adaptation aims to estimate a speaker-specific acoustic model from a speaker-independent one to minimize the mismatch between the training and testing conditions arising from speaker variabilities. A variety of neural network adaptation methods have been proposed since deep learning models became mainstream. However, an experimental comparison between the different methods is still lacking, especially now that DNN-based acoustic models have advanced greatly. In this paper, we aim to close this gap by providing an empirical evaluation of three typical speaker adaptation methods: LIN, LHUC and KLD. Adaptation experiments, with different sizes of adaptation data, are conducted on a strong TDNN-LSTM acoustic model. More challengingly, the source and target models considered here are a standard Mandarin speaker model and an accented Mandarin speaker model. We compare the performance of the different methods and their combinations. Speaker adaptation performance is also examined by the speaker's degree of accent. Comment: Interspeech 201

    UKnow: A Unified Knowledge Protocol for Common-Sense Reasoning and Vision-Language Pre-training

    This work presents a unified knowledge protocol, called UKnow, which facilitates knowledge-based studies from the perspective of data. Focusing particularly on visual and linguistic modalities, we categorize data knowledge into five unit types, namely, in-image, in-text, cross-image, cross-text, and image-text. Following this protocol, we collect, from public international news, a large-scale multimodal knowledge graph dataset that consists of 1,388,568 nodes (571,791 of them vision-related) and 3,673,817 triplets. The dataset is also annotated with rich event tags, including 96 coarse labels and 9,185 fine labels, expanding its potential usage. To further verify that UKnow can serve as a standard protocol, we set up an efficient pipeline to help reorganize existing datasets under the UKnow format. Finally, we benchmark the performance of some widely used baselines on the tasks of common-sense reasoning and vision-language pre-training. Results on both our new dataset and the reformatted public datasets demonstrate the effectiveness of UKnow in knowledge organization and method evaluation. Code, dataset, conversion tool, and baseline models will be made public.

    Investigating Generative Adversarial Networks based Speech Dereverberation for Robust Speech Recognition

    We investigate the use of generative adversarial networks (GANs) in speech dereverberation for robust speech recognition. GANs have recently been studied for speech enhancement to remove additive noise, but no work has yet examined their ability in speech dereverberation, and the advantages of using GANs have not been fully established. In this paper, we provide an in-depth investigation of the use of a GAN-based dereverberation front-end in ASR. First, we study the effectiveness of different dereverberation networks (the generator in the GAN) and find that an LSTM leads to a significant improvement compared with a feed-forward DNN and a CNN on our dataset. Second, adding residual connections to the deep LSTMs boosts the performance further. Finally, we find that, for the GAN to succeed, it is important to update the generator and the discriminator using the same mini-batch data during training. Moreover, using the reverberant spectrogram as a condition for the discriminator, as suggested in previous studies, may degrade the performance. In summary, our GAN-based dereverberation front-end achieves a 14%-19% relative CER reduction compared to the baseline DNN dereverberation network when tested on a strong multi-condition training acoustic model. Comment: Interspeech 201

    Strong structural and electronic coupling in metavalent PbS moire superlattices

    Moire superlattices are twisted bilayer materials, in which the tunable interlayer quantum confinement offers access to new physics and novel device functionalities. Previously, moire superlattices were built exclusively using materials with weak van der Waals interactions, and synthesizing moire superlattices with strong interlayer chemical bonding was considered to be impractical. Here, using lead sulfide (PbS) as an example, we report a strategy for synthesizing moire superlattices coupled by strong chemical bonding. We use water-soluble ligands as a removable template to obtain free-standing ultra-thin PbS nanosheets and assemble them into direct-contact bilayers with various twist angles. Atomic-resolution imaging shows the moire periodic structural reconstruction at the superlattice interface, due to the strong metavalent coupling. Electron energy loss spectroscopy and theoretical calculations collectively reveal the twist-angle-dependent electronic structure, especially the emergent separation of flat bands at small twist angles. The localized states of the flat bands are similar to well-arranged quantum dots, promising an application in devices. This study opens a new door to the exploration of deep energy modulations within moire superlattices, as an alternative to van der Waals twistronics.