
    Improving the Computational Efficiency of Training and Application of Neural Language Models for Automatic Speech Recognition

    A language model is a vital component of automatic speech recognition systems. In recent years, advances in neural network technology have brought vast improvements in various machine learning tasks, including language modeling. However, compared to conventional backoff n-gram models, neural networks require much greater computational power and cannot completely replace the conventional methods. In this work, we examine the pipeline of a typical hybrid speech recognition system, in which the acoustic and language models are trained separately and used in conjunction, and propose ways to speed up the computation induced by the language model in its various components.

    In the context of neural-network language modeling, we propose a new loss function, which we call the linear loss, that modifies the standard cross-entropy loss so that the neural network learns to self-normalize. The linear loss significantly reduces inference-time computation and allows us to use an importance-sampling-based method to compute an unbiased estimator of the loss during training. We conduct extensive experiments comparing the linear loss with several commonly used self-normalizing loss functions and show the superiority of the linear loss. We also show that a well-trained language model trained with the cross-entropy loss can be converted into a self-normalizing linear-loss system with minimal additional training, preserving its performance while gaining the self-normalizing capability.

    We further refine the sampling procedure for commonly used sampling-based approaches. We propose a sampling-without-replacement scheme, which improves model performance and allows a more efficient algorithm to be used to minimize the sampling overhead, together with a speed-up of that algorithm which significantly reduces the sampling run-time without affecting performance. We demonstrate that the sampling-without-replacement scheme consistently outperforms traditional sampling-with-replacement methods across multiple training loss functions for language models. We also experiment with changing the sampling distribution for importance sampling by utilizing longer histories; for batched training, we propose generating the sampling distribution by averaging the n-gram distributions of the whole batch. Experiments show that sampling from longer histories can improve the rate of convergence and enhance the trained model's performance. To reduce the computational overhead of sampling from higher-order n-grams, we propose a two-stage sampling algorithm that adds only a small overhead compared with commonly used unigram-based sampling schemes.

    Finally, when applying a trained neural network for lattice rescoring in ASR, we propose a pruning algorithm that runs much faster than the standard algorithm and improves ASR performance. The methods proposed in this dissertation make the application of neural language models in speech recognition significantly more computationally efficient, allowing researchers to apply larger and more sophisticated networks and enabling companies to provide better speech-based services to customers. Some of the proposed methods are not limited to neural language modeling and may facilitate neural network research in other fields.
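    The abstract combines two ideas that are easy to illustrate together: training the output layer to self-normalize, so that inference can skip the softmax denominator, and estimating that denominator during training with importance sampling over a small set of sampled words. The linear loss itself is not defined in the abstract, so the sketch below only shows the general recipe, with a generic squared penalty on the estimated log-normalizer; the function name, signature, penalty form, and unigram proposal are assumptions for illustration, not the dissertation's formulation.

    import torch

    def sampled_self_normalizing_loss(hidden, targets, out_embed, out_bias,
                                      unigram_probs, num_samples=256, alpha=0.1):
        # hidden:        (batch, dim)  final hidden states from the language model
        # targets:       (batch,)      gold next-word ids
        # out_embed:     (vocab, dim)  output embedding matrix
        # out_bias:      (vocab,)      output bias
        # unigram_probs: (vocab,)      proposal distribution q(w) for sampling

        # Draw negative samples from the proposal q (with replacement here; the
        # dissertation argues that sampling without replacement works better).
        samples = torch.multinomial(unigram_probs, num_samples, replacement=True)

        # Score only the sampled words and the targets -- the point of sampling
        # is to avoid touching the full vocabulary at every training step.
        sample_scores = hidden @ out_embed[samples].t() + out_bias[samples]     # (B, k)
        target_scores = (hidden * out_embed[targets]).sum(-1) + out_bias[targets]

        # Importance-weighted estimate of the normalizer Z = sum_w exp(s_w):
        #   Z_hat = (1/k) * sum_{w ~ q} exp(s_w) / q(w)
        log_q = torch.log(unigram_probs[samples])
        log_z_hat = torch.logsumexp(sample_scores - log_q, dim=1) \
                    - torch.log(torch.tensor(float(num_samples)))

        # Sampled cross-entropy plus a penalty pushing log Z_hat toward 0, so that
        # raw scores can be read as log-probabilities at inference time without
        # computing a softmax over the whole vocabulary.
        ce = log_z_hat - target_scores
        penalty = alpha * log_z_hat.pow(2)
        return (ce + penalty).mean()

    The abstract's further refinements, sampling without replacement and batch-averaged higher-order n-gram proposals, would plug in where samples and log_q are computed above, although the exact weighting they require is not specified in the abstract.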

    Multi-blank Transducers for Speech Recognition

    This paper proposes a modification to RNN-Transducer (RNN-T) models for automatic speech recognition (ASR). In standard RNN-T, the emission of a blank symbol consumes exactly one input frame; in our proposed method, we introduce additional blank symbols that consume two or more input frames when emitted. We refer to the added symbols as big blanks, and to the method as multi-blank RNN-T. To train multi-blank RNN-Ts, we propose a novel logit under-normalization method that prioritizes the emission of big blanks. In experiments on multiple languages and datasets, multi-blank RNN-T brings relative inference speedups of over +90% and +139% on the English LibriSpeech and German Multilingual LibriSpeech datasets, respectively, while also consistently improving ASR accuracy. We will release our implementation of the method in the NeMo (\url{https://github.com/NVIDIA/NeMo}) toolkit. Comment: Submitted to ICASSP 202
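    The speedup comes entirely from the decoding loop: a big blank, once emitted, lets the decoder jump over several encoder frames at once instead of advancing one frame at a time. The sketch below is a simplified greedy decoder written under assumed interfaces; the joint and predictor objects, the blank-to-duration map, the symbol cap, and the function name are hypothetical and this is not the NeMo implementation.

    import torch

    # Hypothetical mapping from blank symbol id to the number of encoder frames
    # it consumes: one standard blank plus two "big blanks" (assumed inventory).
    BLANK_DURATIONS = {0: 1, 1: 2, 2: 4}

    def multi_blank_greedy_decode(encoder_out, joint, predictor,
                                  blank_durations=BLANK_DURATIONS,
                                  max_symbols_per_frame=10):
        # encoder_out: (T, enc_dim) acoustic encoder output for one utterance.
        hyp = []
        dec_out, dec_state = predictor.step(None, None)   # start-of-sequence state
        t, emitted = 0, 0
        while t < encoder_out.size(0):
            logits = joint(encoder_out[t], dec_out)       # (vocab + num_blanks,)
            k = int(torch.argmax(logits))
            if k in blank_durations or emitted >= max_symbols_per_frame:
                # A blank (standard or big) consumes one or more input frames,
                # so decoding jumps ahead by its duration instead of one frame.
                # The symbol cap only guards against a degenerate loop.
                t += blank_durations.get(k, 1)
                emitted = 0
            else:
                hyp.append(k)                             # emit token, stay on frame t
                dec_out, dec_state = predictor.step(k, dec_state)
                emitted += 1
        return hyp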

    Microvessel density and heparanase over-expression in clear cell renal cell cancer: correlations and prognostic significances

    Background: Tumor angiogenesis is important in the progression of malignancies, and heparanase plays an important role in sustaining the pathology of clear cell renal cell cancer (ccRCC). This study was carried out to investigate the correlations between microvessel density (MVD) and heparanase expression, and their prognostic significance, in patients with ccRCC.

    Methods: Specimens from 128 patients with ccRCC were investigated by immunohistochemistry for MVD. RT-PCR and immunohistochemistry were used to detect heparanase expression. Correlations between MVD, heparanase expression, and various clinico-pathological factors were studied. The prognostic significance of MVD and heparanase expression was also analysed.

    Results: We found a statistically significant prevalence of higher MVD in ccRCC compared with adjacent normal renal tissue. MVD was positively correlated with TNM stage and distant metastasis in ccRCC patients, and was also correlated with the expression level of heparanase. Heparanase was over-expressed and correlated with TNM stage, histologic grade, distant metastasis and lymphatic metastasis in ccRCC. High MVD and heparanase over-expression inversely correlated with the survival of ccRCC patients.

    Conclusions: Heparanase contributes to the angiogenesis of ccRCC, and over-expression of heparanase is an independent predictor of prognosis for ccRCC. MVD is correlated with tumor development and metastasis in ccRCC.

    Efficient Sequence Transduction by Jointly Predicting Tokens and Durations

    This paper introduces a novel Token-and-Duration Transducer (TDT) architecture for sequence-to-sequence tasks. TDT extends conventional RNN-Transducer architectures by jointly predicting both a token and its duration, i.e. the number of input frames covered by the emitted token. This is achieved by using a joint network with two outputs that are independently normalized to generate distributions over tokens and durations. During inference, TDT models can skip input frames guided by the predicted duration output, which makes them significantly faster than conventional Transducers, which process the encoder output frame by frame. TDT models achieve both better accuracy and significantly faster inference than conventional Transducers on different sequence transduction tasks. TDT models for Speech Recognition achieve better accuracy and up to 2.82X faster inference than RNN-Transducers. TDT models for Speech Translation achieve an absolute gain of over 1 BLEU on the MuST-C test set compared with conventional Transducers, with 2.27X faster inference. In Speech Intent Classification and Slot Filling tasks, TDT models improve intent accuracy by over 1% (absolute) over conventional Transducers, while running up to 1.28X faster.
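    The key mechanism is the second, independently normalized output head: at each step the decoder reads both a token and a duration, and the duration tells it how many encoder frames to jump. The sketch below is a simplified duration-guided greedy decoder; the joint and predictor interfaces, the duration set, and the safety cap are assumptions for illustration, not the paper's released implementation.

    import torch

    def tdt_greedy_decode(encoder_out, joint, predictor, blank_id=0,
                          durations=(0, 1, 2, 3, 4), max_symbols_per_frame=10):
        # encoder_out: (T, enc_dim) acoustic encoder output for one utterance.
        # `joint` is assumed to return two independently normalized heads:
        # token logits over the vocabulary (plus blank) and duration logits
        # over the allowed duration set.
        hyp = []
        dec_out, dec_state = predictor.step(None, None)    # start-of-sequence state
        t, emitted = 0, 0
        while t < encoder_out.size(0):
            token_logits, dur_logits = joint(encoder_out[t], dec_out)
            k = int(torch.argmax(token_logits))
            d = durations[int(torch.argmax(dur_logits))]
            if k != blank_id:
                hyp.append(k)
                dec_out, dec_state = predictor.step(k, dec_state)
                emitted += 1
            # The predicted duration is the number of encoder frames covered by
            # this emission, so the decoder jumps ahead by d instead of by 1.
            # Blanks (and the per-frame symbol cap) force at least one frame of
            # progress so the loop always terminates.
            if k == blank_id or emitted >= max_symbols_per_frame:
                d = max(d, 1)
            if d > 0:
                t += d
                emitted = 0
        return hyp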

    The JHU Parallel Corpus Filtering Systems for WMT 2018
