598 research outputs found

LightGrad: Lightweight Diffusion Probabilistic Model for Text-to-Speech

Author: Chen Jie
Pan Fuping
Peng Zhendong
Song Xingchen
Wu Zhiyong
Zhang Binbin
Publication date: 31/08/2023

Recent advances in neural text-to-speech (TTS) models have brought thousands of TTS applications into daily life, where models are deployed in the cloud to provide services for customers. Among these models are diffusion probabilistic models (DPMs), which can be stably trained and are more parameter-efficient than other generative models. As transmitting data between customers and the cloud introduces high latency and the risk of exposing private data, deploying TTS models on edge devices is preferred. When deploying DPMs on edge devices, there are two practical problems. First, current DPMs are not lightweight enough for resource-constrained devices. Second, DPMs require many denoising steps in inference, which increases latency. In this work, we present LightGrad, a lightweight DPM for TTS. LightGrad is equipped with a lightweight U-Net diffusion decoder and a training-free fast sampling technique, reducing both model parameters and inference latency. Streaming inference is also implemented in LightGrad to reduce latency further. Compared with Grad-TTS, LightGrad achieves a 62.2% reduction in parameters and a 65.7% reduction in latency, while preserving comparable speech quality on both Chinese Mandarin and English in 4 denoising steps. Comment: Accepted by ICASSP 202

arXiv.org e-Print Archive

ZeroPrompt: Streaming Acoustic Encoders are Zero-Shot Masked LMs

Author: Dang Bo
Pan Fuping
Peng Zhendong
Song Xingchen
Wu Di
Wu Zhiyong
Zhang Binbin
Publication date: 17/05/2023

In this paper, we present ZeroPrompt (Figure 1-(a)) and the corresponding Prompt-and-Refine strategy (Figure 3), two simple but effective training-free methods to decrease the Token Display Time (TDT) of streaming ASR models without any accuracy loss. The core idea of ZeroPrompt is to append zeroed content to each chunk during inference, which acts like a prompt to encourage the model to predict future tokens even before they are spoken. We argue that streaming acoustic encoders naturally have the modeling ability of Masked Language Models, and our experiments demonstrate that ZeroPrompt is cheap to engineer and can be applied to streaming acoustic encoders on any dataset without any accuracy loss. Specifically, compared with our baseline models, we achieve a 350 to 700 ms reduction in First Token Display Time (TDT-F) and a 100 to 400 ms reduction in Last Token Display Time (TDT-L), with theoretically and experimentally equal WER on both the Aishell-1 and Librispeech datasets. Comment: accepted by Interspeech 202

arXiv.org e-Print Archive

CB-Conformer: Contextual biasing Conformer for biased word recognition

Author: Qiaochu Huang
Kang Shiyin
Liu Baiji
Meng Helen
Song Xingchen
Wu Zhiyong
Xu Yaoxun
Publication date: 19/04/2023

Due to the mismatch between the source and target domains, how to better utilize biased word information to improve the performance of an automatic speech recognition model in the target domain has become a hot research topic. Previous approaches either decode with a fixed external language model or introduce a sizeable biasing module, which leads to poor adaptability and slow inference. In this work, we propose CB-Conformer to improve biased word recognition by introducing the Contextual Biasing Module and the Self-Adaptive Language Model to the vanilla Conformer. The Contextual Biasing Module combines audio fragments and contextual information, with only 0.2% of the model parameters of the original Conformer. The Self-Adaptive Language Model modifies the internal weights of biased words based on their recall and precision, resulting in a greater focus on biased words and more successful integration with the automatic speech recognition model than the standard fixed language model. In addition, we construct and release an open-source Mandarin biased-word dataset based on WenetSpeech. Experiments indicate that our proposed method brings a 15.34% character error rate reduction, a 14.13% biased word recall increase, and a 6.80% biased word F1-score increase compared with the base Conformer.

arXiv.org e-Print Archive

Fast-U2++: Fast and Accurate End-to-End Speech Recognition in Joint CTC/Attention Frames

Author: Li Shengqiang
Liang Chengdong
Pan Fuping
Peng Zhendong
Song Xingchen
Wu Di
Zhang Binbin
Zhang Xiao-Lei
Publication date: 02/11/2022

Recently, the unified streaming and non-streaming two-pass (U2/U2++) end-to-end model for speech recognition has shown great performance in terms of streaming capability, accuracy and latency. In this paper, we present fast-U2++, an enhanced version of U2++ that further reduces partial latency. The core idea of fast-U2++ is to output partial results from the bottom layers of its encoder with a small chunk, while using a large chunk in the top layers of its encoder to compensate for the performance degradation caused by the small chunk. Moreover, we use a knowledge distillation method to reduce the token emission latency. We present extensive experiments on the Aishell-1 dataset. Experiments and ablation studies show that, compared to U2++, fast-U2++ reduces model latency from 320ms to 80ms and achieves a character error rate (CER) of 5.06% with a streaming setup. Comment: 5 pages, 3 figures

arXiv.org e-Print Archive

TrimTail: Low-Latency Streaming ASR with Simple but Effective Spectrogram-Level Length Penalty

Author: Li Wenpeng
Pan Fuping
Peng Zhendong
Song Xingchen
Wu Di
Wu Zhiyong
Zhang Binbin
Zhang Yuekai
Zhu Changbao
Publication date: 01/11/2022

In this paper, we present TrimTail, a simple but effective emission regularization method to improve the latency of streaming ASR models. The core idea of TrimTail is to apply a length penalty (i.e., by trimming trailing frames, see Fig. 1-(b)) directly on the spectrogram of input utterances, which does not require any alignment. We demonstrate that TrimTail is computationally cheap and can be applied online and optimized with any training loss or any model architecture on any dataset without any extra effort, by applying it to various end-to-end streaming ASR networks trained with either CTC loss [1] or Transducer loss [2]. We achieve a 100 to 200 ms latency reduction with equal or even better accuracy on both Aishell-1 and Librispeech. Moreover, by using TrimTail, we can achieve a 400ms algorithmic improvement of User Sensitive Delay (USD) with an accuracy loss of less than 0.2. Comment: submitted to ICASSP 202

arXiv.org e-Print Archive

Taxonomic Distribution of FosB in Human-Microbiota and Activity Comparison of Fosfomycin Resistance

Author: Jing Li
Lianwen Qi
Owais Ahmad
Ping Li
Su Jiang
Xingchen Zhou
Xue Wang
Yuanyuan Li
Ziwei Song
Publication venue: 'Frontiers Media SA'
Publication date: 01/02/2019

FosB, a Mg2+-dependent thioltransferase, confers antibiotic resistance to fosfomycin through enzymatic drug inactivation. Among all antibiotic resistance proteins (ARPs) in the Antibiotic Resistance Genes Database and the Comprehensive Antibiotic Resistance Database, FosB ranks within the top 5% by number of ARPs identified in the Human Microbiome Project reference database, but it is mainly distributed in a limited number of genera, i.e., 122 of the total 133 FosB homologues are found in Bacillus and Staphylococcus. Furthermore, these FosB sequences could be divided into three clusters based on their phylogenetic relationship, i.e., two groups of FosB were mainly from Bacillus, and another was mainly from Staphylococcus. Finally, we confirmed that FosB from the Staphylococcus group presented the highest resistance to fosfomycin by in silico and in vitro comparisons. In summary, this study elaborates the specific taxonomic characteristics and resistance abilities of FosB in human microbiota, which might help in developing more promising fosfomycin-like antibiotics.

Directory of Open Access Journals

Bond-Selective Intensity Diffraction Tomography

Author: Chen Fukai
Chen Zhicong
Cheng Ji-Xin
Lin Xingchen
Matlock Alex
Song Ziqi
Tian Lei
Wang Biao
Xu Yihong
Zhan Yuewei
Zhao Jian
Zhu Hongbo
Zhu Jiabei
Publication date: 01/09/2022

Recovering molecular information remains a grand challenge in the widely used holographic and computational imaging technologies. To address this challenge, we developed a computational mid-infrared photothermal microscope, termed Bond-selective Intensity Diffraction Tomography (BS-IDT). Based on a low-cost brightfield microscope with an add-on pulsed light source, BS-IDT recovers both infrared spectra and bond-selective 3D refractive index maps from intensity-only measurements. High-fidelity infrared fingerprint spectra extraction is validated. Volumetric chemical imaging of biological cells is demonstrated at a speed of ~20 seconds per volume, with a lateral and axial resolution of ~350 nm and ~1.1 micron, respectively. BS-IDT's application potential is investigated by chemically quantifying lipids stored in cancer cells and by volumetric chemical imaging of Caenorhabditis elegans with a large field of view (~100 micron × 100 micron).

arXiv.org e-Print Archive