674 research outputs found

VQ-T: RNN Transducers using Vector-Quantized Prediction Network States

Authors: Haws David, Kingsbury Brian, Saon George, Shi Jiatong, Watanabe Shinji
Publication date: 02/08/2022

Beam search, the dominant ASR decoding algorithm for end-to-end models, generates tree-structured hypotheses. However, recent studies have shown that decoding with hypothesis merging can achieve a more efficient search with comparable or better performance. Yet the full context carried by recurrent networks is not compatible with hypothesis merging. We propose to use vector-quantized long short-term memory units (VQ-LSTM) in the prediction network of RNN transducers. By training the discrete representation jointly with the ASR network, hypotheses can be actively merged for lattice generation. Our experiments on the Switchboard corpus show that the proposed VQ RNN transducers improve ASR performance over transducers with regular prediction networks while also producing denser lattices with a very low oracle word error rate (WER) for the same beam size. Additional language model rescoring experiments also demonstrate the effectiveness of the proposed lattice generation scheme.

Comment: Interspeech 2022 accepted paper

arXiv.org e-Print Archive

Cross-Lingual Voice Conversion with Non-Parallel Data

Author: Alonso-Jiménez Pablo

In this project, a Phonetic Posteriorgram (PPG) based voice conversion system is implemented. The main goal is to perform and evaluate conversions of singing voice, considering both cross-gender and cross-lingual scenarios. Additionally, spectral-envelope-based MFCCs and a pseudo-singing dataset for ASR training are proposed to improve the system's performance in the singing context.

ZENODO