Utterance Augmentation for Speaker Recognition
The speaker recognition problem is to automatically recognize a person from their voice. Training a speaker recognition model typically requires a very large corpus, e.g., multiple voice samples from a very large number of individuals. In the diverse application domains of speaker recognition, it is often impractical to obtain a training corpus of the requisite size. This disclosure describes techniques that augment utterances, e.g., by cutting, splitting, shuffling, etc., such that the need for collections of raw voice samples from individuals is substantially reduced. In effect, the original model performs better on the target domain when trained on the augmented utterances.
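The cut/split/shuffle operations mentioned above can be sketched in a few lines. This is a minimal illustration, not the disclosed method: the chunk count and the policy of dropping one chunk are assumptions made for the example.

```python
import numpy as np

def augment_utterance(wave, n_chunks=4, rng=None):
    """Split a 1-D waveform into chunks, shuffle their order, and drop
    one chunk -- a rough sketch of cut/split/shuffle augmentation.
    (n_chunks and the drop-one policy are illustrative assumptions.)"""
    rng = rng or np.random.default_rng()
    chunks = np.array_split(wave, n_chunks)        # "splitting"
    order = rng.permutation(n_chunks)              # "shuffling"
    keep = [chunks[i] for i in order[1:]]          # "cutting": drop one chunk
    return np.concatenate(keep)

wave = np.arange(16000, dtype=np.float32)  # 1 s of dummy audio at 16 kHz
aug = augment_utterance(wave, n_chunks=4)
print(aug.shape)  # (12000,): three of the four 4000-sample chunks remain
```

Each pass over the corpus can yield a different augmented variant of the same utterance, multiplying the effective number of training samples per speaker.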
Automated Conversion of Impaired Speech in Communication Applications
Voice communication can be difficult for those with impaired or accented speech. When such users communicate with others via applications on their devices, listeners often find it difficult to understand them. This disclosure describes techniques that, with the user's permission, dynamically process impaired or accented speech and convert it to synthesized canonical speech. Generation of the synthesized speech is performed with low latency as the user speaks, enabling the parties to engage in smooth communication that is unaffected by the speaker's speech impairment. Listeners receive clear, fluent speech automatically generated by suitably trained models. In addition, users can personalize the operation based on their specific speech impairments. The techniques can be integrated within any messaging, conferencing, or phone calling/dialer application on any device, making the applications more accessible to users with impaired speech and enhancing the user experience.
Massive End-to-end Models for Short Search Queries
In this work, we investigate two popular end-to-end automatic speech recognition (ASR) models, Connectionist Temporal Classification (CTC) and RNN-Transducer (RNN-T), for offline recognition of voice search queries, with up to 2B model parameters. The encoders of our models use the neural architecture of Google's universal speech model (USM), with additional funnel pooling layers to significantly reduce the frame rate and speed up training and inference. We perform extensive studies on vocabulary size, time-reduction strategy, and generalization performance on long-form test sets. Despite speculation that, as model size increases, CTC can be as good as RNN-T, which builds label dependency into its predictions, we observe that a 900M-parameter RNN-T clearly outperforms a 1.8B-parameter CTC model and is more tolerant of severe time reduction, although the WER gap can be largely removed by LM shallow fusion.
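The core of the time-reduction idea is pooling encoder frames along the time axis so downstream layers process fewer frames. The sketch below shows only that pooling step with strided averaging; in USM-style encoders the pooling is interleaved with attention blocks, and the stride, feature dimension, and pooling operator here are illustrative assumptions.

```python
import numpy as np

def funnel_pool(frames, stride=2):
    """Average-pool a (T, D) sequence of encoder frames along time with
    the given stride, reducing the frame rate by that factor.
    A minimal sketch of time reduction, not the full funnel architecture."""
    T, D = frames.shape
    T_trim = (T // stride) * stride  # drop any ragged tail frames
    return frames[:T_trim].reshape(T_trim // stride, stride, D).mean(axis=1)

frames = np.random.randn(100, 8).astype(np.float32)  # 100 frames, 8-dim features
pooled = funnel_pool(frames, stride=2)
print(pooled.shape)  # (50, 8): frame rate halved
```

Halving the frame rate roughly halves the sequence length seen by every subsequent layer, which is where the training and inference speedups come from; the paper's observation is that RNN-T degrades less than CTC when this reduction becomes severe.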