Utterance Augmentation for Speaker Recognition
The speaker recognition problem is to automatically recognize a person from their voice. Training a speaker recognition model typically requires a very large corpus, e.g., multiple voice samples from a very large number of individuals. In the diverse application domains of speaker recognition, it is often impractical to obtain a training corpus of the requisite size. This disclosure describes techniques that augment utterances, e.g., by cutting, splitting, shuffling, etc., such that the need for collections of raw voice samples from individuals is substantially reduced. In effect, the original model performs better on the target domain when trained on the augmented utterances.
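The cut/split/shuffle operations mentioned above can be sketched in a few lines. This is a minimal illustration, not the disclosed method: the chunk count and the policy of dropping one chunk are assumptions made for the example.

```python
import numpy as np

def augment_utterance(wave, n_chunks=4, rng=None):
    """Split a 1-D waveform into chunks, shuffle their order, and drop
    one chunk -- a rough sketch of cut/split/shuffle augmentation.
    (n_chunks and the drop-one policy are illustrative assumptions.)"""
    rng = rng or np.random.default_rng()
    chunks = np.array_split(wave, n_chunks)        # "splitting"
    order = rng.permutation(n_chunks)              # "shuffling"
    keep = [chunks[i] for i in order[1:]]          # "cutting": drop one chunk
    return np.concatenate(keep)

wave = np.arange(16000, dtype=np.float32)  # 1 s of dummy audio at 16 kHz
aug = augment_utterance(wave, n_chunks=4)
print(aug.shape)  # (12000,): three of the four 4000-sample chunks remain
```

Each pass over the corpus can yield a different augmented variant of the same utterance, multiplying the effective number of training samples per speaker.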
Automated Conversion of Impaired Speech in Communication Applications
Voice communication can be difficult for those with impaired or accented speech. When such users communicate with others via applications on their devices, listeners often find it difficult to understand them. This disclosure describes techniques that, with the user's permission, dynamically process impaired or accented speech and convert it to synthesized canonical speech. Generation of the synthesized speech is performed with low latency as the user speaks, enabling the parties to engage in smooth communication that is unaffected by the speaker's speech impairment. Listeners receive clear, fluent speech automatically generated by suitably trained models. In addition, users can personalize the operation based on their specific speech impairments. The techniques can be integrated within any messaging, conferencing, or phone calling/dialer application on any device, making the applications more accessible to users with impaired speech and enhancing the user experience.
Massive End-to-end Models for Short Search Queries
In this work, we investigate two popular end-to-end automatic speech recognition (ASR) models, Connectionist Temporal Classification (CTC) and RNN-Transducer (RNN-T), for offline recognition of voice search queries, with up to 2B model parameters. The encoders of our models use the neural architecture of Google's universal speech model (USM), with additional funnel pooling layers to significantly reduce the frame rate and speed up training and inference. We perform extensive studies on vocabulary size, time-reduction strategy, and generalization performance on long-form test sets. Despite speculation that, as model size increases, CTC can be as good as RNN-T, which builds label dependency into its predictions, we observe that a 900M-parameter RNN-T clearly outperforms a 1.8B-parameter CTC model and is more tolerant of severe time reduction, although the WER gap can be largely removed by LM shallow fusion.
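The core of the time-reduction idea is pooling encoder frames along the time axis so downstream layers process fewer frames. The sketch below shows only that pooling step with strided averaging; in USM-style encoders the pooling is interleaved with attention blocks, and the stride, feature dimension, and pooling operator here are illustrative assumptions.

```python
import numpy as np

def funnel_pool(frames, stride=2):
    """Average-pool a (T, D) sequence of encoder frames along time with
    the given stride, reducing the frame rate by that factor.
    A minimal sketch of time reduction, not the full funnel architecture."""
    T, D = frames.shape
    T_trim = (T // stride) * stride  # drop any ragged tail frames
    return frames[:T_trim].reshape(T_trim // stride, stride, D).mean(axis=1)

frames = np.random.randn(100, 8).astype(np.float32)  # 100 frames, 8-dim features
pooled = funnel_pool(frames, stride=2)
print(pooled.shape)  # (50, 8): frame rate halved
```

Halving the frame rate roughly halves the sequence length seen by every subsequent layer, which is where the training and inference speedups come from; the paper's observation is that RNN-T degrades less than CTC when this reduction becomes severe.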