10 research outputs found

    Non-native children speech recognition through transfer learning

    This work deals with non-native children's speech and investigates both multi-task and transfer learning approaches to adapt a multilingual Deep Neural Network (DNN) to speakers, specifically children, learning a foreign language. The application scenario is characterized by young students learning English and German and reading sentences in these second languages, as well as in their mother tongue. The paper analyzes and discusses techniques for training effective DNN-based acoustic models starting from children's native speech and performing adaptation with limited non-native audio material. A multilingual model is adopted as the baseline, in which a common phonetic lexicon, defined in terms of the units of the International Phonetic Alphabet (IPA), is shared across the three languages at hand (Italian, German and English); DNN adaptation methods based on transfer learning are then evaluated on significant non-native evaluation sets. Results show that the resulting non-native models yield a significant improvement over a monolingual system adapted to speakers of the target language.
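    A minimal sketch of the kind of transfer-learning adaptation the abstract describes is given below. It is not the authors' code: the layer sizes, the IPA inventory size, the checkpoint name and the choice to freeze the lower layers are all assumptions used only to illustrate fine-tuning a pretrained multilingual acoustic model on limited non-native data.

```python
# Hedged sketch: adapt a multilingual DNN acoustic model (shared IPA units)
# to non-native children's speech by freezing lower layers and fine-tuning
# the rest with a small learning rate. Sizes and paths are assumptions.
import torch
import torch.nn as nn

NUM_IPA_UNITS = 120          # assumed size of the shared IPA phone set
FEATURE_DIM = 40             # assumed acoustic feature dimension

class AcousticDNN(nn.Module):
    """Feed-forward acoustic model mapping frames to IPA unit posteriors."""
    def __init__(self, feat_dim=FEATURE_DIM, hidden=512, num_units=NUM_IPA_UNITS):
        super().__init__()
        self.hidden_layers = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.output = nn.Linear(hidden, num_units)

    def forward(self, x):
        return self.output(self.hidden_layers(x))

# 1) Start from a model trained on native children's speech in the three
#    languages (the checkpoint path is hypothetical).
model = AcousticDNN()
# model.load_state_dict(torch.load("multilingual_native_children.pt"))

# 2) Freeze the lower hidden layers; adapt only the upper part on the
#    limited non-native audio.
for param in model.hidden_layers[:4].parameters():   # first two Linear+ReLU pairs
    param.requires_grad = False

optimizer = torch.optim.SGD(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4)
criterion = nn.CrossEntropyLoss()

def adaptation_step(features, targets):
    """One fine-tuning step on a batch of non-native frames and IPA labels."""
    optimizer.zero_grad()
    loss = criterion(model(features), targets)
    loss.backward()
    optimizer.step()
    return loss.item()

# Dummy batch standing in for real non-native adaptation data.
frames = torch.randn(32, FEATURE_DIM)
labels = torch.randint(0, NUM_IPA_UNITS, (32,))
print(adaptation_step(frames, labels))
```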

    Multi-task Learning for Speaker Verification and Voice Trigger Detection

    Automatic speech transcription and speaker recognition are usually treated as separate tasks even though they are interdependent. In this study, we investigate training a single network to perform both tasks jointly. We train the network in a supervised multi-task learning setup, where the speech transcription branch is trained to minimise a phonetic connectionist temporal classification (CTC) loss while the speaker recognition branch is trained to label the input sequence with the correct speaker label. We present a large-scale empirical study in which the model is trained using several thousand hours of labelled training data for each task. We evaluate the speech transcription branch on a voice trigger detection task, while the speaker recognition branch is evaluated on a speaker verification task. Results demonstrate that the network is able to encode both phonetic and speaker information in its learnt representations while yielding accuracies at least as good as the baseline models for each task, with the same number of parameters as the independent models.
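    The sketch below illustrates the multi-task setup the abstract outlines: a shared encoder feeding a phonetic branch trained with a CTC loss and a speaker branch trained with cross-entropy. It is an illustration rather than the paper's model; the encoder type, layer sizes, phone and speaker inventories, and the equal loss weighting are all assumptions.

```python
# Hedged sketch of joint phonetic (CTC) + speaker (cross-entropy) training.
# All sizes, the LSTM encoder and the 1:1 loss weighting are assumptions.
import torch
import torch.nn as nn

FEAT_DIM, HIDDEN = 40, 256
NUM_PHONES = 50        # assumed phone inventory (CTC blank added below)
NUM_SPEAKERS = 1000    # assumed number of training speakers

class JointNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.LSTM(FEAT_DIM, HIDDEN, num_layers=2, batch_first=True)
        self.phone_head = nn.Linear(HIDDEN, NUM_PHONES + 1)   # +1 = CTC blank
        self.speaker_head = nn.Linear(HIDDEN, NUM_SPEAKERS)

    def forward(self, feats):
        enc, _ = self.encoder(feats)                     # (B, T, H)
        phone_logits = self.phone_head(enc)              # per-frame phone scores
        spk_logits = self.speaker_head(enc.mean(dim=1))  # utterance-level pooling
        return phone_logits, spk_logits

model = JointNet()
ctc_loss = nn.CTCLoss(blank=NUM_PHONES)
ce_loss = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Dummy batch: 4 utterances of 100 frames, phone targets of length 20.
feats = torch.randn(4, 100, FEAT_DIM)
phone_targets = torch.randint(0, NUM_PHONES, (4, 20))
speaker_ids = torch.randint(0, NUM_SPEAKERS, (4,))
input_lens = torch.full((4,), 100, dtype=torch.long)
target_lens = torch.full((4,), 20, dtype=torch.long)

phone_logits, spk_logits = model(feats)
log_probs = phone_logits.log_softmax(-1).transpose(0, 1)   # CTC expects (T, B, C)
loss = ctc_loss(log_probs, phone_targets, input_lens, target_lens) \
       + ce_loss(spk_logits, speaker_ids)    # equal weighting is an assumption

optimizer.zero_grad()
loss.backward()
optimizer.step()
print(float(loss))
```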

    Deep learning methods in speaker recognition: a review

    This paper summarizes deep learning practices applied in the field of speaker recognition, covering both verification and identification. Speaker recognition has long been a widely studied topic in speech technology. Much research has been carried out, yet little progress was achieved in the past 5-6 years. However, as deep learning techniques advance in most machine learning fields, they are replacing the former state-of-the-art methods in speaker recognition too. DL now appears to be the state-of-the-art solution for both speaker verification and identification. Standard x-vectors, in addition to i-vectors, are used as baselines in most of the recent works. The increasing amount of gathered data opens up the territory to DL, where it is most effective.
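    As a reference point for the x-vector baseline the review mentions, a compact sketch of an x-vector-style embedding extractor is shown below. It is not taken from the reviewed paper; the layer widths, feature dimension and speaker count are assumed, and only the general idea (frame-level layers, statistics pooling over time, a fixed-size embedding) follows the standard recipe.

```python
# Hedged sketch of an x-vector-style speaker embedding network:
# frame-level 1-D convolutions (TDNN), mean+std statistics pooling,
# then a fixed-size embedding and a speaker classifier. Sizes assumed.
import torch
import torch.nn as nn

class XVectorNet(nn.Module):
    def __init__(self, feat_dim=24, embed_dim=512, num_speakers=5000):
        super().__init__()
        self.frame_layers = nn.Sequential(
            nn.Conv1d(feat_dim, 512, kernel_size=5, dilation=1), nn.ReLU(),
            nn.Conv1d(512, 512, kernel_size=3, dilation=2), nn.ReLU(),
            nn.Conv1d(512, 1500, kernel_size=1), nn.ReLU(),
        )
        self.segment = nn.Linear(2 * 1500, embed_dim)   # after mean+std pooling
        self.classifier = nn.Linear(embed_dim, num_speakers)

    def forward(self, feats):             # feats: (batch, feat_dim, frames)
        h = self.frame_layers(feats)
        stats = torch.cat([h.mean(dim=2), h.std(dim=2)], dim=1)  # statistics pooling
        embedding = self.segment(stats)   # the "x-vector" used for verification
        return embedding, self.classifier(embedding)

model = XVectorNet()
emb, logits = model(torch.randn(8, 24, 300))   # 8 utterances, 300 frames each
print(emb.shape, logits.shape)                 # -> (8, 512) and (8, 5000)
```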