Linguistic unit discovery from multi-modal inputs in unwritten languages: Summary of the "Speaking Rosetta" JSALT 2017 Workshop
We summarize the accomplishments of a multi-disciplinary workshop exploring the computational and scientific issues surrounding the discovery of linguistic units (subwords and words) in a language without orthography. We study the replacement of orthographic transcriptions by images and/or translated text in a well-resourced language to help unsupervised discovery from raw speech.

Comment: Accepted to ICASSP 2018
Unsupervised intralingual and cross-lingual speaker adaptation for HMM-based speech synthesis using two-pass decision tree construction
Hidden Markov model (HMM)-based speech synthesis systems possess several advantages over concatenative synthesis systems. One such advantage is the relative ease with which HMM-based systems are adapted to speakers not present in the training dataset. Speaker adaptation methods used in the field of HMM-based automatic speech recognition (ASR) are adopted for this task. In the case of unsupervised speaker adaptation, previous work has used a supplementary set of acoustic models to estimate the transcription of the adaptation data. This paper firstly presents an approach to the unsupervised speaker adaptation task for HMM-based speech synthesis models which avoids the need for such supplementary acoustic models. This is achieved by defining a mapping between HMM-based synthesis models and ASR-style models, via a two-pass decision tree construction process. Secondly, it is shown that this mapping also enables unsupervised adaptation of HMM-based speech synthesis models without the need to perform linguistic analysis of the estimated transcription of the adaptation data. Thirdly, this paper demonstrates how this technique lends itself to the task of unsupervised cross-lingual adaptation of HMM-based speech synthesis models, and explains the advantages of such an approach. Finally, listener evaluations reveal that the proposed unsupervised adaptation methods deliver performance approaching that of supervised adaptation.
Text-only domain adaptation for end-to-end ASR using integrated text-to-mel-spectrogram generator
We propose an end-to-end Automatic Speech Recognition (ASR) system that can be trained on transcribed speech data, text-only data, or a mixture of both. The proposed model uses an integrated auxiliary block for text-based training. This block combines a non-autoregressive multi-speaker text-to-mel-spectrogram generator with a GAN-based enhancer to improve the spectrogram quality. The proposed system can generate a mel-spectrogram dynamically during training, and can be used to adapt the ASR model to a new domain by using text-only data from this domain. We demonstrate that the proposed training method significantly improves ASR accuracy compared to the system trained on transcribed speech only. It also surpasses cascaded TTS-plus-vocoder systems in adaptation quality and training speed.

Comment: Accepted to INTERSPEECH 2023
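The core training idea in the abstract above can be sketched as follows. This is a minimal, hypothetical illustration, not the paper's implementation: all function names and the toy loss are assumptions, and the real system's text-to-mel block is a trained non-autoregressive multi-speaker generator with a GAN-based enhancer rather than the stand-in used here.

```python
# Hypothetical sketch: mixed speech/text training where text-only batches
# get a mel-spectrogram synthesized on the fly by an integrated generator.

def text_to_mel(text, n_mels=80):
    """Stand-in for the non-autoregressive text-to-mel generator:
    maps each character to one dummy mel frame (illustrative only)."""
    return [[float(ord(c) % 7)] * n_mels for c in text]

def asr_loss(mel, transcript):
    """Stand-in for the ASR training loss (e.g. CTC/transducer):
    here just a toy value based on length mismatch."""
    return abs(len(mel) - len(transcript))

def training_step(batch):
    """One step: paired batches use the real spectrogram; text-only
    batches generate one dynamically, enabling text-only adaptation."""
    if batch.get("mel") is not None:
        mel = batch["mel"]                # transcribed speech data
    else:
        mel = text_to_mel(batch["text"])  # text-only domain data
    return asr_loss(mel, batch["text"])

# A text-only adaptation batch: no audio, spectrogram is generated.
loss = training_step({"mel": None, "text": "hello"})
```

Because the generator feeds mel-spectrograms directly into the ASR model during training, no external vocoder or audio synthesis pass is needed, which is the source of the speed advantage over cascaded TTS systems claimed above.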