
    Transfer Learning for Speech and Language Processing

    Transfer learning is a vital technique that generalizes models trained for one setting or task to other settings or tasks. For example, in speech recognition, an acoustic model trained for one language can be used to recognize speech in another language with little or no re-training data. Transfer learning is closely related to multi-task learning (cross-lingual vs. multilingual) and has traditionally been studied under the name of 'model adaptation'. Recent advances in deep learning show that transfer learning becomes much easier and more effective with the high-level abstract features learned by deep models, and that the 'transfer' can be conducted not only between data distributions and data types, but also between model structures (e.g., shallow nets and deep nets) or even model types (e.g., Bayesian models and neural models). This review paper summarizes some recent prominent research in this direction, particularly for speech and language processing. We also report some results from our group and highlight the potential of this very interesting research field. Comment: 13 pages, APSIPA 201

    Transfer learning of language-independent end-to-end ASR with language model fusion

    This work explores better methods for adapting to low-resource languages using an external language model (LM) under the framework of transfer learning. We first build a language-independent ASR system in a unified sequence-to-sequence (S2S) architecture with a vocabulary shared among all languages. During adaptation, we perform LM fusion transfer, where an external LM is integrated into the decoder network of the attention-based S2S model throughout the adaptation stage, to effectively incorporate the linguistic context of the target language. We also investigate various seed models for transfer learning. Experimental evaluations using the IARPA BABEL data set show that LM fusion transfer improves performance on all five target languages compared with simple transfer learning when external text data is available. Our final system drastically reduces the performance gap from the hybrid systems. Comment: Accepted at ICASSP201

    Statistical text-to-speech synthesis of Spanish subtitles

    The final publication is available at Springer via http://dx.doi.org/10.1007/978-3-319-13623-3_5
    Online multimedia repositories are growing rapidly. However, language barriers are often difficult to overcome for many of the current and potential users. In this paper we describe a Spanish TTS system and we apply it to the synthesis of transcribed and translated video lectures. A statistical parametric speech synthesis system, in which the acoustic mapping is performed with either HMM-based or DNN-based acoustic models, has been developed. To the best of our knowledge, this is the first time that a DNN-based TTS system has been implemented for the synthesis of Spanish. A comparative objective evaluation of the two models has been carried out. Our results show that DNN-based systems can reconstruct speech waveforms more accurately.
    The research leading to these results has received funding from the European Union Seventh Framework Programme (FP7/2007-2013) under grant agreement no 287755 (transLectures) and ICT Policy Support Programme (ICT PSP/2007-2013) as part of the Competitiveness and Innovation Framework Programme (CIP) under grant agreement no 621030 (EMMA), and the Spanish MINECO Active2Trans (TIN2012-31723) research project.
    Piqueras Gozalbes, SR.; Del Agua Teba, MA.; Giménez Pastor, A.; Civera Saiz, J.; Juan Císcar, A. (2014). Statistical text-to-speech synthesis of Spanish subtitles. In Advances in Speech and Language Technologies for Iberian Languages: Second International Conference, IberSPEECH 2014, Las Palmas de Gran Canaria, Spain, November 19-21, 2014. Proceedings. Springer International Publishing. 40-48. https://doi.org/10.1007/978-3-319-13623-3_5

    Low-resource speech translation

    We explore the task of speech-to-text translation (ST), where speech in one language (source) is converted to text in a different one (target). Traditional ST systems go through an intermediate step where the source language speech is first converted to source language text using an automatic speech recognition (ASR) system, which is then converted to target language text using a machine translation (MT) system. However, this pipeline-based approach is impractical for unwritten languages spoken by millions of people around the world, leaving them without access to free and automated translation services such as Google Translate. The lack of such translation services can have important real-world consequences. For example, in the aftermath of a disaster, easily available translation services can help better co-ordinate relief efforts. How can we expand the coverage of automated ST systems to include scenarios which lack source language text? In this thesis we investigate one possible solution: we build ST systems to directly translate source language speech into target language text, thereby forgoing the dependency on source language text. To build such a system, we use only speech data paired with text translations as training data. We also specifically focus on low-resource settings, where we expect at most tens of hours of training data to be available for unwritten or endangered languages. Our work can be broadly divided into three parts. First we explore how we can leverage prior work to build ST systems. We find that neural sequence-to-sequence models are an effective and convenient method for ST, but produce poor quality translations when trained in low-resource settings. In the second part of this thesis, we explore methods to improve the translation performance of our neural ST systems which do not require labeling additional speech data in the low-resource language, a potentially tedious and expensive process. Instead we exploit labeled speech data for high-resource languages, which is widely available and relatively easy to obtain. We show that pretraining a neural model with ASR data from a high-resource language, different from both the source and target ST languages, improves ST performance. In the final part of our thesis, we study whether ST systems can be used to build applications which have traditionally relied on the availability of ASR systems, such as information retrieval, clustering audio documents, or question answering. We build proof-of-concept systems for two downstream applications: topic prediction for speech and cross-lingual keyword spotting. Our results indicate that low-resource ST systems can still outperform simple baselines for these tasks, leaving the door open for further exploratory work. This thesis provides, for the first time, an in-depth study of neural models for the task of direct ST across a range of training data settings on a realistic multi-speaker speech corpus. Our contributions include a set of open-source tools to encourage further research.