113,666 research outputs found

    Effectiveness of self-supervised pre-training for speech recognition

    Full text link
    We compare self-supervised representation learning algorithms which either explicitly quantize the audio data or learn representations without quantization. We find the former to be more accurate, since it builds a good vocabulary of the data through vq-wav2vec [1] to enable learning of effective representations in subsequent BERT training. Unlike previous work, we directly fine-tune the pre-trained BERT models on transcribed speech using a Connectionist Temporal Classification (CTC) loss instead of feeding the representations into a task-specific model. We also propose a BERT-style model that learns directly from the continuous audio data, and compare pre-training on raw audio to spectral features. Fine-tuning a BERT model on 10 hours of labeled Librispeech data with a vq-wav2vec vocabulary is almost as good as the best known reported system trained on 100 hours of labeled data on test-clean, while achieving a 25% WER reduction on test-other. When using only 10 minutes of labeled data, WER is 25.2 on test-other and 16.3 on test-clean. This demonstrates that self-supervision can enable speech recognition systems trained on a near-zero amount of transcribed data.
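
    As a rough illustration of the fine-tuning step described above (not the authors' implementation), the sketch below puts a linear output layer on top of a pre-trained encoder and trains it with a CTC loss in PyTorch; the encoder stand-in, dimensions, and vocabulary size are all assumptions.

```python
# Hedged sketch of CTC fine-tuning on top of a pre-trained encoder.
# `nn.Identity()` and the vocabulary size are placeholders, not the
# actual vq-wav2vec/BERT components from the paper.
import torch
import torch.nn as nn

class CTCFineTuner(nn.Module):
    def __init__(self, encoder: nn.Module, hidden_dim: int, num_chars: int):
        super().__init__()
        self.encoder = encoder                              # pre-trained, e.g. a BERT-style model
        self.head = nn.Linear(hidden_dim, num_chars + 1)    # +1 for the CTC blank symbol

    def forward(self, features):                            # features: (batch, time, hidden_dim)
        hidden = self.encoder(features)                     # contextual representations
        return self.head(hidden).log_softmax(-1)            # (batch, time, num_chars + 1)

# Toy usage with a stand-in encoder (identity) just to show the training step.
batch, time, hidden_dim, num_chars = 2, 50, 256, 31
model = CTCFineTuner(nn.Identity(), hidden_dim, num_chars)
ctc = nn.CTCLoss(blank=num_chars, zero_infinity=True)

feats = torch.randn(batch, time, hidden_dim)
targets = torch.randint(0, num_chars, (batch, 12))          # dummy character ids
log_probs = model(feats).transpose(0, 1)                    # CTCLoss expects (time, batch, classes)
loss = ctc(log_probs, targets,
           torch.full((batch,), time, dtype=torch.long),    # input lengths
           torch.full((batch,), 12, dtype=torch.long))      # target lengths
loss.backward()
```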

    SignsWorld; Deeping Into the Silence World and Hearing Its Signs (State of the Art)

    Full text link
    Automatic speech processing systems are employed more and more often in real environments. Although the underlying speech technology is mostly language independent, differences between languages with respect to their structure and grammar have a substantial effect on recognition system performance. In this paper, we present a review of the latest developments in sign language recognition research in general and in Arabic sign language (ArSL) in particular. This paper also presents a general framework, called SignsWorld, for improving communication between the deaf community and hearing people. The overall goal of the SignsWorld project is to develop a vision-based technology for recognizing and translating continuous Arabic sign language (ArSL). Comment: 20 pages; a state-of-the-art paper, so it contains many references

    Spatial Concept Acquisition for a Mobile Robot that Integrates Self-Localization and Unsupervised Word Discovery from Spoken Sentences

    Full text link
    In this paper, we propose a novel unsupervised learning method for the lexical acquisition of words related to places visited by robots, from continuous human speech signals. We address the problem of learning novel words by a robot that has no prior knowledge of these words except for a primitive acoustic model. Further, we propose a method that allows a robot to effectively use the learned words and their meanings for self-localization tasks. The proposed method is a nonparametric Bayesian spatial concept acquisition method (SpCoA) that integrates a generative model for self-localization and unsupervised word segmentation of uttered sentences via latent variables related to the spatial concept. We implemented the proposed method SpCoA on SIGVerse, which is a simulation environment, and TurtleBot2, which is a mobile robot in a real environment. Further, we conducted experiments to evaluate the performance of SpCoA. The experimental results showed that SpCoA enabled the robot to acquire the names of places from spoken sentences. They also revealed that the robot could effectively utilize the acquired spatial concepts and reduce the uncertainty in self-localization. Comment: This paper was accepted by the IEEE Transactions on Cognitive and Developmental Systems. (04-May-2016)
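
    The intuition behind coupling word meanings with self-localization can be sketched with a toy particle re-weighting step: particles whose positions agree with the spatial concept evoked by a heard place word gain weight. The Gaussian concepts, word distributions, and all numbers below are illustrative assumptions, not SpCoA's learned nonparametric model.

```python
# Illustrative toy of re-weighting localization particles with a heard place word.
# The Gaussian spatial concepts and word distributions are made-up stand-ins.
import numpy as np

rng = np.random.default_rng(0)

# Two hypothetical spatial concepts, each with a position distribution and a word distribution.
concepts = [
    {"mean": np.array([1.0, 1.0]), "cov": 0.2 * np.eye(2), "words": {"kitchen": 0.9, "hall": 0.1}},
    {"mean": np.array([4.0, 3.0]), "cov": 0.3 * np.eye(2), "words": {"kitchen": 0.05, "hall": 0.95}},
]

def gaussian_pdf(x, mean, cov):
    d = x - mean
    norm = 1.0 / (2 * np.pi * np.sqrt(np.linalg.det(cov)))
    return norm * np.exp(-0.5 * d @ np.linalg.inv(cov) @ d)

particles = rng.uniform(0, 5, size=(500, 2))          # candidate robot positions
weights = np.ones(len(particles)) / len(particles)

heard_word = "kitchen"                                 # word segmented from an utterance
likelihood = np.array([
    sum(c["words"].get(heard_word, 1e-6) * gaussian_pdf(p, c["mean"], c["cov"]) for c in concepts)
    for p in particles
])
weights *= likelihood
weights /= weights.sum()

print("weighted mean position:", particles.T @ weights)  # pulled toward the 'kitchen' concept
```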

    Likelihood-based semi-supervised model selection with applications to speech processing

    Full text link
    In conventional supervised pattern recognition tasks, model selection is typically accomplished by minimizing the classification error rate on a set of so-called development data, subject to ground-truth labeling by human experts or some other means. In the context of speech processing systems and other large-scale practical applications, however, such labeled development data are typically costly and difficult to obtain. This article proposes an alternative semi-supervised framework for likelihood-based model selection that leverages unlabeled data by using trained classifiers representing each model to automatically generate putative labels. The errors that result from this automatic labeling are shown to be amenable to results from robust statistics, which in turn provide for minimax-optimal censored likelihood ratio tests that recover the nonparametric sign test as a limiting case. This approach is then validated experimentally using a state-of-the-art automatic speech recognition system to select between candidate word pronunciations using unlabeled speech data that only potentially contain instances of the words under test. Results provide supporting evidence for the utility of this approach, and suggest that it may also find use in other applications of machine learning. Comment: 11 pages, 2 figures; submitted for publication
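
    A minimal sketch of the limiting case mentioned above: once a trained recognizer has assigned putative labels to unlabeled utterances, two candidate models (say, two pronunciations of a word) can be compared with a nonparametric sign test over per-utterance score differences. The scores are simulated here, and SciPy's binomtest is used only as a convenient stand-in for the test statistic.

```python
# Toy sign test for choosing between two candidate models using putative labels.
# loglik_a / loglik_b stand in for per-utterance log-likelihoods of two candidate
# pronunciations scored against automatically labeled speech.
import numpy as np
from scipy.stats import binomtest

rng = np.random.default_rng(1)
n = 200
loglik_a = rng.normal(-50.0, 5.0, n)            # hypothetical scores for pronunciation A
loglik_b = loglik_a - rng.normal(0.5, 2.0, n)   # B is slightly worse on average

diffs = loglik_a - loglik_b
wins_a = int(np.sum(diffs > 0))                 # utterances where A scores higher
trials = int(np.sum(diffs != 0))                # drop exact ties

# Sign test: under H0 (models equally good), wins follow Binomial(trials, 0.5).
result = binomtest(wins_a, trials, p=0.5, alternative="greater")
print(f"A preferred on {wins_a}/{trials} utterances, p-value = {result.pvalue:.3g}")
if result.pvalue < 0.05:
    print("select pronunciation A")
```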

    End-to-End Speech Translation with Knowledge Distillation

    Full text link
    End-to-end speech translation (ST), which directly translates source language speech into target language text, has attracted intensive attention in recent years. Compared to conventional pipeline systems, end-to-end ST models have the advantages of lower latency, smaller model size, and less error propagation. However, combining speech recognition and text translation in one model is more difficult than either of these two tasks alone. In this paper, we propose a knowledge distillation approach to improve the ST model by transferring knowledge from a text translation model. Specifically, we first train a text translation model, regarded as the teacher model, and then the ST model is trained to learn the output probabilities of the teacher model through knowledge distillation. Experiments on the English-French Augmented LibriSpeech and English-Chinese TED corpora show that end-to-end ST can be implemented on both similar and dissimilar language pairs. In addition, with the guidance of the teacher model, the end-to-end ST model gains significant improvements of over 3.5 BLEU points. Comment: Submitted to Interspeech 201
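
    A compact sketch of the distillation objective described above, assuming PyTorch: the student ST model's token distribution is pulled toward the teacher MT model's distribution via a temperature-scaled KL term, mixed with the usual cross-entropy on reference tokens. The mixing weight, temperature, and tensor shapes are illustrative assumptions.

```python
# Hedged sketch of a knowledge-distillation objective for end-to-end ST:
# the student's token distribution is pulled toward the text-translation
# teacher's distribution, mixed with standard cross-entropy on references.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, targets,
                      temperature=2.0, alpha=0.8, pad_id=0):
    # student_logits, teacher_logits: (batch, seq_len, vocab); targets: (batch, seq_len)
    t = temperature
    kd = F.kl_div(
        F.log_softmax(student_logits / t, dim=-1),
        F.softmax(teacher_logits / t, dim=-1),
        reduction="batchmean",
    ) * (t * t)
    ce = F.cross_entropy(
        student_logits.reshape(-1, student_logits.size(-1)),
        targets.reshape(-1),
        ignore_index=pad_id,
    )
    return alpha * kd + (1.0 - alpha) * ce

# Toy usage with random tensors standing in for model outputs.
batch, seq_len, vocab = 4, 10, 1000
student = torch.randn(batch, seq_len, vocab, requires_grad=True)
teacher = torch.randn(batch, seq_len, vocab)          # frozen MT teacher outputs
refs = torch.randint(1, vocab, (batch, seq_len))
loss = distillation_loss(student, teacher, refs)
loss.backward()
```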

    Neural Polysynthetic Language Modelling

    Full text link
    Research in natural language processing commonly assumes that approaches that work well for English and other widely-used languages are "language agnostic". In high-resource languages, especially those that are analytic, a common approach is to treat morphologically-distinct variants of a common root as completely independent word types. This assumes that there are limited morphological inflections per root, and that the majority will appear in a large enough corpus, so that the model can adequately learn statistics about each form. Approaches like stemming, lemmatization, or subword segmentation are often used when either of those assumptions does not hold, particularly in the case of synthetic languages like Spanish or Russian that have more inflection than English. In the literature, languages like Finnish or Turkish are held up as extreme examples of complexity that challenge common modelling assumptions. Yet, when considering all of the world's languages, Finnish and Turkish are closer to the average case. When we consider polysynthetic languages (those at the extreme of morphological complexity), approaches like stemming, lemmatization, or subword modelling may not suffice. These languages have very high numbers of hapax legomena, showing the need for appropriate morphological handling of words, without which it is not possible for a model to capture enough word statistics. We examine the current state of the art in language modelling, machine translation, and text prediction for four polysynthetic languages: Guaraní, St. Lawrence Island Yupik, Central Alaskan Yupik, and Inuktitut. We then propose a novel framework for language modelling that combines knowledge representations from finite-state morphological analyzers with Tensor Product Representations in order to enable neural language models capable of handling the full range of typologically variant languages.
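
    The Tensor Product Representation component can be illustrated in isolation: each morpheme "filler" vector is bound to a "role" vector by an outer product, and the bindings are summed into one representation from which fillers can be recovered when the roles are orthonormal. The vectors below are random stand-ins, not outputs of the paper's morphological analyzers.

```python
# Toy Tensor Product Representation (TPR): bind morpheme "filler" vectors to
# "role" vectors with outer products and sum the bindings. With orthonormal
# roles, each filler can be unbound again by multiplying with its role.
import numpy as np

rng = np.random.default_rng(0)
d_filler, d_role, n_slots = 8, 4, 3

fillers = {m: rng.normal(size=d_filler) for m in ["root", "aspect", "person"]}
roles = np.linalg.qr(rng.normal(size=(d_role, n_slots)))[0].T   # (n_slots, d_role), orthonormal rows

# Bind a hypothetical analyzed word: root + aspect + person morphemes.
sequence = ["root", "aspect", "person"]
tpr = sum(np.outer(fillers[m], roles[i]) for i, m in enumerate(sequence))   # (d_filler, d_role)

# Unbind slot 1: recover the filler bound to the 'aspect' role.
recovered = tpr @ roles[1]
print("recovered aspect filler?", np.allclose(recovered, fillers["aspect"], atol=1e-8))
```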

    AISHELL-2: Transforming Mandarin ASR Research Into Industrial Scale

    Full text link
    AISHELL-1 is by far the largest open-source speech corpus available for Mandarin speech recognition research. It was released with a baseline system containing solid training and testing pipelines for Mandarin ASR. In AISHELL-2, 1000 hours of clean read-speech data recorded on iOS devices are published, free for academic use. On top of the AISHELL-2 corpus, an improved recipe is developed and released, containing key components for industrial applications, such as Chinese word segmentation, flexible vocabulary expansion, and phone set transformation. The pipelines support various state-of-the-art techniques, such as time-delay neural networks and the Lattice-Free MMI objective function. In addition, we also release dev and test data from other channels (Android and Mic). For the research community, we hope that the AISHELL-2 corpus can be a solid resource for topics like transfer learning and robust ASR. For industry, we hope the AISHELL-2 recipe can be a helpful reference for building meaningful industrial systems and products.
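
    One of the recipe components named above, Chinese word segmentation, can be illustrated with a toy greedy forward maximum-matching pass over a small lexicon; this is only an illustrative stand-in, not the segmenter actually used in the AISHELL-2 recipe.

```python
# Toy forward maximum-matching word segmentation against a small lexicon.
# Real recipes typically use trained segmenters; this only illustrates the step.
def max_match_segment(text, lexicon, max_word_len=4):
    words, i = [], 0
    while i < len(text):
        for length in range(min(max_word_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in lexicon:
                words.append(candidate)      # single characters are the fallback
                i += length
                break
    return words

lexicon = {"语音", "识别", "开源", "数据"}
print(max_match_segment("开源语音识别数据", lexicon))
# -> ['开源', '语音', '识别', '数据']
```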

    Unsupervised Discovery of Linguistic Structure Including Two-level Acoustic Patterns Using Three Cascaded Stages of Iterative Optimization

    Full text link
    Techniques for unsupervised discovery of acoustic patterns are becoming increasingly attractive, because huge quantities of speech data are becoming available while manual annotations remain hard to acquire. In this paper, we propose an approach for unsupervised discovery of the linguistic structure of a target spoken language given raw speech data. This linguistic structure includes two-level (subword-like and word-like) acoustic patterns, the lexicon of word-like patterns in terms of subword-like patterns, and an N-gram language model over word-like patterns. All patterns, models, and parameters can be automatically learned from the unlabelled speech corpus. This is achieved by an initialization step followed by three cascaded stages of acoustic, linguistic, and lexical iterative optimization. The lexicon of word-like patterns defines the allowed consecutive sequences of HMMs for subword-like patterns. In each iteration, model training and decoding produce updated labels from which the lexicon and HMMs can be further updated. In this way, model parameters and decoded labels are respectively optimized in each iteration, and knowledge about the linguistic structure is learned gradually, layer after layer. The proposed approach was tested in preliminary experiments on a corpus of Mandarin broadcast news, including a spoken term detection task whose performance was compared to a parallel test using models trained in a supervised way. Results show that the proposed system not only yields reasonable performance on its own, but is also complementary to existing large-vocabulary ASR systems. Comment: Accepted by ICASSP 201
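
    A heavily simplified sketch of the iterative idea: re-segment the corpus with the current lexicon, grow the lexicon of word-like patterns from frequent adjacent subword-like patterns, rebuild bigram counts, and repeat. Real acoustic/HMM training is omitted; the label strings, greedy segmentation, and thresholds below are illustrative assumptions.

```python
# Toy illustration of alternating segmentation and lexicon/LM updates over
# decoded subword-like labels. Real HMM training and decoding are replaced by
# fixed label strings; only the lexicon/bigram bookkeeping loop is sketched.
from collections import Counter

corpus = ["abcabd", "abcabd", "cdabca"]      # decoded subword-like label strings (toy)
lexicon = set("abcd")                         # start with single-label "words"

def segment(utterance, lexicon, max_len=3):
    # Greedy longest-match segmentation into word-like patterns.
    words, i = [], 0
    while i < len(utterance):
        for length in range(min(max_len, len(utterance) - i), 0, -1):
            if length == 1 or utterance[i:i + length] in lexicon:
                words.append(utterance[i:i + length])
                i += length
                break
    return words

for iteration in range(3):
    segmented = [segment(u, lexicon) for u in corpus]
    # Lexical update: add frequent adjacent pairs as new word-like patterns.
    pair_counts = Counter(w1 + w2 for words in segmented for w1, w2 in zip(words, words[1:]))
    lexicon |= {pattern for pattern, count in pair_counts.items() if count >= 2}
    # Linguistic update: bigram counts over the current word-like patterns.
    bigrams = Counter((w1, w2) for words in segmented for w1, w2 in zip(words, words[1:]))
    print(f"iter {iteration}: lexicon size {len(lexicon)}, top bigrams {bigrams.most_common(2)}")
```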

    Snips Voice Platform: an embedded Spoken Language Understanding system for private-by-design voice interfaces

    Full text link
    This paper presents the machine learning architecture of the Snips Voice Platform, a software solution for performing Spoken Language Understanding on microprocessors typical of IoT devices. The embedded inference is fast and accurate while enforcing privacy by design, as no personal user data is ever collected. Focusing on Automatic Speech Recognition and Natural Language Understanding, we detail our approach to training high-performance Machine Learning models that are small enough to run in real time on small devices. Additionally, we describe a data generation procedure that provides sufficient, high-quality training data without compromising user privacy. Comment: 29 pages, 9 figures, 17 tables
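
    The data generation idea mentioned above can be illustrated by expanding query templates with slot values into synthetic NLU training examples; the templates, slot values, and intent name below are invented for illustration and are not Snips' actual procedure.

```python
# Toy template expansion for generating NLU training data (utterance + slots).
# Templates and slot values are invented; real systems use much richer generation.
import itertools
import random

templates = [
    "turn {state} the {device} in the {room}",
    "please switch {state} the {room} {device}",
]
slots = {
    "state": ["on", "off"],
    "device": ["lights", "heater"],
    "room": ["kitchen", "bedroom"],
}

def generate(templates, slots, n=5, seed=0):
    rng = random.Random(seed)
    examples = []
    for template in templates:
        for combo in itertools.product(*slots.values()):
            values = dict(zip(slots.keys(), combo))
            examples.append({"text": template.format(**values),
                             "intent": "SetDeviceState",   # hypothetical intent label
                             "slots": values})
    rng.shuffle(examples)
    return examples[:n]

for ex in generate(templates, slots):
    print(ex["text"], "->", ex["slots"])
```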

    Speech Recognition by Machine, A Review

    Full text link
    This paper presents a brief survey of Automatic Speech Recognition and discusses the major themes and advances made in the past 60 years of research, so as to provide a technological perspective and an appreciation of the fundamental progress that has been accomplished in this important area of speech communication. After years of research and development, the accuracy of automatic speech recognition remains one of the important research challenges (e.g., variations of context, speakers, and environment). The design of a speech recognition system requires careful attention to the following issues: definition of the various types of speech classes, speech representation, feature extraction techniques, speech classifiers, databases, and performance evaluation. The problems existing in ASR and the various techniques proposed by researchers to solve them are presented in chronological order. Hence, the authors hope that this work will be a contribution to the area of speech recognition. The objective of this review paper is to summarize and compare some of the well-known methods used in the various stages of a speech recognition system, and to identify research topics and applications which are at the forefront of this exciting and challenging field. Comment: 25 pages, IEEE format, International Journal of Computer Science and Information Security, IJCSIS, December 2009, ISSN 1947 5500, http://sites.google.com/site/ijcsis
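
    As one concrete example of the classic template-matching methods such a survey covers, isolated-word recognition can be done by comparing an input feature sequence against stored word templates with dynamic time warping (DTW); the feature sequences below are random arrays standing in for, e.g., MFCC frames.

```python
# Classic DTW template matching for isolated-word recognition: the word whose
# stored template aligns to the input with the lowest warped distance wins.
import numpy as np

def dtw_distance(a, b):
    # a: (n, d), b: (m, d) feature sequences; returns the accumulated alignment cost.
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[n, m]

rng = np.random.default_rng(0)
templates = {word: rng.normal(size=(rng.integers(20, 30), 13)) for word in ["yes", "no", "stop"]}
utterance = templates["no"] + 0.1 * rng.normal(size=templates["no"].shape)   # noisy copy of "no"

scores = {word: dtw_distance(utterance, tmpl) for word, tmpl in templates.items()}
print("recognized:", min(scores, key=scores.get), scores)
```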