Effectiveness of self-supervised pre-training for speech recognition
We compare self-supervised representation learning algorithms which either
explicitly quantize the audio data or learn representations without
quantization. We find the former to be more accurate, since quantization
through vq-wav2vec [1] builds a good vocabulary of the data that enables
learning of effective representations in subsequent BERT training. Unlike
previous work, we directly fine-tune the pre-trained BERT models on
transcribed speech using a Connectionist Temporal Classification (CTC) loss
instead of feeding the representations into a task-specific model. We also
propose a BERT-style model that learns directly from the continuous audio
data and compare pre-training on raw audio to pre-training on spectral
features. Fine-tuning a BERT model on 10 hours of labeled Librispeech data
with a vq-wav2vec vocabulary is almost as good as the best known reported
system trained on 100 hours of labeled data on test-clean, while achieving a
25% WER reduction on test-other. When using only 10 minutes of labeled data,
WER is 25.2 on test-other and 16.3 on test-clean. This demonstrates that
self-supervision can enable speech recognition systems trained on a near-zero
amount of transcribed data.
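To make the CTC fine-tuning step concrete, here is a minimal PyTorch sketch; the encoder is a stand-in for the pre-trained BERT-style model, and the dimensions, vocabulary size, and dummy data are illustrative assumptions, not the paper's configuration.

```python
# Minimal sketch: fine-tune a pre-trained encoder with a CTC loss.
import torch
import torch.nn as nn

class CTCFineTuner(nn.Module):
    def __init__(self, encoder: nn.Module, hidden_dim: int, vocab_size: int):
        super().__init__()
        self.encoder = encoder                         # pre-trained, e.g. BERT over vq-wav2vec codes
        self.head = nn.Linear(hidden_dim, vocab_size)  # maps frames to output letters (+ blank)

    def forward(self, features):                       # features: (batch, time, hidden_dim)
        return self.head(self.encoder(features))       # logits: (batch, time, vocab_size)

# Dummy encoder standing in for the pre-trained model (assumption, for illustration).
encoder = nn.Identity()
model = CTCFineTuner(encoder, hidden_dim=256, vocab_size=30)  # e.g. 29 letters + blank
ctc = nn.CTCLoss(blank=0, zero_infinity=True)

feats = torch.randn(2, 100, 256)                        # two utterances, 100 frames each
logits = model(feats).log_softmax(dim=-1).transpose(0, 1)  # CTC expects (time, batch, vocab)
targets = torch.randint(1, 30, (2, 20))                 # dummy transcripts, 20 labels each
loss = ctc(logits, targets,
           torch.full((2,), 100, dtype=torch.long),     # input lengths
           torch.full((2,), 20, dtype=torch.long))      # target lengths
loss.backward()                                         # gradients flow into the encoder too
```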
SignsWorld; Deeping Into the Silence World and Hearing Its Signs (State of the Art)
Automatic speech processing systems are employed more and more often in real
environments. Although the underlying speech technology is mostly language
independent, differences between languages with respect to their structure and
grammar have a substantial effect on recognition performance. In this paper,
we present a review of the latest developments in sign language recognition
research in general and in Arabic sign language (ArSL) in particular. This
paper also presents a general framework, called SignsWorld, for improving
communication between the deaf community and hearing people. The overall goal
of the SignsWorld project is to develop a vision-based technology for
recognizing and translating continuous Arabic sign language (ArSL).
Comment: 20 pages; a state-of-the-art survey, so it contains many references
Spatial Concept Acquisition for a Mobile Robot that Integrates Self-Localization and Unsupervised Word Discovery from Spoken Sentences
In this paper, we propose a novel unsupervised learning method for the
lexical acquisition of words related to places visited by robots, from human
continuous speech signals. We address the problem of learning novel words by a
robot that has no prior knowledge of these words except for a primitive
acoustic model. Further, we propose a method that allows a robot to effectively
use the learned words and their meanings for self-localization tasks. The
proposed method, a nonparametric Bayesian spatial concept acquisition method
(SpCoA), integrates a generative model for self-localization with unsupervised
word segmentation of uttered sentences via latent variables related to the
spatial concept. We implemented SpCoA on SIGVerse, a simulation environment,
and on TurtleBot2, a mobile robot in a real environment, and conducted
experiments to evaluate its performance. The experimental results showed that
SpCoA enabled the robot to acquire the names of places from spoken sentences.
They also revealed that the robot could effectively utilize the acquired
spatial concepts to reduce the uncertainty in its self-localization.
Comment: This paper was accepted by the IEEE Transactions on Cognitive and
Developmental Systems (04-May-2016)
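As a toy illustration of the core idea, not the paper's implementation, the sketch below shows how a recognized place word can reweight self-localization hypotheses in a particle filter; the spatial concepts, map, and Gaussian models are invented assumptions.

```python
# Illustrative sketch: a heard place word sharpens the robot's pose belief.
import numpy as np

rng = np.random.default_rng(0)

# Particle filter state: candidate robot poses (x, y) with uniform weights.
particles = rng.uniform(0, 10, size=(500, 2))
weights = np.full(500, 1 / 500)

# Hypothetical learned spatial concepts: word -> Gaussian over positions.
concepts = {
    "kitchen": (np.array([2.0, 3.0]), np.eye(2) * 0.5),
    "hallway": (np.array([7.0, 7.0]), np.eye(2) * 1.0),
}

def gaussian_pdf(x, mean, cov):
    """Density of a 2-D Gaussian at each row of x."""
    d = x - mean
    inv = np.linalg.inv(cov)
    norm = 1.0 / (2 * np.pi * np.sqrt(np.linalg.det(cov)))
    return norm * np.exp(-0.5 * np.einsum("ni,ij,nj->n", d, inv, d))

# Hearing "kitchen" makes particles near that concept's region more likely.
mean, cov = concepts["kitchen"]
weights *= gaussian_pdf(particles, mean, cov)
weights /= weights.sum()

# The posterior pose estimate shifts toward the named place,
# reducing self-localization uncertainty.
print("estimated pose:", (weights[:, None] * particles).sum(axis=0))
```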
Likelihood-based semi-supervised model selection with applications to speech processing
In conventional supervised pattern recognition tasks, model selection is
typically accomplished by minimizing the classification error rate on a set of
so-called development data, subject to ground-truth labeling by human experts
or some other means. In the context of speech processing systems and other
large-scale practical applications, however, such labeled development data are
typically costly and difficult to obtain. This article proposes an alternative
semi-supervised framework for likelihood-based model selection that leverages
unlabeled data by using trained classifiers representing each model to
automatically generate putative labels. The errors that result from this
automatic labeling are shown to be amenable to results from robust statistics,
which in turn provide for minimax-optimal censored likelihood ratio tests that
recover the nonparametric sign test as a limiting case. This approach is then
validated experimentally using a state-of-the-art automatic speech recognition
system to select between candidate word pronunciations using unlabeled speech
data that only potentially contain instances of the words under test. Results
provide supporting evidence for the utility of this approach, and suggest that
it may also find use in other applications of machine learning.
Comment: 11 pages, 2 figures; submitted for publication
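The sign test that emerges as the limiting case can be illustrated in a few lines; the log-likelihoods below are simulated stand-ins for scores computed on automatically labeled utterances, so the setup is an assumption for demonstration only.

```python
# Minimal sketch: select between two candidate models via a sign test
# on putatively labeled data.
import numpy as np
from scipy.stats import binomtest

rng = np.random.default_rng(1)

# Per-utterance log-likelihoods under two candidate pronunciation models,
# computed on utterances the recognizer putatively labeled as the word.
loglik_a = rng.normal(loc=-10.0, scale=2.0, size=200)
loglik_b = loglik_a + rng.normal(loc=0.3, scale=1.0, size=200)  # B slightly better

# Sign test: count utterances where model B scores higher than model A.
wins_b = int(np.sum(loglik_b > loglik_a))
result = binomtest(wins_b, n=200, p=0.5, alternative="greater")
print(f"B preferred on {wins_b}/200 utterances, p = {result.pvalue:.3g}")
# A small p-value supports selecting model B even though the putative
# labels were generated automatically and may contain errors.
```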
End-to-End Speech Translation with Knowledge Distillation
End-to-end speech translation (ST), which directly translates from source
language speech into target language text, has attracted intensive attention
in recent years. Compared to conventional pipeline systems, end-to-end ST
models have the advantages of lower latency, smaller model size, and less
error propagation. However, combining speech recognition and text translation
in one model is more difficult than either task alone. In this paper, we
propose a knowledge distillation approach that improves the ST model by
transferring knowledge from a text translation model. Specifically, we first
train a text translation model, regarded as the teacher model, and the ST
model is then trained to learn the output probabilities of the teacher model
through knowledge distillation. Experiments on the English-French Augmented
LibriSpeech and English-Chinese TED corpora show that end-to-end ST can be
implemented for both similar and dissimilar language pairs. In addition, with
the guidance of the teacher model, the end-to-end ST model gains significant
improvements of over 3.5 BLEU points.
Comment: Submitted to Interspeech 201
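The distillation objective described here can be sketched as follows in PyTorch; model internals are elided and the tensors are illustrative stand-ins, with the interpolation weight an assumed value rather than the paper's.

```python
# Sketch: word-level knowledge distillation from an MT teacher to an ST student.
import torch
import torch.nn.functional as F

vocab, batch, steps = 1000, 8, 20

# Teacher (text MT) and student (speech ST) logits over target-language tokens,
# aligned on the same reference target sequence.
teacher_logits = torch.randn(batch, steps, vocab)
student_logits = torch.randn(batch, steps, vocab, requires_grad=True)
gold = torch.randint(0, vocab, (batch, steps))

# KD term: KL divergence from the (frozen) teacher distribution to the student.
kd = F.kl_div(F.log_softmax(student_logits, dim=-1),
              F.softmax(teacher_logits, dim=-1).detach(),
              reduction="batchmean")

# Standard cross-entropy on the gold translation, mixed with the KD term.
ce = F.cross_entropy(student_logits.view(-1, vocab), gold.view(-1))
alpha = 0.8                       # interpolation weight (an assumed value)
loss = alpha * kd + (1 - alpha) * ce
loss.backward()
```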
Neural Polysynthetic Language Modelling
Research in natural language processing commonly assumes that approaches that
work well for English and other widely-used languages are "language
agnostic". In high-resource languages, especially those that are analytic, a
common approach is to treat morphologically-distinct variants of a common root
as completely independent word types. This assumes that there are limited
morphological inflections per root, and that the majority will appear in a
large enough corpus, so that the model can adequately learn statistics about
each form. Approaches like stemming, lemmatization, or subword segmentation
are often used when either of those assumptions does not hold, particularly in
the case of synthetic languages like Spanish or Russian that have more
inflection than English.
In the literature, languages like Finnish or Turkish are held up as extreme
examples of complexity that challenge common modelling assumptions. Yet, when
considering all of the world's languages, Finnish and Turkish are closer to the
average case. When we consider polysynthetic languages (those at the extreme of
morphological complexity), approaches like stemming, lemmatization, or subword
modelling may not suffice. These languages have very high numbers of hapax
legomena, showing the need for appropriate morphological handling of words,
without which it is not possible for a model to capture enough word statistics.
We examine the current state-of-the-art in language modelling, machine
translation, and text prediction for four polysynthetic languages: Guaraní,
St. Lawrence Island Yupik, Central Alaskan Yupik, and Inuktitut. We then
propose a novel framework for language modelling that combines knowledge
representations from finite-state morphological analyzers with Tensor Product
Representations in order to enable neural language models capable of handling
the full range of typologically variant languages.
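Tensor Product Representations, the binding mechanism the framework builds on, can be demonstrated in a few lines; the morpheme fillers and slot roles below are toy assumptions for illustration, not the paper's analyzers.

```python
# Minimal sketch of Tensor Product Representation (TPR) binding and unbinding.
import numpy as np

rng = np.random.default_rng(2)
dim = 8

# Filler vectors for morphemes and role vectors for morphological slots.
fillers = {"root:walk": rng.normal(size=dim), "suffix:PAST": rng.normal(size=dim)}
roles = {"root": rng.normal(size=dim), "suffix": rng.normal(size=dim)}

# Bind each filler to its role with an outer product and sum the bindings:
# the whole analyzed word becomes a single matrix-valued representation.
word_tpr = (np.outer(fillers["root:walk"], roles["root"])
            + np.outer(fillers["suffix:PAST"], roles["suffix"]))

# Unbinding: multiplying by the dual of a role vector recovers its filler,
# so a neural model can query "what fills the suffix slot?".
role_matrix = np.stack([roles["root"], roles["suffix"]], axis=1)
unbind = np.linalg.pinv(role_matrix).T  # dual basis for the role vectors
recovered = word_tpr @ unbind[:, 1]     # filler bound to the suffix role
print(np.allclose(recovered, fillers["suffix:PAST"], atol=1e-6))  # True
```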
AISHELL-2: Transforming Mandarin ASR Research Into Industrial Scale
AISHELL-1 is by far the largest open-source speech corpus available for
Mandarin speech recognition research. It was released with a baseline system
containing solid training and testing pipelines for Mandarin ASR. In
AISHELL-2, 1000 hours of clean read-speech data recorded on iOS devices are
published, free for academic usage. On top of the AISHELL-2 corpus, an
improved recipe is developed and released, containing key components for
industrial applications, such as Chinese word segmentation, flexible
vocabulary expansion, and phone set transformation. The pipelines support
various state-of-the-art techniques, such as time-delayed neural networks and
the Lattice-Free MMI objective function. In addition, we also release dev and
test data from other channels (Android and Mic). For the research community,
we hope that the AISHELL-2 corpus can be a solid resource for topics like
transfer learning and robust ASR. For industry, we hope the AISHELL-2 recipe
can be a helpful reference for building meaningful industrial systems and
products.
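As a small illustration of why the Chinese word segmentation component matters for Mandarin ASR, here is a sketch using the open-source jieba segmenter; this is not necessarily the tool the Kaldi-based recipe uses, and the transcripts are invented examples.

```python
# Toy sketch: segment Mandarin transcripts into words before building
# a pronunciation lexicon or an n-gram language model.
import collections
import jieba

transcripts = [
    "今天天气很好",      # "The weather is nice today"
    "我们去北京旅游",    # "We travel to Beijing"
]

# Mandarin text has no spaces, so word boundaries must be inferred first.
counts = collections.Counter()
for line in transcripts:
    counts.update(jieba.cut(line))

# The resulting word list seeds the vocabulary; unseen words would be
# handled by the recipe's flexible vocabulary expansion step.
for word, n in counts.most_common():
    print(word, n)
```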
Unsupervised Discovery of Linguistic Structure Including Two-level Acoustic Patterns Using Three Cascaded Stages of Iterative Optimization
Techniques for unsupervised discovery of acoustic patterns are increasingly
attractive, because huge quantities of speech data are becoming
available but manual annotations remain hard to acquire. In this paper, we
propose an approach for unsupervised discovery of linguistic structure for the
target spoken language given raw speech data. This linguistic structure
includes two-level (subword-like and word-like) acoustic patterns, the lexicon
of word-like patterns in terms of subword-like patterns and the N-gram language
model based on word-like patterns. All patterns, models, and parameters can be
automatically learned from the unlabelled speech corpus. This is achieved by an
initialization step followed by three cascaded stages for acoustic, linguistic,
and lexical iterative optimization. The lexicon of word-like patterns defines
the allowed consecutive sequences of HMMs for subword-like patterns. In each
iteration, model training and decoding produce updated labels, from which the
lexicon and HMMs can be further updated. In this way, model parameters and
decoded labels are respectively optimized in each iteration, and the knowledge
about the linguistic structure is learned gradually layer after layer. The
proposed approach was tested in preliminary experiments on a corpus of Mandarin
broadcast news, including a task of spoken term detection with performance
compared to a parallel test using models trained in a supervised way. Results
show that the proposed system not only yields reasonable performance on its
own, but is also complementary to existing large-vocabulary ASR systems.
Comment: Accepted by ICASSP 201
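A runnable toy analogue of the iterative train/decode/update loop is sketched below: cluster centroids stand in for the subword-like HMMs and label n-grams for the lexicon and language model. This greatly simplified stand-in is an assumption for illustration, not the paper's three-stage system.

```python
# Toy analogue of iterative pattern discovery: alternate "model training"
# (re-estimate pattern models from labels) and "decoding" (relabel frames).
import collections
import numpy as np

rng = np.random.default_rng(3)
frames = rng.normal(size=(300, 13))          # stand-in acoustic features (e.g. MFCCs)
labels = rng.integers(0, 5, size=300)        # initialization: random pattern labels

for _ in range(10):
    # "Model training": one centroid per pattern, from the current labels.
    centroids = np.stack([frames[labels == k].mean(axis=0) if (labels == k).any()
                          else rng.normal(size=13) for k in range(5)])
    # "Decoding": relabel every frame with its nearest pattern model.
    dists = np.linalg.norm(frames[:, None, :] - centroids[None, :, :], axis=-1)
    labels = dists.argmin(axis=1)

# "Lexicon"/"language model": statistics over consecutive pattern labels.
bigrams = collections.Counter(zip(labels[:-1], labels[1:]))
print(bigrams.most_common(5))
```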
Snips Voice Platform: an embedded Spoken Language Understanding system for private-by-design voice interfaces
This paper presents the machine learning architecture of the Snips Voice
Platform, a software solution to perform Spoken Language Understanding on
microprocessors typical of IoT devices. The embedded inference is fast and
accurate while enforcing privacy by design, as no personal user data is ever
collected. Focusing on Automatic Speech Recognition and Natural Language
Understanding, we detail our approach to training high-performance Machine
Learning models that are small enough to run in real-time on small devices.
Additionally, we describe a data generation procedure that provides sufficient,
high-quality training data without compromising user privacy.
Comment: 29 pages, 9 figures, 17 tables
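In the spirit of the data generation procedure described above, here is a small sketch of template-based query generation for NLU training; the templates and slot values are invented for illustration and are not the Snips platform's actual patterns.

```python
# Toy sketch: generate labeled NLU training utterances from templates,
# so no real user data is needed.
import random

random.seed(4)

templates = [
    "turn {state} the {device} in the {room}",
    "set the {device} in the {room} to {state}",
]
slots = {
    "state": ["on", "off"],
    "device": ["lights", "heater"],
    "room": ["kitchen", "bedroom"],
}

def generate(n):
    """Yield (utterance, slot-annotation) pairs for NLU training."""
    for _ in range(n):
        values = {name: random.choice(options) for name, options in slots.items()}
        yield random.choice(templates).format(**values), values

for text, annotation in generate(3):
    print(text, "->", annotation)
```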
Speech Recognition by Machine, A Review
This paper presents a brief survey on Automatic Speech Recognition and
discusses the major themes and advances made in the past 60 years of research,
so as to provide a technological perspective and an appreciation of the
fundamental progress that has been accomplished in this important area of
speech communication. After years of research and development, the accuracy of
automatic speech recognition remains one of the important research challenges
(e.g., variations of context, speakers, and environment). The design of a
speech recognition system requires careful attention to the following issues:
definition of various types of speech classes, speech representation, feature
extraction techniques, speech classifiers, databases, and performance
evaluation. The problems that exist in ASR and the various techniques that
researchers have developed to solve them are presented in chronological order.
The authors therefore hope that this work will be a contribution to the area
of speech recognition. The objective of this review paper is to summarize and
compare some of the well-known methods used in various stages of a speech
recognition system and to identify research topics and applications which are
at the forefront of this exciting and challenging field.
Comment: 25 pages, IEEE format; International Journal of Computer Science and
Information Security, IJCSIS, December 2009, ISSN 1947-5500,
http://sites.google.com/site/ijcsis