2,185 research outputs found
Transfer Learning for Speech and Language Processing
Transfer learning is a vital technique that generalizes models trained for
one setting or task to other settings or tasks. For example in speech
recognition, an acoustic model trained for one language can be used to
recognize speech in another language, with little or no re-training data.
Transfer learning is closely related to multi-task learning (cross-lingual vs.
multilingual), and is traditionally studied in the name of `model adaptation'.
Recent advance in deep learning shows that transfer learning becomes much
easier and more effective with high-level abstract features learned by deep
models, and the `transfer' can be conducted not only between data distributions
and data types, but also between model structures (e.g., shallow nets and deep
nets) or even model types (e.g., Bayesian models and neural models). This
review paper summarizes some recent prominent research towards this direction,
particularly for speech and language processing. We also report some results
from our group and highlight the potential of this very interesting research
field.Comment: 13 pages, APSIPA 201
Meta-Learning for Phonemic Annotation of Corpora
We apply rule induction, classifier combination and meta-learning (stacked
classifiers) to the problem of bootstrapping high accuracy automatic annotation
of corpora with pronunciation information. The task we address in this paper
consists of generating phonemic representations reflecting the Flemish and
Dutch pronunciations of a word on the basis of its orthographic representation
(which in turn is based on the actual speech recordings). We compare several
possible approaches to achieve the text-to-pronunciation mapping task:
memory-based learning, transformation-based learning, rule induction, maximum
entropy modeling, combination of classifiers in stacked learning, and stacking
of meta-learners. We are interested both in optimal accuracy and in obtaining
insight into the linguistic regularities involved. As far as accuracy is
concerned, an already high accuracy level (93% for Celex and 86% for Fonilex at
word level) for single classifiers is boosted significantly with additional
error reductions of 31% and 38% respectively using combination of classifiers,
and a further 5% using combination of meta-learners, bringing overall word
level accuracy to 96% for the Dutch variant and 92% for the Flemish variant. We
also show that the application of machine learning methods indeed leads to
increased insight into the linguistic regularities determining the variation
between the two pronunciation variants studied.Comment: 8 page
Speech synthesis, Speech simulation and speech science
Speech synthesis research has been transformed in recent years through the exploitation of speech corpora - both for statistical modelling and as a source of signals for concatenative synthesis. This revolution in methodology and the new techniques it brings calls into question the received wisdom that better computer voice output will come from a better understanding of how humans produce speech. This paper discusses the relationship between this new technology of simulated speech and the traditional aims of speech science. The paper suggests that the goal of speech simulation frees engineers from inadequate linguistic and physiological descriptions of speech. But at the same time, it leaves speech scientists free to return to their proper goal of building a computational model of human speech production
Optimizing expected word error rate via sampling for speech recognition
State-level minimum Bayes risk (sMBR) training has become the de facto
standard for sequence-level training of speech recognition acoustic models. It
has an elegant formulation using the expectation semiring, and gives large
improvements in word error rate (WER) over models trained solely using
cross-entropy (CE) or connectionist temporal classification (CTC). sMBR
training optimizes the expected number of frames at which the reference and
hypothesized acoustic states differ. It may be preferable to optimize the
expected WER, but WER does not interact well with the expectation semiring, and
previous approaches based on computing expected WER exactly involve expanding
the lattices used during training. In this paper we show how to perform
optimization of the expected WER by sampling paths from the lattices used
during conventional sMBR training. The gradient of the expected WER is itself
an expectation, and so may be approximated using Monte Carlo sampling. We show
experimentally that optimizing WER during acoustic model training gives 5%
relative improvement in WER over a well-tuned sMBR baseline on a 2-channel
query recognition task (Google Home)
Comparison between rule-based and data-driven natural language processing algorithms for Brazilian Portuguese speech synthesis
Due to the exponential growth in the use of computers, personal digital assistants and smartphones, the development of Text-to-Speech (TTS) systems have become highly demanded during the last years. An important part of these systems is the Text Analysis block, that converts the input text into linguistic specifications that are going to be used to generate the final speech waveform. The Natural Language Processing algorithms presented in this block are crucial to the quality of the speech generated by synthesizers. These algorithms are responsible for important tasks such as Grapheme-to-Phoneme Conversion, Syllabification and Stress Determination. For Brazilian Portuguese (BP), solutions for the algorithms presented in the Text Analysis block have been focused in rule-based approaches. These algorithms perform well for BP but have many disadvantages. On the other hand, there is still no research to evaluate and analyze the performance of data-driven approaches that reach state-of-the-art results for complex languages, such as English. So, in this work, we compare different data-driven approaches and rule-based approaches for NLP algorithms presented in a TTS system. Moreover, we propose, as a novel application, the use of Sequence-to-Sequence models as solution for the Syllabification and Stress Determination problems. As a brief summary of the results obtained, we show that data-driven algorithms can achieve state-of-the-art performance for the NLP algorithms presented in the Text Analysis block of a BP TTS system.Nos Ăşltimos anos, devido ao grande crescimento no uso de computadores, assistentes pessoais e smartphones, o desenvolvimento de sistemas capazes de converter texto em fala tem sido bastante demandado. O bloco de análise de texto, onde o texto de entrada Ă© convertido em especificações linguĂsticas usadas para gerar a onda sonora final Ă© uma parte importante destes sistemas. O desempenho dos algoritmos de Processamento de Linguagem Natural (NLP) presentes neste bloco Ă© crucial para a qualidade dos sintetizadores de voz. ConversĂŁo Grafema-Fonema, separação silábica e determinação da sĂlaba tĂ´nica sĂŁo algumas das tarefas executadas por estes algoritmos. Para o PortuguĂŞs Brasileiro (BP), os algoritmos baseados em regras tĂŞm sido o foco na solução destes problemas. Estes algoritmos atingem bom desempenho para o BP, contudo apresentam diversas desvantagens. Por outro lado, ainda nĂŁo há pesquisa no intuito de avaliar o desempenho de algoritmos data-driven, largamente utilizados para lĂnguas complexas, como o inglĂŞs. Desta forma, expõe-se neste trabalho uma comparação entre diferentes tĂ©cnicas data-driven e baseadas em regras para algoritmos de NLP utilizados em um sintetizador de voz. AlĂ©m disso, propõe o uso de Sequence-to-Sequence models para a separação silábica e a determinação da tonicidade. Em suma, o presente trabalho demonstra que o uso de algoritmos data-driven atinge o estado-da-arte na performance dos algoritmos de Processamento de Linguagem Natural de um sintetizador de voz para o PortuguĂŞs Brasileiro
- …