Few-Shot and Zero-Shot Learning for Historical Text Normalization
Historical text normalization often relies on small training datasets. Recent
work has shown that multi-task learning can lead to significant improvements by
exploiting synergies with related datasets, but there has been no systematic
study of different multi-task learning architectures. This paper evaluates
63 multi-task learning configurations for sequence-to-sequence-based historical
text normalization across ten datasets from eight languages, using
autoencoding, grapheme-to-phoneme mapping, and lemmatization as auxiliary
tasks. We observe consistent, significant improvements across languages when
training data for the target task is limited, but minimal or no improvements
when training data is abundant. We also show that zero-shot learning
outperforms the simple, but relatively strong, identity baseline.
Comment: Accepted at DeepLo-201
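The identity baseline mentioned in this abstract leaves every historical token unchanged and counts how often that already matches the normalized form. A minimal sketch follows, assuming word-level accuracy as the metric; the function name and the toy token pairs are hypothetical and not taken from the paper's datasets.

```python
def identity_baseline_accuracy(pairs):
    """Word accuracy obtained by copying each historical token unchanged.

    `pairs` is a list of (historical_form, normalized_form) tuples.
    """
    if not pairs:
        raise ValueError("need at least one (historical, normalized) pair")
    # A token counts as correct when the unmodified input equals the target.
    correct = sum(1 for historical, normalized in pairs if historical == normalized)
    return correct / len(pairs)


# Hypothetical toy data, for illustration only.
toy_pairs = [("vnd", "und"), ("jn", "in"), ("der", "der"), ("zeyt", "zeit")]
baseline = identity_baseline_accuracy(toy_pairs)
print(baseline)
```

Any learned normalizer, including a zero-shot one, would be compared against this number on the same token pairs.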
Hard Non-Monotonic Attention for Character-Level Transduction
Character-level string-to-string transduction is an important component of
various NLP tasks. The goal is to map an input string to an output string,
where the strings may be of different lengths and have characters taken from
different alphabets. Recent approaches have used sequence-to-sequence models
with an attention mechanism to learn which parts of the input string the model
should focus on during the generation of the output string. Both soft attention
and hard monotonic attention have been used, but hard non-monotonic attention
has only been used in other sequence modeling tasks such as image captioning
and has required a stochastic approximation to compute the gradient. In this
work, we introduce an exact, polynomial-time algorithm for marginalizing over
the exponential number of non-monotonic alignments between two strings, showing
that hard attention models can be viewed as neural reparameterizations of the
classical IBM Model 1. We compare soft and hard non-monotonic attention
experimentally and find that the exact algorithm significantly improves
performance over the stochastic approximation and outperforms soft attention.
Comment: Published in EMNLP 201
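The connection to IBM Model 1 can be illustrated with a small sketch: if each output position chooses its aligned input position independently, the sum over all alignments factorizes into a product of per-position sums. The code below is not the paper's implementation. It assumes precomputed attention probabilities `attn[j][i]` and emission probabilities `emit[j][i]` (both hypothetical toy values), and checks the polynomial-time result against brute-force enumeration.

```python
import itertools
import math


def exact_marginal(attn, emit):
    """Product over output positions of sum over input positions.

    Costs O(len(output) * len(input)) instead of enumerating alignments.
    """
    total = 1.0
    for attn_row, emit_row in zip(attn, emit):
        total *= sum(a * e for a, e in zip(attn_row, emit_row))
    return total


def brute_force_marginal(attn, emit):
    """Sum over all len(input) ** len(output) alignments (exponential)."""
    n_out, n_in = len(attn), len(attn[0])
    total = 0.0
    for alignment in itertools.product(range(n_in), repeat=n_out):
        total += math.prod(attn[j][i] * emit[j][i] for j, i in enumerate(alignment))
    return total


# Hypothetical toy values: 2 output positions, 3 input positions.
attn = [[0.7, 0.2, 0.1], [0.1, 0.6, 0.3]]    # each row sums to 1
emit = [[0.9, 0.05, 0.3], [0.2, 0.8, 0.5]]   # p(output symbol j | aligned to input i)

fast = exact_marginal(attn, emit)
slow = brute_force_marginal(attn, emit)
print(fast, slow)
```

Both functions compute the same quantity; only the first scales to realistic string lengths, which is what makes exact training feasible without a stochastic gradient estimate.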
Comparison between rule-based and data-driven natural language processing algorithms for Brazilian Portuguese speech synthesis
Due to the exponential growth in the use of computers, personal digital assistants and smartphones, the development of Text-to-Speech (TTS) systems has been in high demand in recent years. An important part of these systems is the Text Analysis block, which converts the input text into the linguistic specifications used to generate the final speech waveform. The Natural Language Processing (NLP) algorithms in this block are crucial to the quality of the speech generated by synthesizers. These algorithms are responsible for important tasks such as Grapheme-to-Phoneme Conversion, Syllabification and Stress Determination. For Brazilian Portuguese (BP), solutions for the algorithms in the Text Analysis block have focused on rule-based approaches. These algorithms perform well for BP but have many disadvantages. On the other hand, no research has yet evaluated and analyzed, for BP, the performance of the data-driven approaches that reach state-of-the-art results for complex languages such as English. In this work, we therefore compare different data-driven and rule-based approaches for the NLP algorithms in a TTS system. Moreover, we propose, as a novel application, the use of Sequence-to-Sequence models as a solution to the Syllabification and Stress Determination problems. In summary, we show that data-driven algorithms can achieve state-of-the-art performance for the NLP algorithms in the Text Analysis block of a BP TTS system.
Text Preprocessing for Speech Synthesis
In this paper we describe our text preprocessing modules for English text-to-speech synthesis. These modules comprise rule-based text normalization (subsuming sentence segmentation and normalization of non-standard words), statistical part-of-speech tagging, and statistical syllabification, grapheme-to-phoneme conversion, and word stress assignment, relying in part on rule-based morphological analysis.
Towards a unified model for speech and language processing
The present work explores deep learning methods applied to a variety of areas
in speech and language processing, including speech recognition, grapheme-to-phoneme conversion,
speech synthesis, and generative models for speech, to build toward a unified
approach that reframes these individual tasks as a more general problem of finding a
universal representation of the information encoded in different modalities, seamlessly
transferring a signal from one modality to another by converting it to this universal
representation, and generating samples in multiple modalities. It consists of two main
research projects: 1) SoundChoice, a context-aware, sentence-level grapheme-to-phoneme
model achieving solid performance on the task and a significant improvement in phoneme
disambiguation over baseline models, and 2) MAdmixture, a novel approach to learning a variety
of speech representations in a common latent space.
Zero-shot keyword spotting for visual speech recognition in-the-wild
Visual keyword spotting (KWS) is the problem of estimating whether a text
query occurs in a given recording using only video information. This paper
focuses on visual KWS for words unseen during training, a real-world, practical
setting which has so far received no attention from the community. To this end,
we devise an end-to-end architecture comprising (a) a state-of-the-art visual
feature extractor based on spatiotemporal Residual Networks, (b) a
grapheme-to-phoneme model based on sequence-to-sequence neural networks, and
(c) a stack of recurrent neural networks which learn how to correlate visual
features with the keyword representation. Unlike prior work on KWS,
which try to learn word representations merely from sequences of graphemes
(i.e. letters), we propose the use of a grapheme-to-phoneme encoder-decoder
model which learns how to map words to their pronunciation. We demonstrate that
our system obtains very promising visual-only KWS results on the challenging
LRS2 database, for keywords unseen during training. We also show that our
system outperforms a baseline which addresses KWS via automatic speech
recognition (ASR), while it drastically improves over other recently proposed
ASR-free KWS methods.
Comment: Accepted at ECCV-201