59 research outputs found
Enhancing Word Representation Learning with Linguistic Knowledge
Representation learning, the process whereby representations are modelled from data, has recently become a central part of Natural Language Processing (NLP). Among the most widely used learned representations are word embeddings trained on large corpora of unannotated text, where the learned embeddings are treated as general representations that can be used across multiple NLP tasks. Despite their empirical successes, word embeddings learned entirely from data can only capture patterns of language usage from the particular linguistic domain of the training data. Linguistic knowledge, which does not vary among linguistic domains, can potentially be used to address this limitation. The vast sources of linguistic knowledge that are readily available nowadays can help train more general word embeddings (i.e. less affected by distance between linguistic domains) by providing them with such information as semantic relations, syntactic structure, word morphology, etc.
In this research, I investigate the different ways in which word embedding models capture and encode words’ semantic and contextual information. To this end, I propose two approaches to integrate linguistic knowledge into the statistical learning of word embeddings. The first approach augments the training data for the well-known Skip-gram word embedding model: synonym information is extracted from a lexical knowledge base and incorporated into the training data in the form of additional training examples. This data augmentation approach seeks to enforce synonym relations in the learned embeddings. The second approach exploits structural information in text by transforming every sentence in the data into its corresponding dependency parse tree and training an autoencoder to recover the original sentence. While learning a mapping from a dependency parse tree to its originating sentence, this novel Structure-to-Sequence (Struct2Seq) model produces word embeddings that contain information about a word’s structural context. Given that the combination of knowledge and statistical methods can often be unpredictable, a central focus of this thesis is on understanding the effects of incorporating linguistic knowledge into word representation learning. Through the use of intrinsic (geometric characteristics) and extrinsic (performance on downstream tasks) evaluation metrics, I aim to measure the specific influence that the injected knowledge can have on different aspects of the informational composition of word embeddings.
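As a rough illustration of the synonym-based data augmentation described in this abstract, the sketch below builds ordinary Skip-gram (target, context) pairs and then appends extra (word, synonym) pairs. It is a minimal sketch under stated assumptions, not the author's implementation: the function names, the window size, and the toy `SYNONYMS` dictionary (standing in for a lexical knowledge base) are all hypothetical.

```python
def skipgram_pairs(tokens, window=2):
    """Return (target, context) pairs for one tokenised sentence."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs


def augment_with_synonyms(pairs, synonyms):
    """Append one (word, synonym) pair per synonym of each distinct target.

    `synonyms` is a hypothetical dict standing in for a lexical knowledge base.
    """
    augmented = list(pairs)
    seen = set()
    for target, _ in pairs:
        if target in seen:
            continue
        seen.add(target)
        for syn in synonyms.get(target, ()):
            augmented.append((target, syn))
    return augmented


# Toy data: assumed for illustration only.
SYNONYMS = {"quick": ["fast", "rapid"]}
sentence = ["the", "quick", "fox"]

base = skipgram_pairs(sentence, window=1)
training_pairs = augment_with_synonyms(base, SYNONYMS)
```

The augmented pairs would then be fed to a Skip-gram trainer alongside the corpus-derived pairs. How often each synonym pair is repeated, and whether pairs are added in both directions, are design choices the abstract does not specify.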
Prosody generation for text-to-speech synthesis
The absence of convincing intonation makes current parametric speech
synthesis systems sound dull and lifeless, even when trained on expressive
speech data. Typically, these systems use regression techniques to predict the
fundamental frequency (F0) frame-by-frame. This approach leads to overly smooth
pitch contours and fails to construct an appropriate prosodic structure
across the full utterance. In order to capture and reproduce larger-scale
pitch patterns, we propose a template-based approach for automatic F0 generation,
where per-syllable pitch-contour templates (from a small, automatically
learned set) are predicted by a recurrent neural network (RNN). The use of
syllable templates mitigates the over-smoothing problem and is able to reproduce
pitch patterns observed in the data. The use of an RNN, paired with connectionist
temporal classification (CTC), enables the prediction of structure in
the pitch contour spanning the entire utterance. This novel F0 prediction system
is used alongside separate LSTMs for predicting phone durations and the
other acoustic features, to construct a complete text-to-speech system. Later,
we investigate the benefits of including long-range dependencies in duration
prediction at frame-level using uni-directional recurrent neural networks.
Since prosody is a supra-segmental property, we consider an alternate approach
to intonation generation which exploits long-term dependencies of
F0 by effective modelling of linguistic features using recurrent neural networks.
For this purpose, we propose a hierarchical encoder-decoder and a
multi-resolution parallel encoder, in which the encoder takes word-level and
higher-level linguistic features as input and upsamples them to phone level
through a series of hidden layers. The encoder is integrated into a hybrid
system that was submitted to the Blizzard Challenge workshop. Finally, we
highlight some of the issues in current approaches and outline a plan for
future directions of investigation, along with ongoing work.
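The abstract pairs an RNN with connectionist temporal classification (CTC) to predict a sequence of pitch-contour templates. The sketch below shows only the standard CTC best-path post-processing step (merge consecutive repeats, then drop blanks) applied to per-frame label IDs. It is an assumption that the system decodes this way; the label IDs, the blank index, and the function name are hypothetical.

```python
BLANK = 0  # assumed index of the CTC blank symbol


def ctc_greedy_collapse(frame_labels, blank=BLANK):
    """Collapse per-frame labels: merge consecutive repeats, then drop blanks."""
    collapsed = []
    previous = None
    for label in frame_labels:
        # Emit a label only when it differs from the previous frame
        # and is not the blank symbol.
        if label != previous and label != blank:
            collapsed.append(label)
        previous = label
    return collapsed


# Hypothetical per-frame argmax outputs; IDs 1..3 denote template classes.
frames = [0, 3, 3, 0, 3, 1, 1, 0, 0, 2]
templates = ctc_greedy_collapse(frames)
print(templates)  # [3, 3, 1, 2]
```

Note that the blank between the two runs of label 3 is what allows the same template to be emitted twice in a row.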
Transformer Models for Machine Translation and Streaming Automatic Speech Recognition
[ES] Natural language processing (NLP) is a set of computational problems
with highly relevant applications which, together with other computing
technologies, has benefited from the revolution brought about by deep
learning. This thesis focuses on two fundamental problems for NLP: machine
translation (MT) and automatic speech recognition or automatic transcription
(ASR); as well as on a deep neural architecture, the Transformer, which we
put into practice to improve MT and ASR solutions in some of their
applications.
ASR and MT can be used to obtain high-quality multilingual texts at a
reasonable cost for a variety of audiovisual content. Specifically, this
thesis addresses problems such as news translation and automatic television
subtitling. ASR and MT can also be combined with each other, automatically
generating translated subtitles, or with other NLP solutions: text
summarization to produce summaries of speeches, or speech synthesis to create
automatic dubbing. These applications fall outside the scope of this thesis
but can benefit from the contributions it contains, insofar as they help to
improve the performance of the automatic systems on which they depend.
This thesis contains an application of the Transformer architecture to MT as
it was originally conceived, through which we obtain top-level results in the
translation of similar languages. In subsequent chapters, this thesis
addresses the adaptation of the Transformer as a language model for live
hybrid ASR systems. It then describes the application of this type of system
to the television subtitling use case, participating in a public RTVE
competition in which we obtain first place by a wide margin. We also show
that the improvement is due mainly to the technology developed and not so
much to the data.
[CA] Natural language processing (NLP) is a set of computational problems
with highly relevant applications which, together with other computing
technologies, has benefited from the revolution brought about by the impact
of deep learning. This thesis focuses on two fundamental problems for NLP:
machine translation (MT) and automatic speech recognition or automatic
transcription (ASR); as well as on a deep neural architecture, the
Transformer, which we put into practice to improve MT and ASR solutions in
some of their applications.
ASR and MT can be used to obtain high-quality multilingual texts at a
reasonable cost for a wide range of audiovisual content. Specifically, this
thesis addresses problems such as news translation and automatic television
subtitling. ASR and MT can also be combined with each other, automatically
generating translated subtitles, or with other NLP solutions: with text
summarization to produce summaries of speeches, or with speech synthesis to
create automatic dubbing. These other applications fall outside the scope of
this thesis but can benefit from the contributions it contains, insofar as
they help to improve the results of the automatic systems on which they
depend.
This thesis contains an application of the Transformer architecture to MT as
it was conceived, through which we obtain top-level results in the
translation of similar languages. In subsequent chapters, this thesis
addresses the adaptation of the Transformer as a language model for live
hybrid ASR systems. It then describes the application of this type of system
to the use case of subtitling television content, participating in a public
RTVE competition in which we obtain first place by a significant margin. We
also show that the improvement is due mainly to the technology developed and
not so much to the data.
[EN] Natural language processing (NLP) is a set of fundamental computing
problems with immense applicability, as language is the natural communication
vehicle for people. NLP, along with many other computer technologies, has
been revolutionized in recent years by the impact of deep learning. This thesis
is centered around two keystone problems for NLP: machine translation (MT)
and automatic speech recognition (ASR); and a common deep neural architecture,
the Transformer, that is leveraged to improve the technical solutions for
some MT and ASR applications.
ASR and MT can be utilized to produce cost-effective, high-quality multilingual
texts for a wide array of media. Particular applications pursued in this
thesis are news translation and automatic live captioning of television
broadcasts. ASR and MT can also be combined with each other, for
instance generating automatic translated subtitles from audio, or augmented
with other NLP solutions: text summarization to produce a summary of a
speech, or speech synthesis to create an automatic translated dubbing.
These other applications fall outside the scope of this thesis, but can
profit from the contributions that it contains, as they help to improve the
performance of the automatic systems on which they depend.
This thesis contains an application of the Transformer architecture to MT as it
was originally conceived, achieving state-of-the-art results in similar language
translation. In successive chapters, this thesis covers the adaptation of the
Transformer as a language model for streaming hybrid ASR systems. Afterwards,
it describes how we applied the developed technology to a specific use
case in television captioning by participating in a competitive challenge and
achieving the first position by a large margin. We also show that the gains
came mostly from the improvement in technology capabilities over two years,
including that of the Transformer language model adapted for streaming, and
that the data component was minor.
Baquero Arnal, P. (2023). Transformer Models for Machine Translation and Streaming Automatic Speech Recognition [Tesis doctoral]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/19368