
    Exploiting word embeddings for modeling bilexical relations

    There has been an exponential surge of text data in recent years. As a consequence, unsupervised methods that make use of this data have been growing steadily in the field of natural language processing (NLP). Word embeddings are low-dimensional vectors obtained by applying unsupervised techniques to large unlabelled corpora; words from the vocabulary are mapped to vectors of real numbers that aim to capture the syntactic and semantic properties of words. Many NLP tasks involve computing the compatibility between lexical items under some linguistic relation; we call this type of relation a bilexical relation. This thesis defines statistical models for bilexical relations that make central use of word embeddings. Our principal aim is for the word embeddings to favor generalization to words not seen during the training of the model.

    The thesis is structured in four parts. In the first part, we present a bilinear model over word embeddings that leverages a small supervised dataset for a binary linguistic relation. Our learning algorithm exploits low-rank bilinear forms and induces a low-dimensional embedding tailored to a target linguistic relation, yielding compressed task-specific embeddings. In the second part, we extend the bilinear model to a ternary setting and propose a framework for resolving prepositional phrase attachment ambiguity using word embeddings. Our models perform competitively with the state of the art and, in addition, obtain significant improvements on out-of-domain tests simply by using word embeddings induced from the source and target domains. In the third part, we further extend the bilinear models to expand vocabulary coverage in the context of statistical phrase-based machine translation. Given a word in the source language, our model produces a probabilistic list of possible translations in the target language, obtained by projecting pre-trained embeddings into a common subspace with a log-bilinear model. We observe a significant empirical improvement on an out-of-domain test set. In the final part, we propose a non-linear model that maps initial word embeddings to task-tuned word embeddings, in the context of a neural network dependency parser. We demonstrate its use for improved dependency parsing, especially for sentences with unseen words, and also show downstream improvements on a sentiment analysis task.
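    To make the low-rank bilinear idea above concrete, here is a minimal sketch, with assumed dimensions, initialization, and a logistic loss rather than the thesis's actual training setup: a word pair (x, y) is scored as x^T U V^T y, so that U^T x and V^T y act as the compressed, task-specific embeddings the abstract describes.

        # Minimal low-rank bilinear scorer for a binary bilexical relation.
        # All names and hyperparameters here are illustrative assumptions.
        import numpy as np

        d, k = 300, 25                            # embedding dim, rank of the form
        rng = np.random.default_rng(0)
        U = rng.normal(scale=0.01, size=(d, k))   # W = U @ V.T has rank <= k
        V = rng.normal(scale=0.01, size=(d, k))

        def score(x, y):
            """Compatibility of embeddings x, y: x^T U V^T y."""
            return float((U.T @ x) @ (V.T @ y))

        def sgd_step(x, y, label, lr=0.1):
            """One logistic-loss gradient step on a labeled word pair."""
            global U, V
            p = 1.0 / (1.0 + np.exp(-score(x, y)))  # predicted P(relation holds)
            g = p - label                           # d(loss)/d(score), label in {0, 1}
            gU = g * np.outer(x, V.T @ y)           # gradient w.r.t. U
            gV = g * np.outer(y, U.T @ x)           # gradient w.r.t. V
            U = U - lr * gU
            V = V - lr * gV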
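    The third part's vocabulary expansion admits a similarly small sketch: rank target-language words by a softmax over bilinear scores in the shared subspace. The names `E_tgt` (target embedding matrix) and `W` (the learned projection) are assumptions for illustration, not the thesis's actual interface.

        # Probabilistic translation candidates via a log-bilinear projection.
        import numpy as np

        def translation_distribution(src_vec, E_tgt, W):
            """p(t | s) proportional to exp(src^T W tgt_t) over the target
            vocabulary. src_vec: (d_s,), E_tgt: (|V_t|, d_t), W: (d_s, d_t)."""
            scores = E_tgt @ (W.T @ src_vec)   # bilinear score per target word
            scores -= scores.max()             # stabilize the softmax
            p = np.exp(scores)
            return p / p.sum()

        # e.g. top-5 candidates for one source word:
        # probs = translation_distribution(emb_src["house"], E_tgt, W)
        # best = np.argsort(-probs)[:5]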

    Speech Recognition

    Chapters in the first part of the book cover all the essential speech processing techniques for building robust automatic speech recognition systems: the representation of speech signals, methods for speech-feature extraction, acoustic and language modeling, efficient algorithms for searching the hypothesis space, and multimodal approaches to speech recognition. The last part of the book is devoted to other speech processing applications that can use information from automatic speech recognition for speaker identification and tracking, for prosody modeling in emotion-detection systems, and in other speech processing applications able to operate in real-world environments, such as mobile communication services and smart homes.

    Salience Estimation and Faithful Generation: Modeling Methods for Text Summarization and Generation

    This thesis is focused on a particular text-to-text generation problem, automatic summarization, where the goal is to map a large input text to a much shorter summary text. The research presented aims to both understand and tame existing machine learning models, hopefully paving the way for more reliable text-to-text generation algorithms. Somewhat against the prevailing trends, we eschew end-to-end training of an abstractive summarization model, and instead break down the text summarization problem into its constituent tasks. At a high level, we divide these tasks into two categories: content selection, or “what to say,” and content realization, or “how to say it” (McKeown, 1985). Within these categories we propose models and learning algorithms for the problems of salience estimation and faithful generation. Salience estimation, that is, determining the importance of a piece of text relative to some context, falls into the former category: determining what should be selected for a summary. In particular, we experiment with a variety of popular and novel deep learning models for salience estimation in a single-document summarization setting, and design several ablation experiments to gain insight into which input signals are most important for making predictions. Understanding these signals is critical for designing reliable summarization models. We then consider the more difficult problem of estimating salience in a large document stream, and propose two alternative approaches using classical machine learning techniques from both unsupervised clustering and structured prediction. These models incorporate salience estimates into larger text extraction algorithms that also consider redundancy and previous extraction decisions. Overall, we find that when simple, position-based heuristics are available, as in single-document news or research summarization, deep learning models of salience often exploit them to make predictions while ignoring the arguably more important content features of the input. In more demanding environments, like stream summarization, where heuristics are unreliable, more semantically relevant features become key to identifying salient content.

    In part two, content realization, we assume content selection has already been performed and focus on methods for faithful generation, i.e., ensuring that output text utterances respect the semantics of the input content. Since they can generate very fluent and natural text, deep learning-based natural language generation models are a popular approach to this problem. However, they often omit, misconstrue, or otherwise generate text that is not semantically correct given the input content. In this section, we develop a data augmentation and self-training technique to mitigate this problem. Additionally, we propose a training method for making deep learning-based natural language generation models capable of following a content plan, allowing for more control over the output utterances generated by the model. Under a stress-test evaluation protocol, we demonstrate some empirical limits on several neural natural language generation models’ ability to encode and properly realize a content plan. Finally, we conclude with some remarks on future directions for abstractive summarization outside of the end-to-end deep learning paradigm.
    Our aim here is to suggest avenues for constructing abstractive summarization systems with transparent, controllable, and reliable behavior when it comes to text understanding, compression, and generation. Our hope is that this thesis inspires more research in this direction and, ultimately, real tools that are broadly useful outside of the natural language processing community.
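    As one concrete picture of how salience estimates can feed a larger extraction algorithm that also weighs redundancy and previous extraction decisions, here is a minimal MMR-style greedy selector. MMR is a classical instantiation of the idea, not necessarily the thesis's exact models, and `salience` and `sent_vecs` are assumed to come from an upstream estimator.

        # Greedy extraction trading salience against redundancy (MMR-style).
        import numpy as np

        def cosine(a, b):
            return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9)

        def greedy_extract(salience, sent_vecs, budget=3, lam=0.7):
            """Select `budget` sentence indices, penalizing similarity to
            sentences already chosen (previous extraction decisions)."""
            selected, candidates = [], list(range(len(salience)))
            while candidates and len(selected) < budget:
                def mmr(i):
                    redundancy = max((cosine(sent_vecs[i], sent_vecs[j])
                                      for j in selected), default=0.0)
                    return lam * salience[i] - (1 - lam) * redundancy
                best = max(candidates, key=mmr)
                selected.append(best)
                candidates.remove(best)
            return sorted(selected)   # restore document order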
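    The data augmentation and self-training idea for faithful generation can likewise be sketched as a loop; `generate`, `is_faithful`, and `retrain` below are hypothetical callables standing in for the model's decoder, a semantic-fidelity check against the input content, and a training routine, none of which come from the thesis itself.

        # Self-training with a faithfulness filter (illustrative skeleton).
        def self_train(model, labeled, unlabeled_inputs,
                       generate, is_faithful, retrain, rounds=3):
            """labeled: list of (input_content, reference_text) pairs."""
            data = list(labeled)
            for _ in range(rounds):
                for x in unlabeled_inputs:
                    y = generate(model, x)      # model's own output for x
                    if is_faithful(x, y):       # keep only outputs that respect
                        data.append((x, y))     #   the input semantics
                model = retrain(model, data)    # refit on the augmented data
            return model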