8 research outputs found

    The TALP–UPC Spanish–English WMT biomedical task: bilingual embeddings and char-based neural language model rescoring in a phrase-based system

    This paper describes the TALP–UPC system in the Spanish–English WMT 2016 biomedical shared task. Our system is a standard phrase-based system enhanced with vocabulary expansion using bilingual word embeddings and a character-based neural language model with rescoring. The former focuses on resolving out-of-vocabulary words, while the latter enhances the fluency of the system. The two modules progressively improve the final translation as measured by a combination of several lexical metrics.
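    As a rough illustration of the rescoring step described above, the sketch below re-ranks an n-best list of translation hypotheses with a character-level language model score. It is a minimal stand-in, not the TALP–UPC system: a smoothed character bigram model replaces the neural char-based LM, and the interpolation weight lm_weight, the toy training sentence, and the hypothesis scores are all invented.

```python
import math
from collections import Counter

# Minimal illustration of n-best rescoring with a character-level language
# model. A character bigram LM with add-one smoothing stands in for the
# neural char-based LM so the example stays self-contained.

def train_char_bigram_lm(corpus):
    """Estimate smoothed character bigram log-probabilities from raw text."""
    bigrams, unigrams = Counter(), Counter()
    for sentence in corpus:
        chars = ["<s>"] + list(sentence) + ["</s>"]
        for a, b in zip(chars, chars[1:]):
            bigrams[(a, b)] += 1
            unigrams[a] += 1
    vocab = {c for sentence in corpus for c in sentence} | {"<s>", "</s>"}
    v = len(vocab)

    def logprob(text):
        chars = ["<s>"] + list(text) + ["</s>"]
        return sum(
            math.log((bigrams[(a, b)] + 1) / (unigrams[a] + v))
            for a, b in zip(chars, chars[1:])
        )
    return logprob

def rescore_nbest(nbest, char_lm, lm_weight=0.3):
    """Combine the phrase-based model score with the char-LM score."""
    rescored = [(hyp, pb_score + lm_weight * char_lm(hyp)) for hyp, pb_score in nbest]
    return max(rescored, key=lambda pair: pair[1])[0]

char_lm = train_char_bigram_lm(["the patient was treated with antibiotics"])
nbest = [("the patient was treat with antibiotic", -4.1),
         ("the patient was treated with antibiotics", -4.3)]
print(rescore_nbest(nbest, char_lm))
```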

    Towards Automatic Generation of Shareable Synthetic Clinical Notes Using Neural Language Models

    Large-scale clinical data is invaluable for driving many computational scientific advances today. However, understandable concerns regarding patient privacy hinder the open dissemination of such data and give rise to suboptimal siloed research. De-identification methods attempt to address these concerns but have been shown to be susceptible to adversarial attacks. In this work, we focus on the vast amounts of unstructured natural language data stored in clinical notes and propose to automatically generate synthetic clinical notes that are more amenable to sharing, using generative models trained on real de-identified records. To evaluate the merit of such notes, we measure both their privacy-preservation properties and their utility in training clinical NLP models. Experiments using neural language models yield notes whose utility is close to that of the real ones in some clinical NLP tasks, yet leave ample room for future improvements.
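    A minimal sketch of the generation step described above: fit a language model to real, de-identified notes and sample new synthetic notes from it. The paper uses neural language models; here a word-bigram sampler stands in so the example runs without a training framework, and the toy notes are invented.

```python
import random
from collections import defaultdict, Counter

# Train on de-identified notes, then sample synthetic notes token by token.

def train_bigram_sampler(notes):
    """Collect next-word counts for each word in the de-identified notes."""
    successors = defaultdict(Counter)
    for note in notes:
        tokens = ["<s>"] + note.split() + ["</s>"]
        for a, b in zip(tokens, tokens[1:]):
            successors[a][b] += 1
    return successors

def sample_note(successors, max_len=30, rng=random.Random(0)):
    """Sample one synthetic note, one token at a time."""
    token, out = "<s>", []
    while len(out) < max_len:
        candidates = successors[token]
        token = rng.choices(list(candidates), weights=candidates.values())[0]
        if token == "</s>":
            break
        out.append(token)
    return " ".join(out)

real_notes = [
    "patient presents with chest pain and shortness of breath",
    "patient presents with fever and cough for three days",
]
model = train_bigram_sampler(real_notes)
print(sample_note(model))
```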

    Numeracy for Language Models: Evaluating and Improving their Ability to Predict Numbers

    Numeracy is the ability to understand and work with numbers. It is a necessary skill for composing and understanding documents in clinical, scientific, and other technical domains. In this paper, we explore different strategies for modelling numerals with language models, such as memorisation and digit-by-digit composition, and propose a novel neural architecture that uses a continuous probability density function to model numerals from an open vocabulary. Our evaluation on clinical and scientific datasets shows that using hierarchical models to distinguish numerals from words improves perplexity on the subset of numerals by two and four orders of magnitude, respectively, over non-hierarchical models. A combination of strategies can further improve perplexity. Our continuous probability density function model reduces mean absolute percentage errors by 18% and 54%, respectively, in comparison to the second-best strategy for each dataset.
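    For illustration, the sketch below shows a continuous probability density over numeral values (a small Gaussian mixture) and the mean absolute percentage error used to compare predicted and true numbers. The mixture parameters and example values are invented; the paper learns such densities jointly with the language model.

```python
import math

# A Gaussian mixture as a continuous density over numeral values, plus MAPE.

def gaussian_pdf(x, mean, std):
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))

def mixture_density(x, components):
    """Density of a numeral value under a weighted Gaussian mixture."""
    return sum(w * gaussian_pdf(x, m, s) for w, m, s in components)

def mape(predictions, targets):
    """Mean absolute percentage error between predicted and true numbers."""
    return 100.0 * sum(abs(p - t) / abs(t) for p, t in zip(predictions, targets)) / len(targets)

# A hypothetical mixture for, say, systolic blood pressure readings.
components = [(0.6, 120.0, 10.0), (0.4, 140.0, 15.0)]
print(mixture_density(118.0, components))   # likelihood of the numeral 118
print(mape([118.0, 95.0], [120.0, 100.0]))  # error of two predictions
```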

    From feature to paradigm: deep learning in machine translation

    In recent years, deep learning algorithms have revolutionized several areas, including speech, image, and natural language processing. The field of Machine Translation (MT) has not remained unaffected. The integration of deep learning in MT ranges from re-modelling existing features within standard statistical systems to the development of entirely new architectures. Among the different neural networks, research works use feed-forward neural networks, recurrent neural networks, and the encoder-decoder schema. These architectures are able to tackle challenges such as low-resource settings or morphological variation. This manuscript describes how these neural networks have been integrated to enhance different aspects and models of statistical MT, including language modelling, word alignment, translation, reordering, and rescoring. We then report on the new neural MT approach, together with a description of the foundational related works and recent approaches to using subwords, characters, and multilingual training, among others. Finally, we include an analysis of the corresponding challenges and future work in using deep learning in MT.
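    As a minimal sketch of the encoder-decoder schema mentioned above, the PyTorch code below encodes a source sentence into a fixed-size vector and decodes greedily from it. Vocabulary sizes, dimensions, the assumed <bos> index, and the untrained random weights are all illustrative; none of the surveyed systems is reproduced here.

```python
import torch
from torch import nn

# Minimal encoder-decoder (seq2seq) sketch with GRUs and greedy decoding.

class Encoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=64, hid_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hid_dim, batch_first=True)

    def forward(self, src_ids):
        _, hidden = self.rnn(self.embed(src_ids))
        return hidden  # fixed-size summary of the source sentence

class Decoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=64, hid_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, vocab_size)

    def forward(self, prev_ids, hidden):
        output, hidden = self.rnn(self.embed(prev_ids), hidden)
        return self.out(output), hidden  # logits over the target vocabulary

encoder, decoder = Encoder(1000), Decoder(1000)
src = torch.randint(0, 1000, (1, 7))          # one source sentence, 7 tokens
hidden = encoder(src)
token = torch.zeros(1, 1, dtype=torch.long)   # <bos> assumed to be index 0
for _ in range(5):                            # greedy decoding, 5 steps
    logits, hidden = decoder(token, hidden)
    token = logits.argmax(dim=-1)
    print(token.item())
```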

    Distributional initialization of neural networks

    In Natural Language Processing (NLP), text is, together with speech, one of the main sources of information. Computational systems that process raw text need to transform the text input into a machine-readable format. The final performance of NLP systems depends on the quality of these input representations, which is why the main objective of representation learning is to retain and highlight important features of the input tokens (characters, words, phrases, etc.). Traditionally, the input representations for neural networks (NNs) are one-hot vectors, where each word is represented by a vector of all-but-one zeros, with the value 1 at the position that corresponds to the index of the word in the vocabulary. Such a representation only helps to differentiate words, but does not contain any usable information about the relations between them. Word representations that are learned by NNs - word embeddings - are arranged in a matrix, where each row corresponds to a particular word in the vocabulary and is retrieved by multiplying the corresponding one-hot vector with the embedding matrix. These word embeddings are initialized randomly and adjust their values during training to capture contextual semantic information with respect to the training objective. When a word is frequent, it is seen often during training and its representation is updated frequently; for the same reason, embeddings of rare words receive far fewer updates. This makes it difficult for NNs to learn good word embeddings for words that occur only a few times in a corpus.

    In this work, we propose a method to improve the quality of word embeddings for rare words. The main idea is to initialize a NN that learns embeddings with sparse distributional vectors that are precomputed for rare words from a given corpus. We introduce and investigate several methods for building such distributional representations: different ways to combine one-hot representations of frequent words with distributional representations of rare words, different similarity functions between distributional vectors, and different normalization approaches applied to the representations in order to control the amplitude of the input signals.

    We evaluate the performance of the proposed models on two tasks. On a word similarity judgment task, word embeddings are used to compute similarity scores between the two words of given pairs; these similarity scores are then compared with human ratings. Using the same NN architecture, word embeddings trained with distributional initialization perform significantly better than word embeddings trained with traditional one-hot initialization. On a language modeling task, where models compete in predicting the probability of a given sequence of words, models with distributional initialization show minor improvements over models with one-hot initialization.

    We also study the very popular word2vec tool (Mikolov et al., 2013a), which is used to obtain word embeddings without supervision. The main question we ask is how much the quality of the learned word embeddings depends on the initial random seed. The obtained results suggest that training with word2vec is stable and reliable.
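    A minimal sketch of the initialization idea under simplified assumptions: frequent words keep a one-hot input representation, while each rare word receives a sparse distributional vector built from sentence-level co-occurrence with frequent words, so that multiplying it with the embedding matrix yields a weighted mix of frequent-word embeddings rather than a single random row. The frequency threshold, normalization, and the combination and similarity schemes compared in the thesis are simplified away here.

```python
import numpy as np
from collections import Counter

# One-hot inputs for frequent words; sparse distributional inputs for rare words.

def input_representations(corpus, rare_threshold=1):
    counts = Counter(w for sent in corpus for w in sent)
    vocab = sorted(counts)
    index = {w: i for i, w in enumerate(vocab)}
    reps = {}
    for w in vocab:
        vec = np.zeros(len(vocab))
        if counts[w] > rare_threshold:
            vec[index[w]] = 1.0                      # one-hot for frequent words
        else:
            for sent in corpus:                      # sentence-level co-occurrence
                if w in sent:
                    for c in sent:
                        if c != w and counts[c] > rare_threshold:
                            vec[index[c]] += 1.0
            if vec.sum() > 0:
                vec /= vec.sum()                     # normalize the sparse vector
            else:
                vec[index[w]] = 1.0                  # fall back to one-hot
        reps[w] = vec
    return vocab, reps

corpus = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "axolotl", "sat"]]
vocab, reps = input_representations(corpus)
emb_matrix = np.random.default_rng(0).normal(size=(len(vocab), 16))
print(reps["axolotl"] @ emb_matrix)  # rare word: mix of frequent-word embeddings
```

    During training, the embedding rows reached through these sparse vectors receive gradient updates whenever the rare word occurs, which is the mechanism the thesis exploits to improve rare-word embeddings.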