
    Enriching Rare Word Representations in Neural Language Models by Embedding Matrix Augmentation

    Neural language models (NLM) achieve strong generalization capability by learning dense representations of words and using them to estimate the probability distribution function. However, learning the representations of rare words is challenging and causes the NLM to produce unreliable probability estimates. To address this problem, we propose a method to enrich the representations of rare words in a pre-trained NLM and consequently improve its probability estimation performance. The proposed method augments the word embedding matrices of the pre-trained NLM while keeping all other parameters unchanged. Specifically, our method updates the embedding vectors of rare words using the embedding vectors of other semantically and syntactically similar words. To evaluate the proposed method, we enrich the rare street names in a pre-trained NLM and use it to rescore 100-best hypotheses output from the Singapore English speech recognition system. The enriched NLM reduces the word error rate by 6% relative and improves the recognition accuracy of the rare words by 16% absolute compared to the baseline NLM. Comment: 5 pages, 2 figures, accepted to INTERSPEECH 201
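
    The update described in the abstract can be read as mixing a rare word's existing vector with the centroid of its similar words' vectors, leaving the rest of the pre-trained NLM untouched. Below is a minimal NumPy sketch of that idea; the interpolation weight alpha, the similar_words mapping, and the function name are illustrative assumptions, not the paper's exact update rule.

```python
import numpy as np

def augment_rare_embeddings(embedding, similar_words, alpha=0.3):
    """Overwrite each rare word's row in the embedding matrix with a mix of
    its original vector and the mean of its similar words' vectors.

    embedding     : np.ndarray of shape (vocab_size, dim), pre-trained embedding matrix
    similar_words : dict mapping a rare word id -> list of similar word ids (assumed given)
    alpha         : weight kept for the rare word's original (unreliable) vector
    """
    augmented = embedding.copy()
    for rare_id, sim_ids in similar_words.items():
        if not sim_ids:
            continue
        # Centroid of the similar (frequent, well-trained) word vectors.
        sim_mean = embedding[sim_ids].mean(axis=0)
        # Convex combination: mostly the similar-word centroid, a little of the original.
        augmented[rare_id] = alpha * embedding[rare_id] + (1.0 - alpha) * sim_mean
    return augmented

# Toy usage: enrich two hypothetical rare-word ids using frequent neighbours.
rng = np.random.default_rng(0)
E = rng.normal(size=(10, 4))                    # toy 10-word, 4-dimensional embedding matrix
rare_to_similar = {7: [1, 2, 3], 9: [4, 5]}     # hypothetical similarity lists
E_aug = augment_rare_embeddings(E, rare_to_similar)
print(E_aug[7])
```

    In the paper's setting the same kind of update would be applied to the embedding matrices of the pre-trained model (input and, where tied or separate, output), which is why no other parameters need retraining.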

    Learning to Create and Reuse Words in Open-Vocabulary Neural Language Modeling

    Fixed-vocabulary language models fail to account for one of the most characteristic statistical facts of natural language: the frequent creation and reuse of new word types. Although character-level language models offer a partial solution in that they can create word types not attested in the training corpus, they do not capture the "bursty" distribution of such words. In this paper, we augment a hierarchical LSTM language model that generates sequences of word tokens character by character with a caching mechanism that learns to reuse previously generated words. To validate our model we construct a new open-vocabulary language modeling corpus (the Multilingual Wikipedia Corpus, MWC) from comparable Wikipedia articles in 7 typologically diverse languages and demonstrate the effectiveness of our model across this range of languages. Comment: ACL 201
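
    The caching mechanism can be pictured as a two-way mixture: a word is either spelled out by the character-level generator or copied from a cache of previously generated words, which is what lets the model assign high probability to "bursty" reuse. The sketch below shows only that mixture; the gate p_gen and the cache scores are assumed to be given here, whereas in the paper they are produced by the model itself.

```python
import torch
import torch.nn.functional as F

def open_vocab_word_prob(candidate, p_gen, char_logprob, cache_words, cache_scores):
    """Two-way mixture over a word:
      - with probability p_gen, the word is spelled out by the character-level model;
      - with probability 1 - p_gen, it is copied from the cache of previously
        generated words (softmax over cache scores).
    """
    p_char = p_gen * torch.exp(char_logprob)
    if cache_words:
        cache_probs = F.softmax(cache_scores, dim=0)
        # Sum the probability mass of all cache slots holding the candidate word.
        mask = torch.tensor([float(w == candidate) for w in cache_words])
        p_copy = (1.0 - p_gen) * (cache_probs * mask).sum()
    else:
        p_copy = torch.tensor(0.0)
    return p_char + p_copy

# Toy usage: "Tilda" was generated earlier, so copying it from the cache keeps its
# probability high even though the character model finds it unlikely to spell out.
cache = ["Tilda", "appeared", "in"]
scores = torch.tensor([2.0, 0.1, 0.1])          # hypothetical cache scores
p = open_vocab_word_prob("Tilda", p_gen=0.6,
                         char_logprob=torch.tensor(-12.0),
                         cache_words=cache, cache_scores=scores)
print(float(p))
```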

    Skip-Thought Vectors

    We describe an approach for unsupervised learning of a generic, distributed sentence encoder. Using the continuity of text from books, we train an encoder-decoder model that tries to reconstruct the surrounding sentences of an encoded passage. Sentences that share semantic and syntactic properties are thus mapped to similar vector representations. We next introduce a simple vocabulary expansion method to encode words that were not seen as part of training, allowing us to expand our vocabulary to a million words. After training our model, we extract and evaluate our vectors with linear models on 8 tasks: semantic relatedness, paraphrase detection, image-sentence ranking, question-type classification, and 4 benchmark sentiment and subjectivity datasets. The end result is an off-the-shelf encoder that can produce highly generic sentence representations that are robust and perform well in practice. We will make our encoder publicly available. Comment: 11 pages
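
    The vocabulary expansion step can be sketched as learning a linear map from a large pre-trained word-vector space (e.g. word2vec) into the encoder's word embedding space, fit on the words the two vocabularies share, and then applying that map to words the encoder never saw in training. The least-squares fit, the matrix shapes, and the function names below are assumptions for illustration, not the paper's exact procedure.

```python
import numpy as np

def learn_vocab_expansion(w2v_vecs, rnn_vecs):
    """Fit a linear map W so that rnn_vec is approximately w2v_vec @ W,
    using words present in both vocabularies.

    w2v_vecs : (n_shared, d_w2v) matrix of pre-trained word2vec vectors
    rnn_vecs : (n_shared, d_rnn) matrix of the encoder's learned word embeddings
    """
    W, *_ = np.linalg.lstsq(w2v_vecs, rnn_vecs, rcond=None)
    return W

def expand(word_vec, W):
    """Project an out-of-vocabulary word's word2vec vector into the encoder space."""
    return word_vec @ W

# Toy shapes only; real use would build the training pairs from the intersection
# of the word2vec vocabulary and the encoder's training vocabulary.
rng = np.random.default_rng(1)
shared_w2v = rng.normal(size=(500, 300))   # 500 shared words, 300-dim word2vec vectors
shared_rnn = rng.normal(size=(500, 620))   # encoder embedding dimension assumed here
W = learn_vocab_expansion(shared_w2v, shared_rnn)
oov_encoder_vec = expand(rng.normal(size=300), W)
print(oov_encoder_vec.shape)
```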