Finding Function in Form: Compositional Character Models for Open Vocabulary Word Representation
We introduce a model for constructing vector representations of words by
composing characters using bidirectional LSTMs. Relative to traditional word
representation models that have independent vectors for each word type, our
model requires only a single vector per character type and a fixed set of
parameters for the compositional model. Despite the compactness of this model
and, more importantly, the arbitrary nature of the form-function relationship
in language, our "composed" word representations yield state-of-the-art results
in language modeling and part-of-speech tagging. Benefits over traditional
baselines are particularly pronounced in morphologically rich languages (e.g.,
Turkish).
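To make the compactness claim above concrete, here is a rough parameter-count sketch. All sizes are hypothetical and are not taken from the paper; the count assumes a standard LSTM cell (four gates, one bias vector per gate) in each direction and a linear projection from the two final states to the word vector, which may differ from the authors' exact architecture.

```python
def word_lookup_params(vocab_size: int, word_dim: int) -> int:
    """Lookup table: one independent vector per word type."""
    return vocab_size * word_dim


def lstm_params(input_dim: int, hidden_dim: int) -> int:
    """One LSTM direction: 4 gates, each with input weights,
    recurrent weights, and a bias (assumed standard formulation)."""
    return 4 * (hidden_dim * (input_dim + hidden_dim) + hidden_dim)


def char_model_params(n_chars: int, char_dim: int,
                      hidden_dim: int, word_dim: int) -> int:
    """Character embeddings + bidirectional LSTM + linear projection
    of the two final states to a word vector."""
    char_embeddings = n_chars * char_dim
    bilstm = 2 * lstm_params(char_dim, hidden_dim)
    projection = 2 * hidden_dim * word_dim + word_dim
    return char_embeddings + bilstm + projection


# Hypothetical sizes, chosen only for illustration.
lookup = word_lookup_params(vocab_size=100_000, word_dim=50)
composed = char_model_params(n_chars=100, char_dim=50,
                             hidden_dim=150, word_dim=50)
print(lookup, composed)
```

Under these assumed sizes the lookup table grows linearly with the vocabulary, while the composed model's size depends only on the character inventory and the fixed layer dimensions.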
Mimicking Word Embeddings using Subword RNNs
Word embeddings improve generalization over lexical features by placing each
word in a lower-dimensional space, using distributional information obtained
from unlabeled data. However, the effectiveness of word embeddings for
downstream NLP tasks is limited by out-of-vocabulary (OOV) words, for which
embeddings do not exist. In this paper, we present MIMICK, an approach to
generating OOV word embeddings compositionally, by learning a function from
spellings to distributional embeddings. Unlike prior work, MIMICK does not
require re-training on the original word embedding corpus; instead, learning is
performed at the type level. Intrinsic and extrinsic evaluations demonstrate
the power of this simple approach. On 23 languages, MIMICK improves performance
over a word-based baseline for tagging part-of-speech and morphosyntactic
attributes. It is competitive with (and complementary to) a supervised
character-based model in low-resource settings.
Comment: EMNLP 201
Character Composition Model with Convolutional Neural Networks for Dependency Parsing on Morphologically Rich Languages
We present a transition-based dependency parser that uses a convolutional
neural network to compose word representations from characters. The character
composition model shows great improvement over the word-lookup model,
especially for parsing agglutinative languages. These improvements are even
larger than those obtained by using pre-trained word embeddings from extra data. On the SPMRL
data sets, our system outperforms the previous best greedy parser (Ballesteros
et al., 2015) by a margin of 3% on average.
Comment: Accepted in ACL 2017 (Short
A Syllable-based Technique for Word Embeddings of Korean Words
Word embedding has become a fundamental component of many NLP tasks such as
named entity recognition and machine translation. However, popular models that
learn such embeddings are unaware of the morphology of words, so they are not
directly applicable to highly agglutinative languages such as Korean. We
propose a syllable-based learning model for Korean using a convolutional neural
network, in which word representation is composed of trained syllable vectors.
Our model successfully produces morphologically meaningful representations of
Korean words compared to the original Skip-gram embeddings. The results also
show that it is quite robust to the out-of-vocabulary problem.
Comment: 5 pages, 3 figures, 1 table. Accepted for EMNLP 2017 Workshop - The
1st Workshop on Subword and Character level models in NLP (SCLeM
A Sub-Character Architecture for Korean Language Processing
We introduce a novel sub-character architecture that exploits a unique
compositional structure of the Korean language. Our method decomposes each
character into a small set of primitive phonetic units called jamo letters from
which character- and word-level representations are induced. The jamo letters
divulge syntactic and semantic information that is difficult to access with
conventional character-level units. They greatly alleviate the data sparsity
problem, reducing the observation space to 1.6% of the original while
increasing accuracy in our experiments. We apply our architecture to dependency
parsing and achieve dramatic improvement over strong lexical baselines.
Comment: EMNLP 201
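The jamo decomposition described in this abstract can be illustrated with the standard Unicode arithmetic for precomposed Hangul syllables. This is a minimal sketch, not the authors' preprocessing code: the function names are hypothetical, and the output uses conjoining jamo code points (U+1100 block), which may differ from the unit inventory used in the paper.

```python
# Constants from the Unicode Hangul syllable composition scheme.
S_BASE = 0xAC00   # first precomposed syllable
S_LAST = 0xD7A3   # last precomposed syllable
L_BASE = 0x1100   # first leading consonant
V_BASE = 0x1161   # first vowel
T_BASE = 0x11A7   # one below the first trailing consonant
V_COUNT = 21
T_COUNT = 28      # 27 trailing consonants plus "no trailing consonant"
N_COUNT = V_COUNT * T_COUNT  # 588 syllables per leading consonant


def decompose_syllable(ch: str) -> tuple:
    """Split one precomposed Hangul syllable into its jamo letters.
    Any other character is returned unchanged as a 1-tuple."""
    code = ord(ch)
    if not (S_BASE <= code <= S_LAST):
        return (ch,)
    index = code - S_BASE
    lead = chr(L_BASE + index // N_COUNT)
    vowel = chr(V_BASE + (index % N_COUNT) // T_COUNT)
    tail_index = index % T_COUNT
    if tail_index == 0:
        return (lead, vowel)          # syllable has no final consonant
    return (lead, vowel, chr(T_BASE + tail_index))


def to_jamo(text: str) -> list:
    """Flatten a string into a sequence of jamo (and pass-through) units."""
    units = []
    for ch in text:
        units.extend(decompose_syllable(ch))
    return units


print(to_jamo("한글"))
```

Because every syllable maps to two or three letters drawn from a small fixed inventory, a model built on these units sees far fewer distinct symbols than one built on whole syllables.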
Compositional Morphology for Word Representations and Language Modelling
This paper presents a scalable method for integrating compositional
morphological representations into a vector-based probabilistic language model.
Our approach is evaluated in the context of log-bilinear language models,
rendered suitably efficient for implementation inside a machine translation
decoder by factoring the vocabulary. We perform both intrinsic and extrinsic
evaluations, presenting results on a range of languages which demonstrate that
our model learns morphological representations that both perform well on word
similarity tasks and lead to substantial reductions in perplexity. When used
for translation into morphologically rich languages with large vocabularies,
our models obtain improvements of up to 1.2 BLEU points relative to a baseline
system using back-off n-gram models.
Comment: Proceedings of the 31st International Conference on Machine Learning
(ICML