Learning to Create and Reuse Words in Open-Vocabulary Neural Language Modeling
Fixed-vocabulary language models fail to account for one of the most
characteristic statistical facts of natural language: the frequent creation and
reuse of new word types. Although character-level language models offer a
partial solution in that they can create word types not attested in the
training corpus, they do not capture the "bursty" distribution of such words.
In this paper, we augment a hierarchical LSTM language model that generates
sequences of word tokens character by character with a caching mechanism that
learns to reuse previously generated words. To validate our model we construct
a new open-vocabulary language modeling corpus (the Multilingual Wikipedia
Corpus, MWC) from comparable Wikipedia articles in 7 typologically diverse
languages and demonstrate the effectiveness of our model across this range of
languages. Comment: ACL 201
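As a rough illustration of why a cache helps with "bursty" words, the sketch below mixes a count-based cache distribution over previously generated words with a character-level probability. It is a minimal sketch, not the paper's model: the function name `cache_mixture_prob`, the fixed mixing weight `lam`, and the count-based cache are all assumptions made for illustration, and `p_char` is simply passed in as a number rather than computed by an LSTM.

```python
from collections import Counter


def cache_mixture_prob(word, history, p_char, lam=0.3):
    """Mix a cache probability with a character-level probability.

    P(word) = lam * P_cache(word | history) + (1 - lam) * P_char(word)

    `lam` is a fixed constant here (an assumption for illustration);
    `p_char` stands in for the probability a character-level model
    would assign to spelling out `word`.
    """
    if not history:
        # Nothing to reuse yet: fall back to character-level generation.
        return p_char
    counts = Counter(history)
    p_cache = counts[word] / len(history)
    return lam * p_cache + (1.0 - lam) * p_char


# A made-up word that has already appeared twice becomes far more likely
# than an equally rare word that has not been generated yet.
history = ["Zorblat", "is", "a", "town", "Zorblat"]
p_reused = cache_mixture_prob("Zorblat", history, p_char=1e-6)
p_unseen = cache_mixture_prob("Quixel", history, p_char=1e-6)
```

With these toy numbers, the reused word gets most of its probability from the cache term, while the unseen word relies entirely on the character-level term.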
2kenize: Tying Subword Sequences for Chinese Script Conversion
Simplified Chinese to Traditional Chinese character conversion is a common
preprocessing step in Chinese NLP. Despite this, current approaches have poor
performance because they do not take into account that a simplified Chinese
character can correspond to multiple traditional characters. Here, we propose a
model that can disambiguate between mappings and convert between the two
scripts. The model is based on subword segmentation, two language models, as
well as a method for mapping between subword sequences. We further construct
benchmark datasets for topic classification and script conversion. Our proposed
method outperforms previous Chinese character conversion approaches by 6 points
in accuracy. These results are further confirmed in a downstream application,
where 2kenize is used to convert the pretraining dataset for topic classification.
An error analysis reveals that our method's particular strengths are in dealing
with code-mixing and named entities. Comment: Accepted to ACL 202
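The one-to-many mapping problem described above can be sketched with a toy example. This is not the 2kenize method: it works on single characters rather than subword sequences, and it replaces the two language models with a hand-written bigram score table. The names `CANDIDATES`, `BIGRAM_SCORE`, and `convert` are hypothetical, and the scores are invented for illustration.

```python
from itertools import product

# Simplified -> traditional candidates; "发" is ambiguous (發 "emit" vs. 髮 "hair").
CANDIDATES = {"头": ["頭"], "发": ["發", "髮"], "现": ["現"]}

# Toy context scores standing in for a language model (invented values).
BIGRAM_SCORE = {("頭", "髮"): 2.0, ("發", "現"): 2.0}


def convert(simplified):
    """Pick the candidate sequence with the highest total bigram score."""
    # Characters without an entry are passed through unchanged.
    options = [CANDIDATES.get(ch, [ch]) for ch in simplified]

    def score(seq):
        return sum(BIGRAM_SCORE.get(pair, 0.0) for pair in zip(seq, seq[1:]))

    # Exhaustive search is fine for a toy input; a real system would
    # need a more efficient search over candidates.
    best = max(product(*options), key=score)
    return "".join(best)


hair = convert("头发")      # context selects 髮
discover = convert("发现")  # context selects 發
```

The same simplified character is resolved differently depending on its neighbour, which is the behaviour a context-free character table cannot provide.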
On the relation between linguistic typology and (limitations of) multilingual language modeling
A key challenge in cross-lingual NLP is developing general language-independent architectures that are equally applicable to any language. However, this ambition is largely hampered by the variation in structural and semantic properties, i.e. the typological profiles of the world's languages. In this work, we analyse the implications of this variation on the language modeling (LM) task. We present a large-scale study of state-of-the-art n-gram-based and neural language models on 50 typologically diverse languages covering a wide variety of morphological systems. Operating in the full vocabulary LM setup focused on word-level prediction, we demonstrate that a coarse typology of morphological systems is predictive of absolute LM performance. Moreover, fine-grained typological features such as exponence, flexivity, fusion, and inflectional synthesis are borne out to be responsible for the proliferation of low-frequency phenomena which are organically difficult to model by statistical architectures, or for the meaning ambiguity of character n-grams. Our study strongly suggests that these features have to be taken into consideration during the construction of next-level language-agnostic LM architectures, capable of handling morphologically complex languages such as Tamil or Korean. ERC grant Lexica