1,492 research outputs found
BPEmb: Tokenization-free Pre-trained Subword Embeddings in 275 Languages
We present BPEmb, a collection of pre-trained subword unit embeddings in 275
languages, based on Byte-Pair Encoding (BPE). In an evaluation using
fine-grained entity typing as testbed, BPEmb performs competitively, and for
some languages better than alternative subword approaches, while requiring
vastly fewer resources and no tokenization. BPEmb is available at
https://github.com/bheinzerling/bpem
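
As a rough illustration of the Byte-Pair Encoding idea this abstract builds on, the following toy sketch learns merge operations from a small word list. It is not BPEmb's code: the function name `learn_bpe` and the example words are hypothetical, and real systems typically train on large corpora with dedicated tooling.

```python
from collections import Counter


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` BPE merges from a list of words (toy version)."""
    # Each word starts as a tuple of single characters, counted by frequency.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single symbol
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite every word, fusing each occurrence of the best pair.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] += freq
        vocab = new_vocab
    return merges, vocab


merges, vocab = learn_bpe(["low", "low", "lower", "lowest"], num_merges=2)
```

After two merges on this toy input, the frequent sequence "low" becomes a single subword unit, which is the kind of unit a subword embedding would then be trained for.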
Comparing Fifty Natural Languages and Twelve Genetic Languages Using Word Embedding Language Divergence (WELD) as a Quantitative Measure of Language Distance
We introduce a new measure of distance between languages based on word
embedding, called word embedding language divergence (WELD). WELD is defined as
the divergence between the unified similarity distributions of words across languages.
Using such a measure, we perform language comparison for fifty natural
languages and twelve genetic languages. Our natural language dataset is a
collection of sentence-aligned parallel corpora from bible translations for
fifty languages spanning a variety of language families. Although we use
parallel corpora, which guarantee the same content in all languages,
in many cases languages within the same family still cluster together.
In addition to natural languages, we perform language comparison for the coding
regions in the genomes of 12 different organisms (4 plants, 6 animals, and two
human subjects). Our result confirms a significant high-level difference in the
genetic language model of humans/animals versus plants. The proposed method is
a step toward defining a quantitative measure of similarity between languages,
with applications in language classification, genre identification, dialect
identification, and evaluation of translations.
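
The abstract does not say which divergence WELD uses or how the similarity distributions are built. Purely as an illustration of comparing two discrete distributions, the sketch below assumes the Jensen-Shannon divergence; the function name `js_divergence` and the input vectors are hypothetical and are not taken from the paper.

```python
import math


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two discrete distributions."""

    def kl(a, b):
        # Kullback-Leibler divergence; terms with a zero numerator contribute 0.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    # Mixture distribution: the midpoint of p and q.
    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Hypothetical similarity distributions for two languages.
distance = js_divergence([0.7, 0.2, 0.1], [0.1, 0.2, 0.7])
```

With base-2 logarithms this measure is symmetric and bounded between 0 (identical distributions) and 1 (disjoint support), which makes it convenient as a distance-like score for clustering.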
Hierarchical Character-Word Models for Language Identification
Social media messages' brevity and unconventional spelling pose a challenge
to language identification. We introduce a hierarchical model that learns
character and contextualized word-level representations for language
identification. Our method performs well against strong baselines, and can
also reveal code-switching.
MiLMo: Minority Multilingual Pre-trained Language Model
Pre-trained language models are trained on large-scale unsupervised data and
can then be fine-tuned on only small-scale labeled datasets to achieve
good results. Multilingual pre-trained language models can be trained on
multiple languages, and the model can understand multiple languages at the same
time. At present, research on pre-trained models mainly focuses on resource-rich
languages, while there is relatively little research on low-resource languages
such as minority languages, and the public multilingual pre-trained language
models do not work well for minority languages. Therefore, this paper
constructs a multilingual pre-trained model named MiLMo that performs better on
minority language tasks, including Mongolian, Tibetan, Uyghur, Kazakh and
Korean. To solve the problem of scarcity of datasets on minority languages and
verify the effectiveness of the MiLMo model, this paper constructs a minority
multilingual text classification dataset named MiTC, and trains a word2vec
model for each language. By comparing the word2vec model and the pre-trained
model in the text classification task, this paper provides an optimal scheme
for the downstream task research of minority languages. The final experimental
results show that the performance of the pre-trained model is better than that
of the word2vec model, and it has achieved the best results in minority
multilingual text classification. The multilingual pre-trained model MiLMo,
multilingual word2vec model and multilingual text classification dataset MiTC
are published on http://milmo.cmli-nlp.com/
First Attempt at Building Parallel Corpora for Machine Translation of Northeast India's Very Low-Resource Languages
This paper presents the creation of initial bilingual corpora for thirteen
very low-resource languages of India, all from Northeast India. It also
presents the results of initial translation efforts in these languages. It
creates the first-ever parallel corpora for these languages and provides
initial benchmark neural machine translation results for these languages. We
intend to extend these corpora to include a large number of low-resource Indian
languages and integrate the effort with our prior work with African and
American-Indian languages to create corpora covering a large number of
languages from across the world.
Comment: Accepted to ICON 202