Analysis of the Effects of Cross-Lingual Transfer Learning, Compression, and Whitening in Natural Language Encoding
Doctoral thesis (Information Science), Tohoku University
Comparative Analysis of Word Embeddings for Capturing Word Similarities
Distributed language representations have become the most widely used technique
for representing language in natural language processing tasks. Most natural
language processing models based on deep learning use pre-trained distributed
word representations, commonly called word embeddings. Choosing the
highest-quality word embeddings is of crucial importance for such models, yet
selecting the appropriate embeddings is a perplexing task because the projected
embedding space is not intuitive to humans. In this paper, we explore different
approaches for creating distributed word representations. We perform an
intrinsic evaluation of several state-of-the-art word embedding methods,
analysing how well they capture word similarities on existing benchmark
datasets of word-pair similarities. Specifically, we conduct a correlation
analysis between ground-truth word similarities and the similarities obtained
by the different word embedding methods.
Comment: Part of the 6th International Conference on Natural Language
Processing (NATP 2020)
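The intrinsic evaluation described above reduces to a concrete computation: score each benchmark word pair with the cosine similarity of its embeddings, then correlate those scores with the human ratings. A minimal sketch, assuming a generic embeddings mapping and a WordSim-353-style benchmark of (word1, word2, human_score) triples; the toy vectors below are placeholders, not the paper's data:

```python
import numpy as np
from scipy.stats import spearmanr

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def evaluate(embeddings, benchmark):
    """Spearman correlation between human similarity ratings and
    embedding cosine similarities, skipping out-of-vocabulary pairs."""
    human, model = [], []
    for w1, w2, score in benchmark:
        if w1 in embeddings and w2 in embeddings:
            human.append(score)
            model.append(cosine(embeddings[w1], embeddings[w2]))
    rho, _ = spearmanr(human, model)
    return rho

# Toy data for illustration; real evaluations use pre-trained vectors
# (e.g. word2vec, GloVe, fastText) and benchmarks such as WordSim-353.
rng = np.random.default_rng(0)
embeddings = {w: rng.normal(size=50) for w in ["car", "auto", "fruit"]}
benchmark = [("car", "auto", 8.9), ("car", "fruit", 2.1)]
print(f"Spearman rho: {evaluate(embeddings, benchmark):.3f}")
```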
Lightweight Adaptation of Neural Language Models via Subspace Embedding
Traditional neural word embeddings usually depend on a rich and diverse
vocabulary. As a result, language models devote a large share of their
parameters to the word embedding matrix; in multilingual language models in
particular, the embeddings generally account for a significant part of the
overall learnable parameters. In this work, we present a new compact embedding
structure that reduces the memory footprint of pre-trained language models at a
cost of up to 4% absolute accuracy. The embedding vectors are reconstructed
from a set of subspace embeddings together with an assignment procedure that
exploits the contextual relationships among tokens in the pre-trained language
model. We calibrate the subspace embedding structure to masked language models
and evaluate it on similarity, textual entailment, sentence, and paraphrase
tasks. Our experimental evaluation shows that the subspace embeddings achieve
compression rates beyond 99.8% relative to the original embeddings of the
language models on the XNLI and GLUE benchmark suites.
Comment: 5 pages, Accepted as a Main Conference Short Paper at CIKM 2023
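The compression arithmetic can be illustrated with a product-quantization-style reconstruction, one standard way to realise subspace embeddings: each token's vector is the concatenation of one small codeword per subspace, so storage drops from a full V-by-d float table to a few codebooks plus integer assignments. The paper derives its assignments from contextual relationships among tokens; the random assignment below is only a placeholder for that step, and all sizes are assumed for illustration:

```python
import numpy as np

V, d = 250_000, 768        # assumed vocabulary size and embedding width
n_sub, k = 4, 256          # subspaces and codewords per subspace
d_sub = d // n_sub         # width of each subspace slice

# Compact structure: one small codebook per subspace plus an integer
# codeword assignment per (token, subspace) pair.
codebooks = np.random.randn(n_sub, k, d_sub).astype(np.float32)
assignments = np.random.randint(0, k, size=(V, n_sub), dtype=np.uint8)

def reconstruct(token_id):
    """Rebuild a full d-dimensional embedding by concatenating the
    assigned codeword from each subspace codebook."""
    parts = [codebooks[s, assignments[token_id, s]] for s in range(n_sub)]
    return np.concatenate(parts)

vec = reconstruct(42)
assert vec.shape == (d,)

# Storage comparison: float32 table vs. codebooks + uint8 assignments.
full = V * d * 4
compact = codebooks.nbytes + assignments.nbytes
print(f"compression: {1 - compact / full:.4%} saved")  # ~99.77% here
```

With these assumed sizes the compact structure needs under 2 MB versus roughly 768 MB for the full table, which is the regime of the 99.8%-plus compression rates the abstract reports.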
Subproduct systems and Cartesian systems; new results on factorial languages and their relations with other areas
We point out that a sequence of natural numbers is the dimension sequence of
a subproduct system if and only if it is the cardinality sequence of a word
system (or factorial language). Determining such sequences is, therefore,
reduced to a purely combinatorial problem in the combinatorics of words. A
corresponding (and equivalent) result for graded algebras has been known in
abstract algebra, but this connection with pure combinatorics has not yet been
noticed by the product systems community. We also introduce Cartesian systems,
which can be seen either as a set-theoretic version of subproduct systems or as
an abstract version of word systems. Applying this, we provide several new results
on the cardinality sequences of word systems and the dimension sequences of
subproduct systems.
Comment: New title; added references; to appear in Journal of Stochastic
Analysis
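To make the combinatorial side concrete: a factorial language is a set of words closed under taking factors (contiguous subwords), and its cardinality sequence counts the words of each length. A small sketch, using the standard example of binary words avoiding the factor 11, whose cardinality sequence is the Fibonacci numbers; the abstract itself names no specific language, so this example is only illustrative:

```python
from itertools import product

def is_factorial(language):
    """A language is factorial iff every factor (contiguous subword)
    of every member is itself a member."""
    return all(w[i:j] in language
               for w in language
               for i in range(len(w))
               for j in range(i, len(w) + 1))

# Example: binary words containing no occurrence of the factor "11".
max_len = 8
language = {"".join(t)
            for n in range(max_len + 1)
            for t in product("01", repeat=n)
            if "11" not in "".join(t)}

assert is_factorial(language)

# Cardinality sequence: number of words of each length n.
counts = [sum(len(w) == n for w in language) for n in range(max_len + 1)]
print(counts)  # [1, 2, 3, 5, 8, 13, 21, 34, 55] -- Fibonacci numbers
```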