
    Comparative Analysis of Word Embeddings for Capturing Word Similarities

    Distributed language representations have become the most widely used technique for representing language in natural language processing tasks. Most deep-learning-based natural language processing models use pre-trained distributed word representations, commonly called word embeddings. Determining the highest-quality word embeddings is of crucial importance for such models. However, selecting the appropriate word embeddings is a difficult task, since the projected embedding space is not intuitive to humans. In this paper, we explore different approaches for creating distributed word representations. We perform an intrinsic evaluation of several state-of-the-art word embedding methods, analysing how well they capture word similarities on existing benchmark datasets of word-pair similarities. We conduct a correlation analysis between ground-truth word similarities and the similarities obtained by the different word embedding methods.
    Comment: Part of the 6th International Conference on Natural Language Processing (NATP 2020)
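    As a concrete illustration of the intrinsic evaluation described above, the sketch below scores benchmark word pairs by cosine similarity between embedding vectors and correlates the scores with human ratings via Spearman's rank correlation. The embeddings and ratings here are placeholder values, not the paper's data; only NumPy and `scipy.stats.spearmanr` are assumed.

```python
import numpy as np
from scipy.stats import spearmanr

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Placeholder inputs: `embeddings` maps words to vectors, and `benchmark`
# holds (word1, word2, human_rating) triples, as in WordSim-353-style sets.
embeddings = {w: np.random.default_rng(i).standard_normal(300)
              for i, w in enumerate(["tiger", "cat", "car", "automobile"])}
benchmark = [("tiger", "cat", 7.35),
             ("car", "automobile", 8.94),
             ("tiger", "automobile", 1.80)]

model_scores = [cosine(embeddings[a], embeddings[b]) for a, b, _ in benchmark]
human_scores = [rating for _, _, rating in benchmark]

# Rank correlation between model similarities and human judgements.
rho, p = spearmanr(model_scores, human_scores)
print(f"Spearman correlation: {rho:.3f} (p = {p:.3f})")
```

    With random placeholder vectors the correlation is meaningless; with real pre-trained embeddings, higher rho indicates better agreement with human similarity judgements.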

    Lightweight Adaptation of Neural Language Models via Subspace Embedding

    Traditional neural word embeddings usually depend on a rich and diverse vocabulary. As a result, language models devote a large share of their parameters to the word embedding layer; for multilingual language models in particular, the embeddings generally account for a significant part of the overall learned parameters. In this work, we present a new compact embedding structure that reduces the memory footprint of pre-trained language models at a sacrifice of up to 4% absolute accuracy. The embedding vectors are reconstructed from a set of subspace embeddings and an assignment procedure based on the contextual relationships among tokens in the pre-trained language model. We calibrate the subspace embedding structure to masked language models and evaluate it on similarity and textual entailment tasks as well as sentence and paraphrase tasks. Our experimental evaluation shows that the subspace embeddings achieve compression rates beyond 99.8% relative to the original embeddings for language models on the XNLI and GLUE benchmark suites.
    Comment: 5 pages, Accepted as a Main Conference Short Paper at CIKM 202
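    The abstract does not spell out the reconstruction procedure, so the following is a minimal sketch of one plausible reading: full embedding vectors are rebuilt by concatenating codewords from small per-subspace codebooks, selected by per-token assignment indices. All names and sizes are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

vocab_size, dim = 50_000, 768
n_subspaces, codebook_size = 4, 256        # dim must divide evenly
sub_dim = dim // n_subspaces

rng = np.random.default_rng(0)
# One small codebook of `codebook_size` vectors per subspace.
codebooks = rng.standard_normal((n_subspaces, codebook_size, sub_dim))
# Per-token assignments: which codeword each token uses in each subspace
# (in the paper these would come from contextual relationships among tokens).
assignments = rng.integers(0, codebook_size, size=(vocab_size, n_subspaces))

def reconstruct(token_id: int) -> np.ndarray:
    """Rebuild a full embedding by concatenating the assigned codewords."""
    parts = [codebooks[s, assignments[token_id, s]] for s in range(n_subspaces)]
    return np.concatenate(parts)

emb = reconstruct(token_id=42)             # shape (768,)

# Storage comparison: full embedding table vs. codebooks plus assignments.
full_params = vocab_size * dim
compact_params = codebooks.size + assignments.size
print(f"compression: 1 - {compact_params}/{full_params} "
      f"= {1 - compact_params / full_params:.4f}")
```

    With these toy sizes the sketch already removes about 99% of the embedding parameters; the paper's reported rates beyond 99.8% would depend on its actual subspace sizes and assignment scheme.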

    Subproduct systems and Cartesian systems; new results on factorial languages and their relations with other areas

    We point out that a sequence of natural numbers is the dimension sequence of a subproduct system if and only if it is the cardinality sequence of a word system (or factorial language). Determining such sequences is therefore reduced to a purely combinatorial problem in the combinatorics of words. A corresponding (and equivalent) result for graded algebras has been known in abstract algebra, but this connection with pure combinatorics has not yet been noticed by the product systems community. We also introduce Cartesian systems, which can be seen either as a set-theoretic version of subproduct systems or an abstract version of word systems. Applying this, we provide several new results on the cardinality sequences of word systems and the dimension sequences of subproduct systems.
    Comment: New title; added references; to appear in Journal of Stochastic Analysis
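    To make the correspondence tangible, here is a small brute-force sketch that computes the cardinality sequence of a factorial language. The "no 11" language below is a standard example (the golden mean shift), whose cardinality sequence is the Fibonacci numbers; by the equivalence stated above, that sequence is therefore also the dimension sequence of some subproduct system. The function names are our own, not from the paper.

```python
from itertools import product

def cardinality_sequence(alphabet, is_allowed, max_len):
    """Count the words of each length 0..max_len in a factorial language.

    `is_allowed` decides membership; closure under factors (subwords) is
    assumed, as in the word systems of the paper. Brute force, for
    illustration only.
    """
    return [sum(1 for w in product(alphabet, repeat=n)
                if is_allowed("".join(w)))
            for n in range(max_len + 1)]

# Factorial language: binary words containing no "11" factor. Forbidding a
# factor keeps the language closed under taking factors, so it is factorial.
seq = cardinality_sequence("01", lambda w: "11" not in w, max_len=8)
print(seq)  # [1, 2, 3, 5, 8, 13, 21, 34, 55] -- the Fibonacci numbers
```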