Russian word sense induction by clustering averaged word embeddings
The paper reports our participation in the shared task on word sense
induction and disambiguation for the Russian language (RUSSE-2018). Our team
was ranked 2nd for the wiki-wiki dataset (containing mostly homonyms) and 5th
for the bts-rnc and active-dict datasets (containing mostly polysemous words)
among all 19 participants.
The method we employed was extremely naive. It involved representing contexts
of ambiguous words as averaged word embedding vectors, using off-the-shelf
pre-trained distributional models. Then, these vector representations were
clustered with mainstream clustering techniques, thus producing the groups
corresponding to the ambiguous word senses. As a side result, we show that word
embedding models trained on small but balanced corpora can be superior to those
trained on large but noisy data - not only in intrinsic evaluation, but also in
downstream tasks like word sense induction.
Comment: Proceedings of the 24th International Conference on Computational
Linguistics and Intellectual Technologies (Dialogue-2018)
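The pipeline described in this abstract (average the embeddings of the words in each context, then cluster the resulting vectors) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration: the two-dimensional `EMBEDDINGS` table is an invented toy stand-in for a pre-trained distributional model, the example contexts are made up, and plain k-means stands in for the unspecified "mainstream clustering techniques". The function names `average_context`, `kmeans`, and `induce_senses` are hypothetical and are not taken from the paper.

```python
import math

# Toy 2-dimensional "embeddings" (invented for illustration only);
# a real setup would load an off-the-shelf pre-trained model instead.
EMBEDDINGS = {
    "river": [0.9, 0.1], "water": [0.8, 0.2], "shore": [0.85, 0.05],
    "money": [0.1, 0.9], "loan": [0.2, 0.8], "account": [0.05, 0.95],
}


def average_context(tokens, embeddings):
    """Represent a context as the mean of its known word vectors."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        raise ValueError("context contains no words with an embedding")
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def kmeans(points, k, iterations=20):
    """Plain k-means, initialised deterministically with the first k points."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def induce_senses(contexts, embeddings, k):
    """Cluster averaged context vectors; each cluster is one induced sense."""
    points = [average_context(tokens, embeddings) for tokens in contexts]
    return kmeans(points, k)


# Four invented contexts of the ambiguous word "bank" (the target is omitted).
contexts = [
    ["river", "water"],
    ["money", "loan"],
    ["shore", "water"],
    ["account", "loan"],
]
labels = induce_senses(contexts, EMBEDDINGS, k=2)
print(labels)  # [0, 1, 0, 1]
```

In this toy run the two "river" contexts end up in one cluster and the two "money" contexts in the other. Initialising with the first `k` points keeps the sketch deterministic; a practical system would more likely use a library clusterer and would have to choose the number of senses per word.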
Experiments with Russian to Kazakh sentence alignment
Sentence alignment is the final step in building parallel corpora, and it arguably has the greatest impact on the quality of the resulting corpus and on the accuracy of machine translation systems that use it for training. However, the quality of sentence alignment itself depends on a number of factors. In this paper we investigate the impact of several data processing techniques on the quality of sentence alignment. We develop and use a number of automatic evaluation metrics, and provide empirical evidence that applying all of the considered data processing techniques yields bitexts with the lowest ratio of noise and the highest ratio of parallel sentences.
The TXM Portal Software giving access to Old French Manuscripts Online
Full text online: http://www.lrec-conf.org/proceedings/lrec2012/workshops/13.ProceedingsCultHeritage.pdf
This paper presents the new TXM software platform, which gives online access to images and tagged transcriptions of Old French manuscripts for concordancing and text mining. The platform can import medieval sources encoded in XML according to the TEI Guidelines for linking manuscript images to transcriptions and for encoding several diplomatic levels of transcription, including abbreviations and word-level corrections. It includes a sophisticated tokenizer able to deal with TEI tags at different levels of the linguistic hierarchy. Words are tagged on the fly during the import process using the IMS TreeTagger tool with a specific language model. Synoptic editions displaying manuscript images and text transcriptions side by side are produced automatically during the import process. Texts are organized in a corpus with their own metadata (title, author, date, genre, etc.), and several word-property indexes are produced for the CQP search engine to allow efficient word-pattern searches for building different types of frequency lists or concordances. For syntactically annotated texts, special indexes are produced for the Tiger Search engine to allow efficient building of syntactic concordances. The platform has also been tested on classical Latin, ancient Greek, Old Slavonic and Old Hieroglyphic Egyptian corpora (including various types of encoding and annotations).
Men Are from Mars, Women Are from Venus: Evaluation and Modelling of Verbal Associations
We present a quantitative analysis of human word association pairs and study
the types of relations presented in the associations. We put our main focus on
the correlation between response types and respondent characteristics such as
occupation and gender by contrasting syntagmatic and paradigmatic associations.
Finally, we propose a personalised distributed word association model and show
the importance of incorporating demographic factors into the models commonly
used in natural language processing.
Comment: AIST 2017 camera-ready
Word Representation Models for Morphologically Rich Languages in Neural Machine Translation
Dealing with the complex word forms in morphologically rich languages is an
open problem in language processing, and is particularly important in
translation. In contrast to most modern neural systems of translation, which
discard the identity for rare words, in this paper we propose several
architectures for learning word representations from character and morpheme
level word decompositions. We incorporate these representations in a novel
machine translation model which jointly learns word alignments and translations
via a hard attention mechanism. Evaluating on translating from several
morphologically rich languages into English, we show consistent improvements
over strong baseline methods, of between 1 and 1.5 BLEU points