322 research outputs found

ParaPhraser: Russian paraphrase corpus and shared task

Author: Pivovarova Lidia
Pronoza Anton
Pronoza Ekaterina
Yagunova Elena
Publication venue: Springer
Publication date: 01/01/2017

The paper describes the results of the First Russian Paraphrase Detection Shared Task, held in St. Petersburg, Russia, in October 2016. Research on paraphrase extraction, detection and generation has been developing successfully for a long time, but interest in the problem has surged only recently in the Russian computational linguistics community. We try to close this gap by introducing the ParaPhraser.ru project, dedicated to collecting a Russian paraphrase corpus, and by organizing a Paraphrase Detection Shared Task that uses the corpus as training data. The participants applied a wide variety of techniques to paraphrase detection, from rule-based approaches to deep learning. The results reflect the following tendencies: the best scores are obtained by traditional classifiers combined with fine-grained linguistic features; however, complex neural networks, shallow methods and purely technical methods also demonstrate competitive results.

Helsingin yliopiston digitaalinen arkisto

TatWordNet: A Linguistic Linked Open Data-Integrated WordNet Resource for Tatar

Author: Galieva Alfiya
Ilvovsky Dmitry
Kirillovich Alexander
Loukachevitch Natalia
Nevzorova Olga
Shaekhov Marat
Publication venue: OASIcs - OpenAccess Series in Informatics. 3rd Conference on Language, Data and Knowledge (LDK 2021)
Publication date: 01/01/2021

We present the first release of TatWordNet (http://wordnet.tatar), a wordnet resource for Tatar. TatWordNet has been constructed by combining the expand and the merge approaches. The synsets of TatWordNet have been compiled by: (i) automatic conversion of concepts of TatThes, a socio-political Tatar thesaurus; (ii) semi-automatic translation of synsets of RuWordNet, a wordnet resource for Russian, followed by manual verification and correction; (iii) manual translation of base RuWordNet synsets; and (iv) manual translation of all the hypernyms of the previously translated RuWordNet synsets. The current version of TatWordNet contains 18,583 synsets, 36,540 lexical entries and 49,525 senses. The resource has been published to the Linguistic Linked Open Data cloud and interlinked with the Global WordNet Grid.

Dagstuhl Research Online Publication Server

ArAutoSenti: Automatic annotation and new tendencies for sentiment classification of Arabic messages

Author: Azouaou Faical
Chiclana Francisco
Guellil Imane
Publication venue: Springer Science and Business Media LLC
Publication date: 04/08/2020

A corpus-based sentiment analysis approach for messages written in Arabic and its dialects is presented and implemented. The originality of this approach resides in the automatic construction of the annotated sentiment corpus, which relies mainly on a sentiment lexicon that is also constructed automatically. For the classification step, shallow and deep classifiers are used, with features extracted using word embedding models. To validate the constructed corpus, a manual review was carried out, which found that 85.17% were correctly annotated. The approach is applied to the under-resourced Algerian dialect and tested on two external test corpora presented in the literature. The results are very encouraging, with an F1-score of up to 88% on the first test corpus and up to 81% on the second. These results represent improvements of 20% and 6%, respectively, over existing work in the research literature.

De Montfort University Open Research Archive

MS-TR: A Morphologically Enriched Sentiment Treebank and Recursive Deep Models for Compositional Semantics in Turkish

Author: Koç Ebubekir
Seçer Aydın
Zeybek Sultan
Publication venue: Informa UK Limited
Publication date: 01/01/2021

Recursive Deep Models have been used as powerful models to learn compositional representations of text for many natural language processing tasks. However, they require structured input (i.e. a sentiment treebank) to encode sentences according to their tree structure, so that they can learn the latent semantics of words using recursive composition functions. In this paper, we present our contributions to the construction of a Turkish Sentiment Treebank. We introduce MS-TR, a Morphologically Enriched Sentiment Treebank, built for training Recursive Deep Models to address compositional sentiment analysis for Turkish, one of the well-known Morphologically Rich Languages (MRLs). We propose a semi-supervised automatic annotation, as a distant-supervision approach, that uses morphological features of words to infer the polarity of the inner nodes of MS-TR as positive or negative. The proposed annotation model has four annotation levels: morph-level, stem-level, token-level, and review-level. The contribution of each annotation level was tested on datasets from three different domains: product reviews, movie reviews, and the Turkish Natural Corpus essays. Comparative results were obtained with the Recursive Neural Tensor Network (RNTN) model operating over MS-TR and with conventional machine learning methods. In the experiments, RNTN achieved much better accuracy than the baseline methods, which cannot accurately capture the aggregated sentiment information.

DSpace@FSM Vakif University

Knowledge Expansion of a Statistical Machine Translation System using Morphological Resources

Author: Ehrmann Maud
Turchi Marco
Publication venue: Centro de Innovación y Desarrollo Tecnológico en Cómputo, Instituto Politécnico Nacional, Mexico
Publication date: 09/08/2011

The translation capability of a Phrase-Based Statistical Machine Translation (PBSMT) system depends mostly on parallel data, and phrases that are not present in the training data are not translated correctly. This paper describes a method that efficiently expands the existing knowledge of a PBSMT system without adding more parallel data, using external morphological resources instead. A set of new phrase associations is added to the translation and reordering models; each corresponds to a morphological variation of the source phrase, the target phrase, or both phrases of an existing association. New associations are generated using a string similarity score based on morphosyntactic information. We tested our approach on En-Fr and Fr-En translation, and the results showed improved automatic scores (BLEU and Meteor) and fewer out-of-vocabulary (OOV) words. We believe that our knowledge expansion framework is generic and could be used to add different types of information to the model.

JRC Publications Repository

Cross-lingual sentiment analysis with machine translation: utility of training corpora and sentiment lexica

Author: Demirtas E.
Publication date: 31/10/2013

Pure OAI Repository

D3.8 Lexical-semantic analytics for NLP

Author: Campagnano Cesare
Costa Rute
de Does Jesse
Dobrovoljc Kaja
Frontini Francesca
Gantar Polona
Kallas Jelena
Koppel Kristina
Krek Simon
Langemets Margit
Martelli Federico
Maru Marco
Munda Tina
Navigli Roberto
Nimb Sanni
Olsen Sussi
Quochi Valeria
Salgado Ana de Castro
Tempelaars Rob
Tiberius Carole
Ureña-Ruiz Rafael-J.
Velardi Paola
Čibej Jaka
Publication venue: ELEXIS - European Lexicographic Infrastructure
Publication date: 01/01/2022

The present document illustrates the work carried out in task 3.3 (work package 3) of the ELEXIS project, focused on lexical-semantic analytics for Natural Language Processing (NLP). This task aims at computing analytics for lexical-semantic information such as words, senses and domains in the available resources, investigating their role in NLP applications. Specifically, the task concentrates on three research directions: i) sense clustering, in which grouping senses based on their semantic similarity improves the performance of NLP tasks such as Word Sense Disambiguation (WSD); ii) domain labeling of text, in which the lexicographic resources made available by the ELEXIS project for research purposes allow better performance to be achieved; and iii) analysing the diachronic distribution of senses, for which a software package is made available.

Repositório da Universidade Nova de Lisboa

Proceedings of the 13th Linguistic Annotation Workshop, August 1, 2019, Florence, Italy

Author: Friedrich Annemarie
Hoek Jet
Zeyrek Deniz
Publication date: 07/07/2023

OPUS Augsburg