
    Results of the Translation Inference Across Dictionaries 2021 Shared Task

    The objective of the Translation Inference Across Dictionaries (TIAD) shared task is to explore and compare methods and techniques that infer translations indirectly between language pairs, based on other bilingual/multilingual lexicographic resources. In this fourth edition, the participating systems were asked to generate new translations automatically among three languages - English, French, Portuguese - based on known indirect translations contained in the Apertium RDF graph. These evaluation pairs have remained the same over the last three TIAD editions. The main novelties this time were the use of a larger graph, Apertium RDF v2, as the basis for producing the translations, and the introduction of improved evaluation metrics. The results were evaluated by the organisers against manually compiled language pairs from K Dictionaries. For the first time in the TIAD series, some systems beat the proposed baselines. This paper gives an overall description of the shared task, the evaluation data and methodology, and the systems' results.
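    To make the setting concrete, the sketch below illustrates the general idea of pivot-based translation inference that TIAD systems build on: two bilingual word lists sharing a pivot language are composed, and candidate translations are scored by how many pivot words link them. This is a minimal illustration with toy data, not the shared task baseline or any participant's actual system.

    # Minimal sketch of pivot-based translation inference (illustrative toy data only).
    from collections import defaultdict

    # toy bilingual word lists: source -> pivot and pivot -> target
    en_to_es = {"house": {"casa"}, "wall": {"pared", "muro"}}
    es_to_fr = {"casa": {"maison"}, "pared": {"mur", "paroi"}, "muro": {"mur"}}

    def infer_translations(src_piv, piv_tgt):
        """Score source->target candidates by the number of pivot words linking them."""
        scores = defaultdict(lambda: defaultdict(int))
        for src, pivots in src_piv.items():
            for piv in pivots:
                for tgt in piv_tgt.get(piv, set()):
                    scores[src][tgt] += 1
        return scores

    if __name__ == "__main__":
        for src, candidates in infer_translations(en_to_es, es_to_fr).items():
            ranked = sorted(candidates.items(), key=lambda kv: -kv[1])
            print(src, ranked)
        # 'wall' -> [('mur', 2), ('paroi', 1)]: 'mur' is reachable via two pivot words,
        # the kind of indirect evidence that translation inference systems exploit.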

    Translation inference by concept propagation

    This paper describes our contribution to the Third Shared Task on Translation Inference across Dictionaries (TIAD-2020). We describe an approach to translation inference based on symbolic methods, namely the propagation of concepts over a graph of interconnected dictionaries: given a mapping from source language words to lexical concepts (e.g., synsets) as a seed, we use bilingual dictionaries to extrapolate a mapping of pivot and target language words to these lexical concepts. Translation inference is then performed by looking up the lexical concept(s) of a source language word and returning the target language word(s) for which these lexical concepts have the highest score. We present two instantiations of this system: one using WordNet synsets as concepts, and one using lexical entries (translations) as concepts. With a threshold of 0, the latter configuration ranks second among participant systems in terms of F1 score. We also describe additional evaluation experiments on Apertium data, a comparison with an earlier approach based on embedding projection, and an approach for constrained projection that outperforms the TIAD-2020 vanilla system by a large margin.
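    As an illustration of the propagation idea described above, the following minimal sketch maps source words to concept identifiers, propagates those concepts to target words through a bilingual dictionary, and returns the target words whose concepts overlap best with a given source word. All identifiers and data are illustrative; the actual system operates on WordNet synsets or Apertium translations over a much larger dictionary graph.

    # Minimal sketch of concept propagation over a bilingual dictionary (toy data).
    from collections import defaultdict

    # seed: source-language words -> lexical concepts (e.g., synset identifiers)
    source_concepts = {"dog": {"synset:canine.n.02"}, "hound": {"synset:canine.n.02"}}

    # bilingual dictionary: source word -> translations in the target language
    bilingual = {"dog": {"Hund"}, "hound": {"Jagdhund", "Hund"}}

    def propagate(source_concepts, bilingual):
        """Propagate concept labels from source words to their dictionary translations."""
        target_concepts = defaultdict(lambda: defaultdict(int))
        for src, concepts in source_concepts.items():
            for tgt in bilingual.get(src, set()):
                for concept in concepts:
                    target_concepts[tgt][concept] += 1  # simple count as a score
        return target_concepts

    def translate(word, source_concepts, target_concepts):
        """Return target words sharing concepts with `word`, ranked by score."""
        concepts = source_concepts.get(word, set())
        scored = {
            tgt: sum(sc for c, sc in cs.items() if c in concepts)
            for tgt, cs in target_concepts.items()
        }
        return sorted((t for t, s in scored.items() if s > 0), key=lambda t: -scored[t])

    if __name__ == "__main__":
        tc = propagate(source_concepts, bilingual)
        print(translate("dog", source_concepts, tc))  # ['Hund', 'Jagdhund']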

    A Twitter Corpus and Lexicon for Abusive Speech Detection in Serbian

    Abusive speech in social media, including profanities, derogatory and hate speech, has reached the level of a pandemic. A system able to detect such texts could help make the Internet and social media a better and more respectful virtual space. Research and commercial applications in this area have so far focused mainly on the English language. This paper presents the work on building AbCoSER, the first corpus of abusive speech in Serbian. The corpus consists of 6,436 manually annotated tweets, of which 1,416 were labelled as using some kind of abusive speech. These 1,416 tweets were further sub-classified, for instance into those using vulgar language, hate speech, or derogatory language. In this paper, we explain the process of data acquisition, annotation, and corpus construction. We also discuss the results of an initial analysis of the annotation quality. Finally, we present the structure of an abusive speech lexicon and its enrichment with abusive triggers extracted from the AbCoSER dataset.

    Cross-Lingual Link Discovery for Under-Resourced Languages

    In this paper, we provide an overview of current technologies for cross-lingual link discovery, and we discuss challenges, experiences and prospects of their application to under-resourced languages. We first introduce the goals of cross-lingual linking and associated technologies, and in particular the role that the Linked Data paradigm (Bizer et al., 2011) applied to language data can play in this context. We define under-resourced languages with a specific focus on languages actively used on the internet, i.e., languages with a digitally versatile speaker community but limited support in terms of language technology. We argue that for languages for which considerable amounts of textual data and (at least) a bilingual word list are available, techniques for cross-lingual linking can be readily applied, and that these enable the implementation of downstream applications for under-resourced languages via the localisation and adaptation of existing technologies and resources.

    When linguistics meets web technologies. Recent advances in modelling linguistic linked data

    This article provides an up-to-date and comprehensive survey of models (including vocabularies, taxonomies and ontologies) used for representing linguistic linked data (LLD). It focuses on the latest developments in the area and both builds upon and complements previous works covering similar territory. The article begins with an overview of recent trends which have had an impact on linked data models and vocabularies, such as the growing influence of the FAIR guidelines, the funding of several major projects in which LLD is a key component, and the increasing importance of the relationship of the digital humanities with LLD. Next, we give an overview of some of the best-known vocabularies and models in LLD. After this we look at some of the latest developments in community standards and initiatives such as OntoLex-Lemon, as well as recent work carried out on corpora and annotation in LLD, including a discussion of the LLD metadata vocabularies META-SHARE and lime, and of language identifiers. In the following part of the paper we look at work realised in a number of recent projects which has had a significant impact on LLD vocabularies and models.

    Lynx: A knowledge-based AI service platform for content processing, enrichment and analysis for the legal domain

    The EU-funded project Lynx focuses on the creation of a knowledge graph for the legal domain (Legal Knowledge Graph, LKG) and its use for the semantic processing, analysis and enrichment of documents from the legal domain. This article describes the use cases covered in the project, the platform developed, and the semantic analysis services that operate on the documents.

    Linking discourse marker inventories

    The paper describes the first comprehensive edition of machine-readable discourse marker lexicons. Discourse markers such as and, because, but, though or thereafter are essential communicative signals in human conversation, as they indicate how an utterance relates to its communicative context. As much of this information is implicit or expressed differently in different languages, discourse parsing, context-adequate natural language generation and machine translation are considered particularly challenging aspects of Natural Language Processing. Providing this data in a machine-readable, standard-compliant form will thus facilitate such technical tasks and, moreover, allow techniques for translation inference to be explored for this particular group of lexical resources, which has previously been largely neglected in the context of Linguistic Linked (Open) Data.

    Modelling frequency and attestations for OntoLex-Lemon

    The OntoLex vocabulary enjoys increasing popularity as a means of publishing lexical resources with RDF and as Linked Data. The recent publication of a new OntoLex module for lexicography, lexicog, reflects its increasing importance for digital lexicography. However, not all aspects of digital lexicography have been covered to the same extent. In particular, supplementary information drawn from corpora, such as frequency information, links to attestations, and collocation data, was considered to be beyond the scope of lexicog. Therefore, the OntoLex community has put forward a proposal for a novel module for frequency, attestation and corpus information (FrAC), which not only covers the requirements of digital lexicography but also accommodates essential data structures for lexical information in natural language processing. This paper introduces the current state of the OntoLex-FrAC vocabulary, describing its structure, selected use cases, elementary concepts and fundamental definitions, with a focus on frequency and attestations.
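    As a rough illustration of the kind of data the module targets, the following hypothetical Python/rdflib sketch attaches a corpus frequency observation to a lexical entry. The FrAC namespace IRI and the frac:frequency, frac:CorpusFrequency and frac:corpus terms used here are assumptions made for illustration only; the published OntoLex-FrAC specification is authoritative.

    # Hypothetical sketch of attaching corpus frequency to a lexical entry with rdflib.
    # FrAC namespace and term names below are assumed, not quoted from the specification.
    from rdflib import Graph, Namespace, Literal, BNode, URIRef
    from rdflib.namespace import RDF, XSD

    ONTOLEX = Namespace("http://www.w3.org/ns/lemon/ontolex#")
    FRAC = Namespace("http://www.w3.org/ns/lemon/frac#")   # assumed namespace IRI
    EX = Namespace("http://example.org/lexicon#")

    g = Graph()
    g.bind("ontolex", ONTOLEX)
    g.bind("frac", FRAC)
    g.bind("ex", EX)

    entry = EX["bank-n"]
    g.add((entry, RDF.type, ONTOLEX.LexicalEntry))

    # frequency observation: how often the entry occurs in a given corpus
    freq = BNode()
    g.add((entry, FRAC.frequency, freq))                     # assumed property
    g.add((freq, RDF.type, FRAC.CorpusFrequency))            # assumed class
    g.add((freq, FRAC.corpus, URIRef("http://example.org/corpora/news2020")))
    g.add((freq, RDF.value, Literal(1234, datatype=XSD.nonNegativeInteger)))

    print(g.serialize(format="turtle"))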