Search CORE

516 research outputs found

The Construction of a 500-Million-Word Reference Corpus of Contemporary Written Dutch

Author: A Van den Bosch
A Braasch
C Van Rijsbergen
G Aston
J Leveling
J Trapman
JC Carletta
M Recasens
M Reynaert
Martin W. C. Reynaert
W Daelemans
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2012

Crossref

Springer - Publisher Connector

Tilburg University Repository

Romanian Language Technology — a view from an academic perspective

Author: Tufiș Dan
Publication venue: 'Agora University of Oradea'
Publication date: 05/01/2022

The article reports on research and development carried out by the Research Institute for Artificial Intelligence "Mihai Draganescu" of the Romanian Academy to narrow the gaps identified in the in-depth analysis of European languages presented in the Meta-Net white papers, published by Springer in 2012. With the exception of English, all European languages needed significant research and development to reach an adequate technological level, in line with the expectations and requirements of the knowledge society.

Agora University Editing House: Journals

Nabra: Syrian Arabic Dialects with Morphological Annotations

Author: Hammouda Tymaa
Jarrar Mustafa
Kurdy Mohamad-Bassam
Nayouf Amal
Zaraket Fadi
Publication venue
Publication date: 26/10/2023

This paper presents Nabra, a corpus of Syrian Arabic dialects with morphological annotations. A team of Syrian natives collected more than 6K sentences containing about 60K words from several sources, including social media posts, scripts of movies and series, lyrics of songs, and local proverbs, to build Nabra. Nabra covers several local Syrian dialects, including those of Aleppo, Damascus, Deir-ezzur, Hama, Homs, Huran, Latakia, Mardin, Raqqah, and Suwayda. A team of nine annotators annotated the 60K tokens with full morphological annotations across sentence contexts. We trained the annotators to follow methodological annotation guidelines to ensure unique morpheme annotations, and normalized the annotations. F1 and kappa agreement scores ranged between 74% and 98% across features, showing the excellent quality of Nabra annotations. Our corpora are open-source and publicly available as part of the Currasat portal https://sina.birzeit.edu/currasat

arXiv.org e-Print Archive

KYOTO: A System for Mining, Structuring, and Distributing Knowledge Across Languages and Cultures

Author: Agirre Eneko
Calzolari Nicoletta
Fellbaum Christiane
Hsieh Shu-Kai
Huang Chu-Ren
Isahara Hitoshi
Kanzaki Kyoko
Marchetti Andrea
Monachini Monica
Neri Federico
Raffaelli Remo
Rigau German
Tesconi Maurizio
VanGent Joop
Vossen Piek
Publication venue: European Language Resources Association (ELRA)
Publication date

We outline work performed within the framework of a current EC project. The goal is to construct a language-independent information system for a specific domain (environment/ecology/biodiversity) anchored in a language-independent ontology that is linked to wordnets in seven languages. For each language, information extraction and identification of lexicalized concepts with ontological entries is carried out by text miners ("Kybots"). The mapping of language-specific lexemes to the ontology allows for crosslinguistic identification and translation of equivalent terms. The infrastructure developed within this project enables long-range knowledge sharing and transfer across many languages and cultures, addressing the need for global and uniform transition of knowledge beyond the specific domains addressed here.

PUblication MAnagement

Media monitoring and information extraction for the highly inflected agglutinative language Hungarian

Author: Eszter Simon
Júlia Pajzs
Leonida Della Rocca
Maud Ehrmann
Mohamed Ebrahim
Ralf Steinberger
Stefano Bucci
Tamás Váradi
Publication venue: ELRA
Publication date: 01/01/2014

The Europe Media Monitor (EMM) is a fully automatic system that analyses written online news by gathering articles in over 70 languages and applying text analysis software, currently for 21 languages, without using linguistic tools such as parsers, part-of-speech taggers or morphological analysers. In this paper, we describe the effort of adding Hungarian text mining tools to EMM for news gathering; document categorisation; named entity recognition and classification for persons, organisations and locations; name lemmatisation; quotation recognition; and cross-lingual linking of related news clusters. The major challenge in dealing with the Hungarian language is its high degree of inflection and agglutination. We present several experiments in which we apply linguistically light-weight methods to deal with inflection, and we propose a method to overcome the challenges. We also present detailed frequency lists of Hungarian person and location name suffixes, as found in real-life news texts. This empirical data can be used to draw further conclusions and to improve existing Named Entity Recognition software. Within EMM, the solutions described here will also be applied to other morphologically complex languages such as those of the Slavic language family. The media monitoring and analysis system EMM is freely accessible online via the web pag

CiteSeerX

Repository of the Academy's Library