516 research outputs found
Romanian Language Technology — a view from an academic perspective
The article reports on research and developments pursued by the Research Institute for Artificial Intelligence "Mihai Draganescu" of the Romanian Academy in order to narrow the gaps identified by the deep analysis on the European languages made by Meta-Net white papers and published by Springer in 2012. Except English, all the European languages needed significant research and development in order to reach an adequate technological level, in line with the expectations and requirements of the knowledge society
Nabra: Syrian Arabic Dialects with Morphological Annotations
This paper presents Nabra, a corpora of Syrian Arabic dialects with
morphological annotations. A team of Syrian natives collected more than 6K
sentences containing about 60K words from several sources including social
media posts, scripts of movies and series, lyrics of songs and local proverbs
to build Nabra. Nabra covers several local Syrian dialects including those of
Aleppo, Damascus, Deir-ezzur, Hama, Homs, Huran, Latakia, Mardin, Raqqah, and
Suwayda. A team of nine annotators annotated the 60K tokens with full
morphological annotations across sentence contexts. We trained the annotators
to follow methodological annotation guidelines to ensure unique morpheme
annotations, and normalized the annotations. F1 and kappa agreement scores
ranged between 74% and 98% across features, showing the excellent quality of
Nabra annotations. Our corpora are open-source and publicly available as part
of the Currasat portal https://sina.birzeit.edu/currasat
KYOTO: A System for Mining, Structuring, and Distributing Knowledge Across Languages and Cultures
We outline work performed within the framework of a current EC project. The goal is to construct a language-independent information system for a specific domain (environment/ecology/biodiversity) anchored in a language-independent ontology that is linked to wordnets in seven languages. For each language, information extraction and identification of lexicalized concepts with ontological entries is carried out by text miners (?Kybots?). The mapping of language-specific lexemes to the ontology allows for crosslinguistic identification and translation of equivalent terms. The infrastructure developed within this project enables long-range knowledge sharing and transfer across many languages and cultures, addressing the need for global and uniform transition of knowledge beyond the specific domains addressed here
Media monitoring and information extraction for the highly inflected agglutinative language Hungarian
The Europe Media Monitor (EMM) is a fully-automatic system that analyses written online news by gathering articles in over 70 languages and by applying text analysis software for currently 21 languages, without using linguistic tools such as parsers, part-of-speech taggers or morphological analysers. In this paper, we describe the effort of adding to EMM Hungarian text mining tools for news gathering; document categorisation; named entity recognition and classification for persons, organisations and locations; name lemmatisation; quotation recognition; and cross-lingual linking of related news clusters. The major challenge of dealing with the Hungarian language is its high degree of inflection and agglutination. We present several experiments where we apply linguistically light-weight methods to deal with inflection and we propose a method to overcome the challenges. We also present detailed frequency lists of Hungarian person and location name suffixes, as found in real-life news texts. This empirical data can be used to draw further conclusions and to improve existing Named Entity Recognition software. Within EMM, the solutions described here will also be applied to other morphologically complex languages such as those of the Slavic language family. The media monitoring and analysis system EMM is freely accessible online via the web pag
- …