Search CORE

21 research outputs found

Czech Text Document Corpus v 2.0

Author: Král Pavel
Lenc Ladislav
Publication venue
Publication date: 30/01/2018
Field of study

This paper introduces "Czech Text Document Corpus v 2.0", a collection of text documents for automatic document classification in Czech language. It is composed of the text documents provided by the Czech News Agency and is freely available for research purposes at http://ctdc.kiv.zcu.cz/. This corpus was created in order to facilitate a straightforward comparison of the document classification approaches on Czech data. It is particularly dedicated to evaluation of multi-label document classification approaches, because one document is usually labelled with more than one label. Besides the information about the document classes, the corpus is also annotated at the morphological layer. This paper further shows the results of selected state-of-the-art methods on this corpus to offer the possibility of an easy comparison with these approaches.Comment: Accepted for LREC 201

arXiv.org e-Print Archive

LINDAT/CLARIN digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University

RuSentNE-2023: Evaluating Entity-Oriented Sentiment Analysis on Russian News Texts

Author: Golubev Anton
Loukachevitch Natalia
Rusnachenko Nicolay
Publication venue
Publication date: 28/05/2023
Field of study

The paper describes the RuSentNE-2023 evaluation devoted to targeted sentiment analysis in Russian news texts. The task is to predict sentiment towards a named entity in a single sentence. The dataset for RuSentNE-2023 evaluation is based on the Russian news corpus RuSentNE having rich sentiment-related annotation. The corpus is annotated with named entities and sentiments towards these entities, along with related effects and emotional states. The evaluation was organized using the CodaLab competition framework. The main evaluation measure was macro-averaged measure of positive and negative classes. The best results achieved were of 66% Macro F-measure (Positive+Negative classes). We also tested ChatGPT on the test set from our evaluation and found that the zero-shot answers provided by ChatGPT reached 60% of the F-measure, which corresponds to 4th place in the evaluation. ChatGPT also provided detailed explanations of its conclusion. This can be considered as quite high for zero-shot application.Comment: 12 pages, 5 tables, 3 figure

arXiv.org e-Print Archive

Media monitoring and information extraction for the highly inflected agglutinative language Hungarian

Author: Eszter Simon
Júlia Pajzs
Leonida Della Rocca
Maud Ehrmann
Mohamed Ebrahim
Ralf Steinberger
Stefano Bucci
Tamás Váradi
Publication venue: ELRA
Publication date: 01/01/2014
Field of study

The Europe Media Monitor (EMM) is a fully-automatic system that analyses written online news by gathering articles in over 70 languages and by applying text analysis software for currently 21 languages, without using linguistic tools such as parsers, part-of-speech taggers or morphological analysers. In this paper, we describe the effort of adding to EMM Hungarian text mining tools for news gathering; document categorisation; named entity recognition and classification for persons, organisations and locations; name lemmatisation; quotation recognition; and cross-lingual linking of related news clusters. The major challenge of dealing with the Hungarian language is its high degree of inflection and agglutination. We present several experiments where we apply linguistically light-weight methods to deal with inflection and we propose a method to overcome the challenges. We also present detailed frequency lists of Hungarian person and location name suffixes, as found in real-life news texts. This empirical data can be used to draw further conclusions and to improve existing Named Entity Recognition software. Within EMM, the solutions described here will also be applied to other morphologically complex languages such as those of the Slavic language family. The media monitoring and analysis system EMM is freely accessible online via the web pag

CiteSeerX

Repository of the Academy's Library

Tulajdonnevek felismerése az Európai Média Monitor magyar moduljának fejlesztésében

Author: Pajzs Júlia
Publication venue
Publication date: 01/01/2015
Field of study

ELTE Digital Institutional Repository (EDIT)

Tulajdonnevek felismerése az Európai Média Monitor magyar moduljának fejlesztésében

Author: Pajzs Júlia
Publication venue
Publication date: 01/01/2015
Field of study

Repository of the Academy's Library

ELTE Digital Institutional Repository (EDIT)

Cross-lingual transfer of knowledge in distributional language models: Experiments in Hungarian

Author: Novák Attila
Novák Borbála
Publication venue: 'Akademiai Kiado Zrt.'
Publication date: 01/01/2022
Field of study

Repository of the Academy's Library

Tehisnärvivõrgul põhinevate lemmatiseerijate võrdlev analüüs eesti keeles

Author: Leman Laura Katrin
Publication venue: Tartu Ülikool
Publication date: 01/06/2019
Field of study

https://www.ester.ee/record=b5242129*es

DSpace at Tartu University Library

CLARIN. The infrastructure for language resources

Author: Fišer Darja
Witt Andreas
Publication venue: 'Walter de Gruyter GmbH'
Publication date: 17/10/2022
Field of study

CLARIN, the "Common Language Resources and Technology Infrastructure", has established itself as a major player in the field of research infrastructures for the humanities. This volume provides a comprehensive overview of the organization, its members, its goals and its functioning, as well as of the tools and resources hosted by the infrastructure. The many contributors representing various fields, from computer science to law to psychology, analyse a wide range of topics, such as the technology behind the CLARIN infrastructure, the use of CLARIN resources in diverse research projects, the achievements of selected national CLARIN consortia, and the challenges that CLARIN has faced and will face in the future. The book will be published in 2022, 10 years after the establishment of CLARIN as a European Research Infrastructure Consortium by the European Commission (Decision 2012/136/EU)

Publikationsserver des Instituts für Deutsche Sprache

CLARIN

Author
Publication venue: 'Walter de Gruyter GmbH'
Publication date: 30/01/2023
Field of study

The book provides a comprehensive overview of the Common Language Resources and Technology Infrastructure – CLARIN – for the humanities. It covers a broad range of CLARIN language resources and services, its underlying technological infrastructure, the achievements of national consortia, and challenges that CLARIN will tackle in the future. The book is published 10 years after establishing CLARIN as an Europ. Research Infrastructure Consortium

Directory of Open Access Books (DOAB)

CLARIN

Author
Publication venue: 'Walter de Gruyter GmbH'
Publication date
Field of study

OAPEN Library