59 research outputs found

From SGML to XML with TEI: Automated Conversion of a Corpus of Polish from P3 to P4 Format

Author: Ogrodniczuk Maciej
Publication venue: Adam Mickiewicz University Poznan
Publication date: 15/12/2004

The article presents experiences gathered in the process of migration of an SGML corpus encoded in TEI P3 format to XML-enabled TEI P4

Biblioteka Nauki - repozytorium artykułów

Investigationes Linguisticae

Inter-Annotator Agreement in Coreference Annotation of Polish

Author: Maciej Ogrodniczuk
Mateusz Kopeć
Publication date: 23/04/2020

This paper discusses different methods of estimating the inter-annotator agreement in manual annotation of Polish coreference and proposes a new BLANC-based annotation agreement metric. The commonly used agreement indicators are calculated for mention detection, semantic head annotation, near-identity markup and coreference resolution

CiteSeerX

The use of electronic historical dictionary data in corpus design

Author: Bronikowska Renata
Gruszczyński Włodzimierz
Ogrodniczuk Maciej
Woliński Marcin
Publication venue: Uniwersytet Jagiellonski - Wydawnictwo Uniwersytetu Jagiellonskiego
Publication date: 01/01/2016

The History of the 17th and 18th c. Polish Language Laboratory, Institute of Polish Language, Polish Academy of Sciences, is in the process of creating two large databases: The Electronic Dictionary of the 17th-18th c. Polish and The Electronic Corpus of the 17th and 18th c. Polish Texts (up to 1772), the latter in cooperation with the Institute of Computer Science, Polish Academy of Sciences. It is expected that combining these two sets of data will help to achieve the objectives established for both database projects. The present article shows the benefits that the Corpus creators can get from the data gathered in the dictionary, with special emphasis put on the use of grammatical information included in the dictionary entries to design tools for automatic text annotation in the Corpus

Portal Czasopism Naukowych (E-Journals)

Jagiellonian University Repository

Adopting ISO 24617-8 for Discourse Relations Annotation in Polish: Challenges and Future Directions

Author: Drozd Agata
Ogrodniczuk Maciej
Tomaszewska Aleksandra
Ziembicki Daniel
Żurowski Sebastian
Publication venue: NOVA CLUNL
Publication date: 01/08/2023

This paper explores a discourse relations annotation project carried out under the CLARIN-PL initiative, leveraging the ISO 24617-8 standard. The goal is to boost research interoperability and foster multilingual research. Our team of three linguist-annotators annotated a Polish-language corpus spanning several genres, including literature and press articles. This effort was guided by a project expert and external linguists from the CLARIN-PL language technology research infrastructure. Several significant challenges emerged during the process. Key issues included ambiguities within the ISO standard’s relation categories, unclear definitions for certain relation categories, and the difficulty of identifying and annotating implicit discourse relations, which lack explicit discourse connectives or signaling devices. To overcome these problems, we implemented strategies such as regular team meetings, collaborative annotation forms, and preliminary revisions to the annotation scheme. This paper presents the project and the annotation process, and offers initial annotation data on the discourse relations and connectives identified within the corpus. Looking forward, we discuss potential enhancements to the process, including additional revisions to the guidelines, and conclude with an overview of the project’s contributions and a discussion of our future development plans

Repository of Nicolaus Copernicus University

Polish Discourse Corpus (PDC): Corpus Design, ISO-Compliant Annotation, Data Highlights, and Parser Development

Author: Ogrodniczuk Maciej
Tomaszewska Aleksandra
Tuora Ryszard
Ziembicki Daniel
Zwierzchowska Aleksandra
Żurowski Sebastian
Publication venue: ELRA Language Resource Association
Publication date: 01/05/2024

This paper presents the Polish Discourse Corpus, a pioneering resource of this kind for Polish and the first corpus in Poland to employ the ISO standard for discourse relation annotation. The Polish Discourse Corpus adopts ISO 24617-8, a segment of the Language Resource Management – Semantic Annotation Framework (SemAF), which outlines a set of core discourse relations adaptable for diverse languages and genres. The paper overviews the corpus architecture, annotation procedures, the challenges that the annotators have encountered, as well as key statistical data concerning discourse relations and connectives in the corpus. It further discusses the initial phases of the discourse parser tailored for the ISO 24617-8 framework. Evaluations on the efficacy and potential refinement areas of the corpus annotation and parsing strategies are also presented. The final part of the paper touches upon anticipated research plans to improve discourse analysis techniques in the project and to conduct discourse studies involving multiple languages

Repository of Nicolaus Copernicus University

The Use of Electronic Historical Dictionary Data in Corpus Design

Author: Bronikowska Renata
Gruszczyński Włodzimierz
Ogrodniczuk Maciej
Woliński Marcin
Publication venue: Studies in Polish Linguistics
Publication date: 07/07/2016

The History of the 17th and 18th c. Polish Language Laboratory, Institute of Polish Language, Polish Academy of Sciences, is in the process of creating two large databases: The Electronic Dictionary of the 17th−18th c. Polish and The Electronic Corpus of the 17th and 18th c. Polish Texts (up to 1772), the latter in cooperation with the Institute of Computer Science, Polish Academy of Sciences. It is expected that combining these two sets of data will help to achieve the objectives established for both database projects. The present article shows the benefits that the Corpus creators can get from the data gathered in the dictionary, with special emphasis put on the use of grammatical information included in the dictionary entries to design tools for automatic text annotation in the Corpus

Portal Czasopism Naukowych (E-Journals)

The Use of Electronic Historical Dictionary Data in Corpus Design

Author: Bronikowska Renata
Gruszczyński Włodzimierz
Ogrodniczuk Maciej
Woliński Marcin
Publication venue: Studies in Polish Linguistics
Publication date: 14/06/2016

Portal Czasopism Naukowych (E-Journals)

Findings of the Shared Task on Multilingual Coreference Resolution

Author: Konopík Miloslav
Nedoluzhko Anna
Novák Michal
Ogrodniczuk Maciej
Popel Martin
Pražák Ondřej
Sido Jakub
Zeman Daniel
Zhu Yilun
Žabokrtský Zdeněk
Publication date: 16/09/2022

This paper presents an overview of the shared task on multilingual coreference resolution associated with the CRAC 2022 workshop. Shared task participants were supposed to develop trainable systems capable of identifying mentions and clustering them according to identity coreference. The public edition of CorefUD 1.0, which contains 13 datasets for 10 languages, was used as the source of training and evaluation data. The CoNLL score used in previous coreference-oriented shared tasks was used as the main evaluation metric. There were 8 coreference prediction systems submitted by 5 participating teams; in addition, there was a competitive Transformer-based baseline system provided by the organizers at the beginning of the shared task. The winning system outperformed the baseline by 12 percentage points (in terms of the CoNLL scores averaged across all datasets for individual languages)

arXiv.org e-Print Archive

The strategic impact of META-NET on the regional, national and international level

This article provides an overview of the dissemination work carried out in META-NET from 2010 until 2015; we describe its impact on the regional, national and international level, mainly with regard to politics and the funding situation for LT topics. The article documents the initiative's work throughout Europe in order to boost progress and innovation in our field.

Crossref

Institutional Repository Universiteit Antwerpen

The University of Manchester - Institutional Repository

Helsingin yliopiston digitaalinen arkisto

Utrecht University Repository

The ParlaMint corpora of parliamentary proceedings

Author: Agnoloni Tommaso
Barkarson Starkaður
Coole Matthew
Darǵis Roberts
de Does Jesse
de Macedo Luciana D.
Depuydt Katrien
Erjavec Tomaž
Fišer Darja
Kopp Matyáš
Krilavičius Tomas
Ljubešić Nikola
Luxardo Giancarlo
Marx Maarten
Morkevičius Vaidas
Navarretta Costanza
Ogrodniczuk Maciej
Osenova Petya
Pančur Andrej
Pérez María Calzada
Rayson Paul
Ring Orsolya
Rudolf Michał
Simov Kiril
Steingrímsson Steinþór
van Heusden Ruben
Venturi Giulia
Çöltekin Çağrı
Publication venue: Springer Science and Business Media LLC
Publication date: 01/01/2022

This paper presents the ParlaMint corpora, which contain transcriptions of the sessions of 17 European national parliaments, totalling half a billion words. The corpora are uniformly encoded, contain rich metadata about 11 thousand speakers, and are linguistically annotated following the Universal Dependencies formalism and with named entities. Samples of the corpora and conversion scripts are available from the project’s GitHub repository, and the complete corpora are openly available via the CLARIN.SI repository for download, as well as through the NoSketch Engine and KonText concordancers and the Parlameter interface for on-line exploration and analysis

PubMed Central

Copenhagen University Research Information System

Repositori Institucional de la Universitat Jaume I

Lancaster E-Prints

International Migration, Integration and Social Cohesion online publications

UvA-DARE