59 research outputs found
From SGML to XML with TEI: Automated Conversion of a Corpus of Polish from P3 to P4 Format
The article presents experience gathered while migrating an SGML corpus encoded in the TEI P3 format to the XML-enabled TEI P4 format
Inter-Annotator Agreement in Coreference Annotation of Polish
This paper discusses different methods of estimating inter-annotator agreement in the manual annotation of Polish coreference and proposes a new BLANC-based annotation agreement metric. The commonly used agreement indicators are calculated for mention detection, semantic head annotation, near-identity markup and coreference resolution
The use of electronic historical dictionary data in corpus design
The History of the 17th and 18th c. Polish Language Laboratory, Institute of Polish Language, Polish Academy of Sciences, is in the process of creating two large databases: The Electronic Dictionary of the 17th-18th c. Polish and The Electronic Corpus of the 17th and 18th c. Polish Texts (up to 1772), the latter in cooperation with the Institute of Computer Science, Polish Academy of Sciences. Combining these two resources is expected to help achieve the objectives of both projects. The present article shows the benefits that the Corpus creators can draw from the data gathered in the dictionary, with special emphasis on using the grammatical information included in the dictionary entries to build tools for automatic text annotation in the Corpus
Adopting ISO 24617-8 for Discourse Relations Annotation in Polish: Challenges and Future Directions
This paper explores a discourse relations annotation project carried out under the CLARIN-PL initiative, leveraging the ISO 24617-8 standard. The goal is to boost research interoperability and foster multilingual research. Our team of three linguist-annotators annotated a Polish-language corpus spanning several genres, including literature and press articles. This effort was guided by a project expert and external linguists from the CLARIN-PL language technology research infrastructure. Several significant challenges emerged during the process. Key issues included ambiguities within the ISO standard’s relation categories, poorly worded definitions of certain relation categories, and the difficulty of identifying and annotating implicit discourse relations, which lack explicit discourse connectives or signaling devices. To overcome these problems, we implemented strategies such as regular team meetings, collaborative annotation forms, and preliminary revisions to the annotation scheme. This paper presents the project and the annotation process, and offers initial annotation data on the discourse relations and connectives identified within the corpus. Looking forward, we discuss potential enhancements to the process, including additional revisions to the guidelines, and conclude with an overview of the project’s contributions and a discussion of our future development plans
Polish Discourse Corpus (PDC): Corpus Design, ISO-Compliant Annotation, Data Highlights, and Parser Development
This paper presents the Polish Discourse Corpus, a pioneering resource of this kind for Polish and the first corpus in Poland to employ the ISO standard for discourse relation annotation. The Polish Discourse Corpus adopts ISO 24617-8, a part of the Language Resource Management – Semantic Annotation Framework (SemAF), which outlines a set of core discourse relations adaptable to diverse languages and genres. The paper gives an overview of the corpus architecture, the annotation procedures, the challenges that the annotators encountered, and key statistical data on discourse relations and connectives in the corpus. It further discusses the initial phases of developing a discourse parser tailored to the ISO 24617-8 framework. Evaluations of the efficacy of the corpus annotation and parsing strategies, and of areas for their potential refinement, are also presented. The final part of the paper touches upon anticipated research plans to improve discourse analysis techniques in the project and to conduct discourse studies involving multiple languages
Findings of the Shared Task on Multilingual Coreference Resolution
This paper presents an overview of the shared task on multilingual coreference resolution associated with the CRAC 2022 workshop. Shared task participants were supposed to develop trainable systems capable of identifying mentions and clustering them according to identity coreference. The public edition of CorefUD 1.0, which contains 13 datasets for 10 languages, was used as the source of training and evaluation data. The CoNLL score used in previous coreference-oriented shared tasks was used as the main evaluation metric. There were 8 coreference prediction systems submitted by 5 participating teams; in addition, there was a competitive Transformer-based baseline system provided by the organizers at the beginning of the shared task. The winning system outperformed the baseline by 12 percentage points (in terms of the CoNLL scores averaged across all datasets for individual languages)
The strategic impact of META-NET on the regional, national and international level
This article provides an overview of the dissemination work carried out in META-NET from 2010 until 2015; we describe its impact on the regional, national and international level, mainly with regard to politics and the funding situation for LT topics. The article documents the initiative's work throughout Europe in order to boost progress and innovation in our field
The ParlaMint corpora of parliamentary proceedings
This paper presents the ParlaMint corpora, which contain transcriptions of the sessions of 17 European national parliaments and total half a billion words. The corpora are uniformly encoded, contain rich metadata about 11 thousand speakers, and are linguistically annotated following the Universal Dependencies formalism and with named entities. Samples of the corpora and conversion scripts are available from the project’s GitHub repository, and the complete corpora are openly available via the CLARIN.SI repository for download, as well as through the NoSketch Engine and KonText concordancers and the Parlameter interface for on-line exploration and analysis