Automatic propbank generation for Turkish
Semantic role labeling (SRL) is an important task for understanding natural languages, where the objective is to analyse the propositions expressed by a verb and to identify each word that bears a semantic role. It provides an extensive dataset to enhance NLP applications such as information retrieval, machine translation, information extraction, and question answering. However, creating SRL models is difficult; for some languages it is infeasible to create SRL models with predicate-argument structure due to a lack of linguistic resources. In this paper, we present our method for creating an automatic Turkish PropBank by exploiting parallel data from the translated sentences of the English PropBank. Experiments show that our method gives promising results. © 2019 Association for Computational Linguistics (ACL).
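The core idea of exploiting parallel data is to copy each source-side SRL label to its aligned token in the translation. The sketch below is a minimal illustration of that token-level transfer, not the paper's actual system; the function name, label scheme, and example sentences are assumptions.

```python
# Hypothetical sketch of token-to-token annotation projection:
# given an SRL-labelled English sentence, its translation, and word
# alignments, copy each label to the aligned target token.

def project_labels(src_labels, alignments, tgt_len):
    """src_labels: one SRL tag per source token (e.g. 'ARG0', 'V', 'O').
    alignments: list of (src_idx, tgt_idx) pairs from a word aligner.
    Returns one label per target token; unaligned tokens stay 'O'."""
    tgt_labels = ["O"] * tgt_len
    for src_idx, tgt_idx in alignments:
        if src_labels[src_idx] != "O":
            tgt_labels[tgt_idx] = src_labels[src_idx]
    return tgt_labels

# English: "The cat drank milk" -> Turkish: "Kedi süt içti"
src = ["O", "ARG0", "V", "ARG1"]
align = [(1, 0), (3, 1), (2, 2)]
print(project_labels(src, align, 3))  # ['ARG0', 'ARG1', 'V']
```

In practice the alignments come from an automatic word aligner, so errors in the alignment propagate directly into the projected labels.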
FRASIMED: a Clinical French Annotated Resource Produced through Crosslingual BERT-Based Annotation Projection
Natural language processing (NLP) applications such as named entity recognition (NER) for low-resource corpora do not benefit from recent advances in large language models (LLMs), as there is still a need for larger annotated datasets. This research article introduces a methodology for generating translated versions of annotated datasets through crosslingual annotation projection. Leveraging a language-agnostic BERT-based approach, it is an efficient way to enlarge low-resource corpora with little human effort, using only already available open data resources. Quantitative and qualitative evaluations are often lacking when it comes to assessing the quality and effectiveness of semi-automatic data generation strategies; the evaluation of our crosslingual annotation projection approach showed both effectiveness and high accuracy in the resulting dataset. As a practical application of this methodology, we present the creation of the French Annotated Resource with Semantic Information for Medical Entities Detection (FRASIMED), an annotated corpus comprising 2,051 synthetic clinical cases in French. The corpus is now available for researchers and practitioners to develop and refine French NLP applications in the clinical field (https://zenodo.org/record/8355629), making it the largest open annotated corpus with linked medical concepts in French.
Global methods for cross-lingual semantic role and predicate labelling
We address the problem of transferring semantic annotations to new languages using parallel corpora. Previous work has transferred these annotations on a token-to-token basis, an approach that is sensitive to alignment errors and translation shifts. We present a global approach to transfer that aggregates information across the whole parallel corpus and leads to more robust labellers. We build two global models, one for predicate labelling and one for role labelling, each tailored to the task at hand. We show that the combination of direct and global methods outperforms previous results.
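One way to picture the contrast with token-to-token transfer: instead of trusting each individual alignment, a global predicate labeller can pool all projected senses for a verb lemma across the corpus and keep the consensus. This is a deliberately simplified illustration (a majority vote); the paper's actual global models are task-specific and richer.

```python
# Illustrative sketch, not the authors' code: aggregate projected
# predicate senses per lemma over the whole parallel corpus, so that
# isolated alignment errors are outvoted by the corpus-wide signal.
from collections import Counter, defaultdict

def global_predicate_labels(projected):
    """projected: (lemma, projected_sense) pairs gathered from
    token-level transfer over the entire corpus.
    Returns the majority sense for each lemma."""
    votes = defaultdict(Counter)
    for lemma, sense in projected:
        votes[lemma][sense] += 1
    return {lemma: counts.most_common(1)[0][0]
            for lemma, counts in votes.items()}

pairs = [("eat", "eat.01"), ("eat", "eat.01"), ("eat", "eat.02")]
print(global_predicate_labels(pairs))  # {'eat': 'eat.01'}
```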
Adapting Semantic Role Labeling to New Genres and Languages
Semantic role labeling (SRL) is the identification of semantic predicates and their participants within a sentence, which is vital for deeper natural language understanding. State-of-the-art SRL models require annotated text for training, but those annotations do not exist for many languages and domains, and the ability to annotate new corpora is hampered by limited time and budget. We explore two different ways of reducing the annotation required to produce SRL systems for new domains or languages: active learning and annotation projection.
Active learning reduces annotation requirements by selecting only the most informative training instances through an iterative process of training and annotation. In this work, we investigate the use of Bayesian Active Learning by Disagreement, ways of tuning it for SRL, and its performance across multiple corpora. We study the choices made by different selection methods over the course of iterations, examining vocabulary coverage, diversity, predicates selected, and shifts in confidence. We also explore the impact of various strategies for selecting the initial training data, and investigate a number of potentially influential factors within batches of queries, such as diversity and disagreement scores. To reduce the overhead of training time, we additionally compare the effect of increasing the number of queries selected on each iteration.
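Bayesian Active Learning by Disagreement (BALD) scores an instance by how much stochastic model samples disagree about its label: the entropy of the mean prediction minus the mean entropy of the individual predictions. A minimal sketch of that acquisition score, assuming per-instance class probabilities from K stochastic forward passes (e.g. MC dropout); the function names are illustrative.

```python
# Minimal BALD acquisition score: BALD = H[E[p]] - E[H[p]].
# High when the sampled predictions disagree (epistemic uncertainty),
# near zero when they are confidently identical.
import math

def entropy(p):
    return -sum(x * math.log(x) for x in p if x > 0)

def bald_score(probs):
    """probs: K samples, each a probability distribution over labels."""
    k = len(probs)
    mean_p = [sum(sample[i] for sample in probs) / k
              for i in range(len(probs[0]))]
    return entropy(mean_p) - sum(entropy(s) for s in probs) / k

confident = [[0.9, 0.1], [0.9, 0.1]]      # samples agree
disagreeing = [[0.9, 0.1], [0.1, 0.9]]    # samples disagree
assert bald_score(disagreeing) > bald_score(confident)
```

In an active-learning loop, the unlabelled instances with the highest BALD scores are sent to annotators, the model is retrained, and the process repeats.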
Abstract Meaning Representations (AMRs) are increasingly popular semantic representations of whole sentences. Based on our successful results using active learning to assess the informativeness of annotation instances for SRL, we look into whether the commonalities between these representations can be leveraged to supply targeted annotation for AMR parsing.
Finally, we explore annotation projection of SRL. This approach attempts to create semantic annotations in a target language from parallel translations whose source side has been given SRL annotations, through manual or automatic means. We assess the recently developed Russian PropBank and the feasibility of generating the same semantic annotations by projecting from the English PropBank annotation, using both our own system with English-Russian automatic word alignments and the recent Universal PropBanks 2.0. We examine the types of errors that arise from inconsistencies or gaps in annotations, as well as systemic issues arising from the strong English bias of the projections. This analysis leads us to develop several filtering techniques that improve the precision of the projections.
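Precision-oriented filtering of projected annotations can be as simple as rejecting frames whose projection looks structurally broken. The heuristics below are assumptions for illustration only, not the dissertation's exact filters: discard a projected frame when its predicate has no aligned target token, or when its projected role spans overlap.

```python
# Hedged illustration of filtering projected SRL frames; the specific
# heuristics here are hypothetical examples of precision filters.
def keep_projection(frame):
    """frame: dict with 'predicate' (target token index or None) and
    'roles' (list of (start, end, label) target-side spans).
    Reject frames with no aligned predicate or overlapping role spans."""
    if frame["predicate"] is None:
        return False  # predicate was lost in alignment
    spans = sorted((s, e) for s, e, _ in frame["roles"])
    for (s1, e1), (s2, e2) in zip(spans, spans[1:]):
        if s2 < e1:  # overlapping projected spans signal alignment noise
            return False
    return True

print(keep_projection({"predicate": 2,
                       "roles": [(0, 1, "ARG0"), (3, 4, "ARG1")]}))  # True
print(keep_projection({"predicate": None, "roles": []}))  # False
```

Such filters trade recall for precision: fewer target-language frames survive, but those that do are far more likely to be trustworthy training data.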