36 research outputs found
CLARIN
The book provides a comprehensive overview of the Common Language Resources and Technology Infrastructure – CLARIN – for the humanities. It covers a broad range of CLARIN language resources and services, its underlying technological infrastructure, the achievements of national consortia, and challenges that CLARIN will tackle in the future. The book is published ten years after the establishment of CLARIN as a European Research Infrastructure Consortium.
CLARIN. The infrastructure for language resources
CLARIN, the "Common Language Resources and Technology Infrastructure", has established itself as a major player in the field of research infrastructures for the humanities. This volume provides a comprehensive overview of the organization, its members, its goals and its functioning, as well as of the tools and resources hosted by the infrastructure. The many contributors representing various fields, from computer science to law to psychology, analyse a wide range of topics, such as the technology behind the CLARIN infrastructure, the use of CLARIN resources in diverse research projects, the achievements of selected national CLARIN consortia, and the challenges that CLARIN has faced and will face in the future.
The book will be published in 2022, ten years after the establishment of CLARIN as a European Research Infrastructure Consortium by the European Commission (Decision 2012/136/EU).
SGL: Speaking the Graph Languages of Semantic Parsing via Multilingual Translation.
Graph-based semantic parsing aims to represent textual meaning through directed graphs. As one of the most promising general-purpose meaning representations, these structures and their parsing have gained significant momentum in recent years, with several diverse formalisms being proposed. Yet, owing to this very heterogeneity, most of the research effort has focused on solutions specific to a given formalism. In this work, instead, we reframe semantic parsing towards multiple formalisms as Multilingual Neural Machine Translation (MNMT), and propose SGL, a many-to-many seq2seq architecture trained with an MNMT objective. Backed by several experiments, we show that this framework is indeed effective once the learning procedure is enhanced with large parallel corpora coming from Machine Translation: we report competitive performance on AMR and UCCA parsing, especially when paired with pre-trained architectures. Furthermore, we find that models trained under this configuration scale remarkably well to tasks such as cross-lingual AMR parsing: SGL outperforms all its competitors by a large margin without even explicitly seeing non-English-to-AMR examples at training time and, once these examples are included as well, sets a new state of the art in this task. We release our code and our models for research purposes at https://github.com/SapienzaNLP/sgl.
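The MNMT framing described in the abstract can be sketched as follows: each target formalism is treated as just another "language" for a shared seq2seq model, signalled by a target-formalism tag prepended to the input, as in multilingual NMT. The tag format and the toy graph serializations below are illustrative assumptions, not SGL's actual vocabulary.

```python
# Illustrative sketch of framing multi-formalism semantic parsing as
# multilingual translation: a tag tells the shared model which graph
# "language" to translate into. Tag names and serializations are made up.

def make_training_pair(source_text: str, target_graph: str, formalism: str) -> tuple:
    """Prefix the source sentence with a target-formalism tag,
    mirroring the language tokens used in multilingual NMT."""
    return (f"<to_{formalism}> {source_text}", target_graph)

pairs = [
    make_training_pair("The boy wants to go",
                       "(w / want-01 :ARG0 (b / boy) :ARG1 (g / go-02 :ARG0 b))",
                       "amr"),
    make_training_pair("The boy wants to go",
                       "[H [A the boy] [P wants] [A to go]]",
                       "ucca"),
]
print(pairs[0][0])  # <to_amr> The boy wants to go
```

One shared encoder-decoder can then be trained on all such pairs at once, which is what lets MT parallel corpora and pre-trained seq2seq models transfer to graph parsing.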
Understanding and generating language with abstract meaning representation
Abstract Meaning Representation (AMR) is a semantic representation for natural language that encompasses annotations related to traditional tasks such as Named Entity Recognition (NER), Semantic Role Labeling (SRL), Word Sense Disambiguation (WSD), and Coreference Resolution. AMR represents sentences as graphs, where nodes represent concepts and edges represent semantic relations between them.

Sentences are represented as graphs rather than trees because nodes can have multiple incoming edges, called reentrancies. This thesis investigates the impact of reentrancies on parsing (from text to AMR) and generation (from AMR to text). For the parsing task, we showed that it is possible to adapt techniques from tree parsing to deal with reentrancies. To better analyze the quality of AMR parsers, we developed a set of fine-grained metrics and found that state-of-the-art parsers predict reentrancies poorly. Hence we provided a classification of the linguistic phenomena causing reentrancies, categorized the types of errors parsers make with respect to reentrancies, and showed that correcting these errors can lead to significant improvements. For the generation task, we showed that neural encoders that have access to reentrancies outperform those that do not, demonstrating the importance of reentrancies for generation as well.

This thesis also discusses the problem of using AMR for languages other than English. Annotating new AMR datasets for other languages is an expensive process and requires defining annotation guidelines for each new language. It is therefore reasonable to ask whether we can share AMR annotations across languages. We provided evidence that AMR datasets for English can be successfully transferred to other languages: we trained parsers for Italian, Spanish, German, and Chinese to investigate the cross-linguality of AMR. We showed cases where translational divergences between languages pose a problem and cases where they do not. In summary, this thesis demonstrates the impact of reentrancies in AMR and provides insights on AMR for languages that do not yet have AMR datasets.
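The reentrancies the abstract centers on can be made concrete with a toy example. In "The boy wants to go", the concept for "boy" is the :ARG0 of both want-01 and go-02, so its node has two incoming edges; the adjacency representation and helper below are a minimal illustrative sketch, not an actual AMR library.

```python
# Toy AMR-style graph for "The boy wants to go", as (source, relation, target)
# triples. "boy" has two incoming edges, making the structure a graph
# rather than a tree -- this is a reentrancy.
edges = [
    ("want-01", ":ARG0", "boy"),
    ("want-01", ":ARG1", "go-02"),
    ("go-02",   ":ARG0", "boy"),   # second incoming edge to "boy"
]

def reentrant_nodes(edges):
    """Return the nodes with more than one incoming edge."""
    indegree = {}
    for _, _, target in edges:
        indegree[target] = indegree.get(target, 0) + 1
    return {node for node, deg in indegree.items() if deg > 1}

print(reentrant_nodes(edges))  # {'boy'}
```

A tree parser can never produce the second edge into "boy", which is why the thesis adapts tree-parsing techniques specifically to recover such nodes.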
Multiword expressions at length and in depth
The annual workshop on multiword expressions has taken place since 2001 in conjunction with major computational linguistics conferences and attracts the attention of an ever-growing community working on a variety of languages, linguistic phenomena, and related computational processing issues. MWE 2017 took place in Valencia, Spain, and represented a vibrant panorama of the current research landscape on the computational treatment of multiword expressions, featuring many high-quality submissions. Furthermore, MWE 2017 included the first shared task on multilingual identification of verbal multiword expressions. The shared task, together with extensive communal work, has produced important multilingual resources and mobilised several research groups in computational linguistics worldwide. This book contains extended versions of selected papers from the workshop. Authors worked hard to include detailed explanations, broader and deeper analyses, and exciting new results, which were thoroughly reviewed by an internationally renowned committee. We hope that this distinctly joint effort will provide a meaningful and useful snapshot of the multilingual state of the art in multiword expression modelling and processing, and will be a point of reference for future work.
Automated Translation with Interlingual Word Representations
In this thesis we investigate the use of translation systems that employ a transfer phase with interlingual representations of words. In this way we approach the problem of lexical ambiguity in machine translation as two separate tasks: word sense disambiguation and lexical selection. First, the words in the source language are disambiguated on the basis of their meaning, resulting in interlingual word representations. Then a lexical selection module selects the most suitable word in the target language. We give a detailed description of the development and evaluation of translation systems for Dutch-English. This provides a background for the experiments in the second and third parts of this thesis. We then describe a method that determines the meaning of words. It is comparable to the classic Lesk algorithm, since it exploits the idea that words shared between the context of a word and its definition provide information about its meaning. However, we instead use word and sense vectors to compute the similarity between the definition of a sense and the context of a word. We additionally apply our method to locating and interpreting puns. Finally, we present a model for lexical choice that selects lemmas given the abstract representations of words. We do this by converting the grammatical trees into hidden Markov trees. In this way, the optimal combination of lemmas and their contexts can be computed.
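The embedding-based Lesk variant described in the abstract can be sketched as follows: each candidate sense is scored by the cosine similarity between the averaged vectors of its definition words and of the ambiguous word's context, and the highest-scoring sense wins. The three-dimensional vectors and sense inventory below are made-up illustrations, not the thesis's actual data.

```python
# Toy sketch of Lesk-style WSD with vectors instead of word overlap:
# score each sense by cosine(avg(context vectors), avg(definition vectors)).
import math

VECS = {  # made-up 3-d word vectors for illustration
    "money": [1.0, 0.1, 0.0], "deposit": [0.9, 0.2, 0.1],
    "account": [0.8, 0.0, 0.3],
    "river": [0.0, 1.0, 0.1], "water": [0.1, 0.9, 0.2],
}

def avg(words):
    """Average the vectors of the known words."""
    vecs = [VECS[w] for w in words if w in VECS]
    return [sum(dim) / len(vecs) for dim in zip(*vecs)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def disambiguate(context, senses):
    """Pick the sense whose definition is closest to the context."""
    ctx = avg(context)
    return max(senses, key=lambda s: cosine(ctx, avg(senses[s])))

senses = {"bank_financial": ["money", "deposit", "account"],
          "bank_river": ["river", "water"]}
print(disambiguate(["deposit", "money"], senses))  # bank_financial
```

Unlike classic Lesk, this scores senses even when the context and the definition share no words at all, as long as their vectors are close.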