9 research outputs found
Evaluating Multilingual Sentence Representation Models in a Real Case Scenario
In this paper, we present an evaluation of sentence representation models on the paraphrase detection task. The evaluation is designed to simulate a real-world problem of plagiarism and is based on one of the most important cases of forgery in modern history: the so-called "Protocols of the Elders of Zion". The sentence pairs for the evaluation are taken from the infamous forged text "Protocols of the Elders of Zion" (Protocols), by unknown authors, and from "Dialogue in Hell between Machiavelli and Montesquieu" by Maurice Joly. Scholars have demonstrated that the first text plagiarizes the second, indicating all the forged parts on qualitative grounds. Following this evidence, we organized the rephrased texts and asked native speakers to quantify the level of similarity between each pair. We used this material to evaluate sentence representation models in two languages, English and French, and on three tasks: similarity correlation, paraphrase identification, and paraphrase retrieval. Our evaluation aims at encouraging the development of benchmarks based on real-world problems, as a means to prevent problems connected to AI hype, and to use NLP technologies for social good. Through our evaluation, we are able to confirm that the infamous Protocols are actually a plagiarized text but, as we will show, we encounter several problems connected with the convoluted nature of the task, which is very different from the ones reported in standard benchmarks of paraphrase detection and sentence similarity. Code and data are available at https://github.com/roccotrip/protocols
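The similarity correlation task mentioned in this abstract can be illustrated with a minimal sketch: score each sentence pair with the cosine similarity of its embeddings, then compute the Spearman rank correlation against human judgments. This is not the authors' evaluation code; the embeddings and human scores below are made-up toy values, the function names are hypothetical, and the choice of Spearman correlation is an assumption (it is a common choice for this kind of evaluation).

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def ranks(values):
    """1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)

def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Toy data (invented for illustration): one embedding pair per sentence pair,
# plus one human similarity judgment per pair.
embedding_pairs = [
    ([1.0, 0.0], [1.0, 0.1]),
    ([1.0, 0.0], [0.5, 0.5]),
    ([1.0, 0.0], [0.0, 1.0]),
]
human_scores = [4.5, 2.0, 0.5]

model_scores = [cosine(u, v) for u, v in embedding_pairs]
correlation = spearman(model_scores, human_scores)
print(f"Spearman correlation: {correlation:.3f}")
```

In a real setting the vectors would come from a sentence representation model and the human scores from the annotated sentence pairs.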
XL-AMR: Enabling Cross-Lingual AMR Parsing with Transfer Learning Techniques
Abstract Meaning Representation (AMR) is a popular formalism of natural language that represents the meaning of a sentence as a semantic graph. It is agnostic about how to derive meanings from strings and for this reason it lends itself well to the encoding of semantics across languages. However, cross-lingual AMR parsing is a hard task, because training data are scarce in languages other than English and the existing English AMR parsers are not directly suited to being used in a cross-lingual setting. In this work we tackle these two problems so as to enable cross-lingual AMR parsing: we explore different transfer learning techniques for producing automatic AMR annotations across languages and develop a cross-lingual AMR parser, XL-AMR. This can be trained on the produced data and does not rely on AMR aligners or source-copy mechanisms as is commonly the case in English AMR parsing. The results of XL-AMR significantly surpass those previously reported in Chinese, German, Italian and Spanish. Finally we provide a qualitative analysis which sheds light on the suitability of AMR across languages. We release XL-AMR at github.com/SapienzaNLP/xl-amr
From shallow to whole-sentence semantics: semantic parsing in English and beyond
Humans want to speak to computers using the same language they speak to each other, rather than the symbolic and structured language machines are designed to process. Indeed, enabling a machine to process and interpret text automatically and then communicate verbally is one of the critical goals of the Natural Language Processing (NLP) and, more broadly, Artificial Intelligence (AI) fields. Moreover, computers are expected not only to process written text but also to understand it at the semantic and pragmatic level, a goal that is further defined within the Natural Language Understanding (NLU) subfield. NLU aims at overcoming language ambiguities and complexities to enable machines to read and comprehend text. Therefore, to achieve this goal, we need computers capable of taking text as input, preferably in any language, and parsing it into semantic representations which can be used as an interface between humans and computer language. To this end, a crucial issue faced by NLP researchers is how to devise a language that is interpretable by machines and at the same time expresses the meaning of natural language, a problem known as the Semantic Parsing task. Semantic representations usually take the form of graph-like structures where words in a sentence are interconnected according to different semantic relations. Over time, this task has garnered increasing attention, with researchers developing various formalisms that capture complementary aspects of meaning. Two of the most popular formalisms in NLP that capture different levels of sentence semantics are Semantic Role Labeling (SRL), often referred to as shallow Semantic Parsing, and Abstract Meaning Representation (AMR), a popular complete formal language for Semantic Parsing which includes SRL, among other NLP tasks.
Both SRL and AMR have been widely studied in NLP research, with a large number of approaches proposed to deal with task specificities and the challenges they pose, aiming at achieving human-like performance. In particular, the majority of SRL works rely on task-specific sequence labeling approaches. In addition, they often make use of third-party components to solve subtasks of SRL, leading to non-end-to-end approaches. We observe a similar trend in AMR-related research, where each aspect of meaning is treated as a different constituent in a long pipeline. These complexities, which we will elaborate on further in this thesis, may hinder the effectiveness of the models in out-of-distribution settings while also making it more challenging to integrate SRL and AMR structures in downstream tasks of NLU efficiently. Another long-standing problem in NLP is that of enabling research in languages other than English. Especially in the context of AMR, the English dependency problem is even more evident, given that it was initially designed to represent the meaning of English sentences. In this thesis we investigate the aforementioned problems in SRL, including both dependency- and span-based SRL formulations, and in AMR, including AMR parsing (the task of converting utterances into an AMR graph) and its mirror counterpart, AMR generation (the task of generating natural language utterances from an AMR graph). We focus on relieving the burden of complex, task-specific architectures for English SRL and AMR by casting them as sequence generation problems, motivated by the ever-growing success of general-purpose sequence-to-sequence methodologies in NLP in recent years. Furthermore, we dispose of the previously necessary third-party dependencies in AMR parsing, thus achieving a full symmetry with its dual counterpart, AMR generation.
Additionally, we make use of the sequence-to-sequence paradigm and transfer learning techniques to enable cross-lingual AMR parsing, the task of learning English-centric structures to represent meaning in multiple languages.
One SPRING to Rule Them Both: Symmetric AMR Semantic Parsing and Generation without a Complex Pipeline
In Text-to-AMR parsing, current state-of-the-art semantic parsers use cumbersome pipelines integrating several different modules or components, and exploit graph recategorization, i.e., a set of content-specific heuristics that are developed on the basis of the training set. However, the generalizability of graph recategorization in an out-of-distribution setting is unclear. In contrast, state-of-the-art AMR-to-Text generation, which can be seen as the inverse of parsing, is based on simpler seq2seq approaches. In this paper, we cast Text-to-AMR and AMR-to-Text as a symmetric transduction task and show that by devising a careful graph linearization and extending a pretrained encoder-decoder model, it is possible to obtain state-of-the-art performance in both tasks using the very same seq2seq approach, i.e., SPRING (Symmetric PaRsIng aNd Generation). Our model does not require complex pipelines, nor heuristics built on heavy assumptions. In fact, we drop the need for graph recategorization, showing that this technique is actually harmful outside of the standard benchmark. Finally, we outperform the previous state of the art on the English AMR 2.0 dataset by a large margin: on Text-to-AMR we obtain an improvement of 3.6 Smatch points, while on AMR-to-Text we outperform the state of the art by 11.2 BLEU points. We release the software at github.com/SapienzaNLP/spring
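To give a rough idea of what "graph linearization" means here, the sketch below turns a small rooted graph into a flat, PENMAN-style string by depth-first traversal, which is the kind of sequence a seq2seq model can read or produce. It is an illustration only and not SPRING's actual linearization scheme; the dictionary layout, the `linearize` function, and the toy graph for "The boy wants to go" are assumptions made for this example.

```python
def linearize(graph, root):
    """Depth-first, PENMAN-style linearization of a rooted graph.

    `graph` is a hypothetical structure with two keys:
      - "concepts": maps each variable to its concept label
      - "edges": maps each variable to a list of (role, target) pairs
    """
    visited = set()

    def visit(node):
        if node in visited:
            # Re-entrant node: emit only the variable, not a new subtree.
            return node
        visited.add(node)
        parts = [f"({node} / {graph['concepts'][node]}"]
        for role, target in graph["edges"].get(node, []):
            parts.append(f"{role} {visit(target)}")
        return " ".join(parts) + ")"

    return visit(root)

# Toy graph for "The boy wants to go"; the boy is both the wanter and the goer.
graph = {
    "concepts": {"w": "want-01", "b": "boy", "g": "go-02"},
    "edges": {
        "w": [(":ARG0", "b"), (":ARG1", "g")],
        "g": [(":ARG0", "b")],
    },
}

linearized = linearize(graph, "w")
print(linearized)
# (w / want-01 :ARG0 (b / boy) :ARG1 (g / go-02 :ARG0 b))
```

Because the string preserves variables for re-entrant nodes, the same format can in principle be parsed back into a graph, which is what makes treating parsing and generation as symmetric sequence transduction possible.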
STEPS: Semantic Typing of Event Processes with a Sequence-to-Sequence Approach
Enabling computers to comprehend the intent of human actions by processing language is one of the fundamental goals of Natural Language Understanding. An emerging task in this context is that of free-form event process typing, which aims at understanding the overall goal of a protagonist in terms of an action and an object, given a sequence of events. This task was initially treated as a learning-to-rank problem by exploiting the similarity between processes and action/object textual definitions. However, this approach appears to be overly complex, binds the output types to a fixed inventory for possible word definitions and, moreover, leaves space for further enhancements as regards performance. In this paper, we advance the field by reformulating the free-form event process typing task as a sequence generation problem and put forward STEPS, an end-to-end approach for producing user intent in terms of actions and objects only, dispensing with the need for their definitions. In addition to this, we eliminate several dataset constraints set by previous works, while at the same time significantly outperforming them. We release the data and software at https://github.com/SapienzaNLP/steps
BabelNet Meaning Representation: A Fully Semantic Formalism to Overcome Language Barriers
Conceptual representations of meaning have long been the general focus of Artificial Intelligence (AI) towards the fundamental goal of machine understanding, with innumerable efforts made in Knowledge Representation, Speech and Natural Language Processing, Computer Vision, inter alia. Even today, at the core of Natural Language Understanding lies the task of Semantic Parsing, the objective of which is to convert natural sentences into machine-readable representations. Through this paper, we aim to revamp the historical dream of AI, by putting forward a novel, all-embracing, fully semantic meaning representation, that goes beyond the many existing formalisms. Indeed, we tackle their key limits by fully abstracting text into meaning and introducing language-independent concepts and semantic relations, in order to obtain an interlingual representation. Our proposal aims to overcome the language barrier, and connect not only texts across languages, but also images, videos, speech and sound, and logical formulas, across many fields of AI
Generating Senses and RoLes: An End-to-End Model for Dependency- and Span-based Semantic Role Labeling
Despite the recent great success of the sequence-to-sequence paradigm in Natural Language Processing, the majority of current studies in Semantic Role Labeling (SRL) still frame the problem as a sequence labeling task. In this paper we go against the flow and propose GSRL (Generating Senses and RoLes), the first sequence-to-sequence model for end-to-end SRL. Our approach benefits from recently-proposed decoder-side pretraining techniques to generate both sense and role labels for all the predicates in an input sentence at once, in an end-to-end fashion. Evaluated on standard gold benchmarks, GSRL achieves state-of-the-art results in both dependency- and span-based English SRL, proving empirically that our simple generation-based model can learn to produce complex predicate-argument structures. Finally, we propose a framework for evaluating the robustness of an SRL model in a variety of synthetic low-resource scenarios which can aid human annotators in the creation of better, more diverse, and more challenging gold datasets. We release GSRL at github.com/SapienzaNLP/gsrl
SPRING Goes Online: End-to-End AMR Parsing and Generation
In this paper we present SPRING Online Services, a Web interface and RESTful APIs for our state-of-the-art AMR parsing and generation system, SPRING (Symmetric PaRsIng aNd Generation). The Web interface has been developed to be easily used by the Natural Language Processing community, as well as by the general public. It provides, among other things, a highly interactive visualization platform and a feedback mechanism to obtain user suggestions for further improvements of the system’s output. Moreover, our RESTful APIs enable easy integration of SPRING in downstream applications where AMR structures are needed. Finally, we make SPRING Online Services freely available at http://nlp.uniroma1.it/spring and, in addition, we release extra model checkpoints to be used with the original SPRING Python code
IR like a SIR: Sense-enhanced Information Retrieval for Multiple Languages
With the advent of contextualized embeddings, attention towards neural ranking approaches for Information Retrieval has increased considerably. However, two aspects have remained largely neglected: i) queries usually consist of only a few keywords, which increases ambiguity and makes their contextualization harder, and ii) performing neural ranking on non-English documents is still cumbersome due to the shortage of labeled datasets. In this paper we present SIR (Sense-enhanced Information Retrieval) to mitigate both problems by leveraging word sense information. At the core of our approach lies a novel multilingual query expansion mechanism based on Word Sense Disambiguation that provides sense definitions as additional semantic information for the query. Importantly, we use senses as a bridge across languages, thus allowing our model to perform considerably better than its supervised and unsupervised alternatives across French, German, Italian and Spanish on several CLEF benchmarks, while being trained on English Robust04 data only. We release SIR at https://github.com/SapienzaNLP/sir
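The idea of expanding a query with sense definitions can be sketched in a few lines. This is a toy illustration and not the SIR implementation: it assumes that a Word Sense Disambiguation system has already picked one definition per ambiguous query term, and the `expand_query` function, the separator, and the example definition are all hypothetical.

```python
def expand_query(query, disambiguated_senses, separator=" | "):
    """Append the definition of each disambiguated query term to the query.

    `disambiguated_senses` maps a lowercased term to the definition of the
    sense chosen for it (assumed to come from an external WSD system).
    """
    definitions = []
    for term in query.split():
        definition = disambiguated_senses.get(term.lower())
        # Skip terms without a sense and avoid repeating a definition.
        if definition and definition not in definitions:
            definitions.append(definition)
    return separator.join([query] + definitions)

# Hypothetical WSD output for a short keyword query.
senses = {"bank": "a financial institution that accepts deposits"}
expanded = expand_query("bank interest rates", senses)
print(expanded)
```

The expanded string gives a neural ranker more context than the bare keywords, which is the intuition behind using definitions as additional semantic information.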