Bootstrapping Multilingual AMR with Contextual Word Alignments
We develop high-performance multilingual Abstract Meaning Representation (AMR)
systems by projecting English AMR annotations to other languages with weak
supervision. We achieve this goal by bootstrapping transformer-based
multilingual word embeddings, in particular those from cross-lingual RoBERTa
(XLM-R large). We develop a novel technique for foreign-text-to-English AMR
alignment, using the contextual word alignment between English and foreign
language tokens. This word alignment is weakly supervised and relies on the
contextualized XLM-R word embeddings. We achieve a highly competitive
performance that surpasses the best published results for German, Italian,
Spanish and Chinese.
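The core mechanism this abstract describes is aligning tokens across languages by comparing their contextual embeddings. A minimal sketch of such a similarity-based aligner, using random placeholder vectors where XLM-R outputs would go (the function name and greedy argmax strategy are illustrative assumptions, not the paper's exact method):

```python
import numpy as np

def align_words(src_emb: np.ndarray, tgt_emb: np.ndarray) -> list[tuple[int, int]]:
    """Greedy word alignment from cosine similarity of contextual embeddings.

    src_emb: (n_src, d) embeddings for source (e.g. English) tokens
    tgt_emb: (n_tgt, d) embeddings for target-language tokens
    Returns (target_index, best_source_index) pairs.
    """
    # Normalize rows so the dot product equals cosine similarity
    src = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    tgt = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    sim = tgt @ src.T  # (n_tgt, n_src) similarity matrix
    return [(j, int(sim[j].argmax())) for j in range(tgt.shape[0])]

# Toy demo: 3 target tokens whose embeddings are near-copies of source
# tokens 2, 0 and 3, standing in for translation-equivalent words.
rng = np.random.default_rng(0)
src = rng.normal(size=(4, 8))
tgt = src[[2, 0, 3]] + 0.01 * rng.normal(size=(3, 8))
print(align_words(src, tgt))  # each target aligns to the source it was copied from
```

In the paper's setting, the placeholder vectors would be replaced by contextualized XLM-R embeddings of real sentence pairs, and the resulting alignment used to project English AMR annotations onto the foreign tokens.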
Abstract syntax as interlingua: Scaling up the grammatical framework from controlled languages to robust pipelines
Abstract syntax is an interlingual representation used in compilers. Grammatical Framework (GF) applies the abstract syntax idea to natural languages. The development of GF started in 1998, first as a tool for controlled language implementations, where it has gained an established position in both academic and commercial projects. GF provides grammar resources for over 40 languages, enabling accurate generation and translation, as well as grammar engineering tools and components for mobile and Web applications. On the research side, the focus in the last ten years has been on scaling up GF to wide-coverage language processing. The concept of abstract syntax offers a unified view on many other approaches: Universal Dependencies, WordNets, FrameNets, Construction Grammars, and Abstract Meaning Representations. This makes it possible for GF to utilize data from the other approaches and to build robust pipelines. In return, GF can contribute to data-driven approaches with methods to transfer resources from one language to others, to augment data by rule-based generation, to check the consistency of hand-annotated corpora, and to pipe analyses into high-precision semantic back ends. This article gives an overview of the use of abstract syntax as interlingua through both established and emerging NLP applications involving GF.
Widely Interpretable Semantic Representation: Frameless Meaning Representation for Broader Applicability
This paper presents a novel semantic representation, WISeR, that overcomes
challenges for Abstract Meaning Representation (AMR). Despite its strengths,
AMR is not easily applied to languages or domains without predefined semantic
frames, and its use of numbered arguments results in semantic role labels,
which are not directly interpretable and are semantically overloaded for
parsers. We examine the numbered arguments of predicates in AMR and convert
them to thematic roles that do not require reference to semantic frames. We
create a new corpus of 1K English dialogue sentences annotated in both WISeR
and AMR. WISeR shows stronger inter-annotator agreement for beginner and
experienced annotators, with beginners becoming proficient in WISeR annotation
more quickly. Finally, we train a state-of-the-art parser on the AMR 3.0 corpus
and a WISeR corpus converted from AMR 3.0. The parser is evaluated on these
corpora and our dialogue corpus. The WISeR model exhibits higher accuracy than
its AMR counterpart across the board, demonstrating that WISeR is easier for
parsers to learn.
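The conversion this abstract describes replaces AMR's predicate-specific numbered arguments with frame-independent thematic roles. A hypothetical sketch of such a relabeling over AMR-style triples (the role inventory and mapping below are illustrative assumptions, not WISeR's actual tables, which depend on the predicate):

```python
# Illustrative mapping from AMR numbered arguments to thematic roles.
# The role names are common defaults (Agent, Theme, ...), NOT the actual
# WISeR inventory: in AMR, what :ARG0 means varies by predicate frame.
DEFAULT_ROLE_MAP = {
    ":ARG0": ":Agent",
    ":ARG1": ":Theme",
    ":ARG2": ":Goal",
}

def convert_roles(triples, role_map=DEFAULT_ROLE_MAP):
    """Relabel (source, relation, target) triples, leaving unknown relations as-is."""
    return [(s, role_map.get(r, r), t) for s, r, t in triples]

# "The boy wants something": w is the want-01 event, b the boy.
triples = [("w", ":instance", "want-01"),
           ("w", ":ARG0", "b"),
           ("b", ":instance", "boy"),
           ("w", ":ARG1", "g")]
print(convert_roles(triples))
```

The point of the conversion is that a label like `:Agent` is interpretable without consulting the `want-01` frame definition, which is what makes the representation usable for languages and domains that lack predefined frames.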
Character-based Neural Semantic Parsing
Humans and computers do not speak the same language. A lot of day-to-day tasks would be vastly more efficient if we could communicate with computers using natural language instead of relying on an interface. It is necessary, then, that the computer does not see a sentence as a collection of individual words, but instead can understand the deeper, compositional meaning of the sentence. A way to tackle this problem is to automatically assign a formal, structured meaning representation to each sentence, which is easy for computers to interpret. There have been quite a few attempts at this before, but these approaches were usually heavily reliant on predefined rules, word lists or representations of the syntax of the text. This made the general usage of these methods quite complicated. In this thesis we employ an algorithm that can learn to automatically assign meaning representations to texts, without using any such external resource. Specifically, we use a type of artificial neural network called a sequence-to-sequence model, in a process that is often referred to as deep learning. The devil is in the details, but we find that this type of algorithm can produce high-quality meaning representations, with better performance than the more traditional methods. Moreover, a main finding of the thesis is that, counterintuitively, it is often better to represent the text as a sequence of individual characters, and not words. This is likely the case because it helps the model in dealing with spelling errors, unknown words and inflections.
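The character-level input the thesis argues for can be produced by a trivial tokenizer. A minimal sketch (the explicit space marker is a common convention, not necessarily the thesis's exact scheme):

```python
def char_tokenize(sentence: str) -> list[str]:
    """Split a sentence into characters, marking spaces with an explicit
    symbol so the model can still recover word boundaries."""
    return ["_" if ch == " " else ch for ch in sentence]

print(char_tokenize("the dog barks"))
# A misspelling like "dgo" still shares most of its characters with "dog",
# which is one reason character models degrade gracefully on spelling
# errors and unseen inflections, as the abstract notes.
```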
The AMR-PT corpus and the semantic annotation of challenging sentences from journalistic and opinion texts
One of the most popular semantic representation languages in Natural Language Processing (NLP) is Abstract Meaning Representation (AMR). This formalism encodes the meaning of single sentences in directed rooted graphs. For English, there is a large annotated corpus that provides qualitative and reusable data for building or improving existing NLP methods and applications. For building AMR corpora for non-English languages, including Brazilian Portuguese, automatic and manual strategies have been conducted. The automatic annotation methods are essentially based on the cross-linguistic alignment of parallel corpora and the inheritance of the AMR annotation. The manual strategies focus on adapting the AMR English guidelines to a target language. Both annotation strategies have to deal with some challenging phenomena. This paper explores in detail some characteristics of Portuguese for which the AMR model had to be adapted and introduces two annotated corpora: AMRNews, a corpus of 870 annotated sentences from journalistic texts, and OpiSums-PT-AMR, comprising 404 opinionated sentences in AMR.
Understanding and generating language with abstract meaning representation
Abstract Meaning Representation (AMR) is a semantic representation for natural
language that encompasses annotations related to traditional tasks such as
Named Entity Recognition (NER), Semantic Role Labeling (SRL), word sense
disambiguation (WSD), and Coreference Resolution. AMR represents sentences
as graphs, where nodes represent concepts and edges represent semantic
relations between them.
Sentences are represented as graphs and not trees because nodes can have
multiple incoming edges, called reentrancies. This thesis investigates the impact
of reentrancies for parsing (from text to AMR) and generation (from AMR
to text). For the parsing task, we showed that it is possible to use techniques
from tree parsing and adapt them to deal with reentrancies. To better analyze
the quality of AMR parsers, we developed a set of fine-grained metrics
and found that state-of-the-art parsers predict reentrancies poorly. Hence we
provided a classification of linguistic phenomena causing reentrancies, categorized
the types of errors parsers make with respect to reentrancies, and showed
that correcting these errors can lead to significant improvements. For the generation
task, we showed that neural encoders that have access to reentrancies
outperform those that do not, demonstrating the importance of reentrancies
for generation as well.
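A reentrancy, as defined here, is simply a node with more than one incoming edge; detecting them over AMR-style triples is straightforward. A minimal sketch with a hand-written graph (not parser output):

```python
from collections import Counter

# "The boy wants to go": the node b (boy) is :ARG0 of both want-01 and
# go-02, so it has two incoming edges -- a reentrancy. This is why the
# structure is a graph rather than a tree.
triples = [
    ("w", ":instance", "want-01"),
    ("b", ":instance", "boy"),
    ("g", ":instance", "go-02"),
    ("w", ":ARG0", "b"),
    ("w", ":ARG1", "g"),
    ("g", ":ARG0", "b"),
]

def reentrant_nodes(triples):
    """Return nodes with more than one incoming (non-:instance) edge."""
    incoming = Counter(t for _, rel, t in triples if rel != ":instance")
    return {node for node, n in incoming.items() if n > 1}

print(reentrant_nodes(triples))  # only b is reentrant in this graph
```

Control structures, coreference and coordination all produce this pattern, which is what makes reentrancies harder for tree-based parsing techniques than ordinary edges.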
This thesis also discusses the problem of using AMR for languages other
than English. Annotating new AMR datasets for other languages is an expensive
process and requires defining annotation guidelines for each new language.
It is therefore reasonable to ask whether we can share AMR annotations
across languages. We provided evidence that AMR datasets for English
can be successfully transferred to other languages: we trained parsers for Italian,
Spanish, German, and Chinese to investigate the cross-linguality of AMR.
We showed cases where translational divergences between languages pose a
problem and cases where they do not. In summary, this thesis demonstrates
the impact of reentrancies in AMR as well as providing insights on AMR for
languages that do not yet have AMR datasets.