54 research outputs found
Semantic relations between sentences: from lexical to linguistically inspired semantic features and beyond
This thesis is concerned with the identification of semantic equivalence between pairs of natural language
sentences, by studying and computing models to address Natural Language Processing tasks where some
form of semantic equivalence is assessed. In such tasks, given two sentences, our models output either
a class label, corresponding to the semantic relation between the sentences, based on a predefined set
of semantic relations, or a continuous score, corresponding to their similarity on a predefined scale. The
former setup corresponds to the tasks of Paraphrase Identification and Natural Language Inference, while
the latter corresponds to the task of Semantic Textual Similarity.
We present several models for English and Portuguese, where various types of features are considered,
for instance based on distances between alternative representations of each sentence, following lexical
and semantic frameworks, or embeddings from pre-trained Bidirectional Encoder Representations from
Transformers (BERT) models. For English, a new set of semantic features is proposed, derived from the formal semantic
representation of Discourse Representation Structure (DRS). For Portuguese, suitable corpora are scarce and formal
semantic representations are unavailable, hence an evaluation of currently available features and corpora is
conducted, following the modelling setup employed for English.
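As a rough illustration of features "based on distances between alternative representations of each sentence", the sketch below computes a few lexical distances for a sentence pair. The function names, the whitespace tokenizer, and the particular distances chosen are assumptions made for this example; the thesis abstract does not specify its actual feature set.

```python
def tokens(sentence):
    """Lowercase and split on whitespace (a stand-in for real tokenization)."""
    return sentence.lower().split()


def char_ngrams(sentence, n=3):
    """Character n-grams of the lowercased sentence."""
    s = sentence.lower()
    return [s[i:i + n] for i in range(max(len(s) - n + 1, 0))]


def jaccard_distance(a, b):
    """1 - |A & B| / |A | B| over the sets of items in a and b."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 0.0
    return 1.0 - len(sa & sb) / len(sa | sb)


def length_difference(a, b):
    """Absolute length difference, normalized by the longer sequence."""
    longest = max(len(a), len(b))
    return 0.0 if longest == 0 else abs(len(a) - len(b)) / longest


def pair_features(s1, s2):
    """Hypothetical feature vector for one sentence pair."""
    t1, t2 = tokens(s1), tokens(s2)
    return {
        "token_jaccard": jaccard_distance(t1, t2),
        "char3_jaccard": jaccard_distance(char_ngrams(s1), char_ngrams(s2)),
        "length_diff": length_difference(t1, t2),
    }


features = pair_features("the cat sleeps", "the dog sleeps")
```

A dictionary like `features` could then be fed to any standard classifier (for class labels) or regressor (for similarity scores), which matches the two task setups described above.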
Competitive results are achieved on all tasks, for both English and Portuguese, particularly when considering
that our models are based on generally available tools and technologies, and that all features and models are
suitable for computation on most modern computers, except for those based on embeddings. In particular,
for English, our semantic features from DRS improve the performance of other models when
integrated into the feature set of such models, and state-of-the-art results are achieved for Portuguese, with
models based on fine-tuning embeddings to a specific task.
Information fusion for automated question answering
Until recently, research efforts in automated Question Answering (QA) have mainly
focused on getting a good understanding of questions to retrieve correct answers. This
includes deep parsing, lookups in ontologies, question typing and machine learning
of answer patterns appropriate to question forms. In contrast, I have focused on the
analysis of the relationships between answer candidates as provided in open domain
QA on multiple documents. I argue that such candidates have intrinsic properties,
partly regardless of the question, and those properties can be exploited to provide better
quality and more user-oriented answers in QA.
Information fusion refers to the technique of merging pieces of information from
different sources. In QA over free text, it is motivated by the frequency with which
different answer candidates are found in different locations, leading to a multiplicity
of answers. The reason for such multiplicity is, in part, the massive amount of data
used for answering, and also its unstructured and heterogeneous content: besides ambiguities
in user questions leading to heterogeneity in extractions, systems have to deal
with redundancy, granularity and possibly contradictory information. Hence the need
for answer candidate comparison. While frequency has proved to be a significant characteristic
of a correct answer, I evaluate the value of other relationships characterizing
answer variability and redundancy.
Partially inspired by recent developments in multi-document summarization, I redefine
the concept of "answer" within an engineering approach to QA based on the
Model-View-Controller (MVC) pattern of user interface design. An "answer model"
is a directed graph in which nodes correspond to entities projected from extractions
and edges convey relationships between such nodes. The graph represents the fusion
of information contained in the set of extractions. Different views of the answer model
can be produced, capturing the fact that the same answer can be expressed and presented
in various ways: picture, video, sound, written or spoken language, or a formal
data structure. Within this framework, an answer is a structured object contained in the
model and retrieved by a strategy to build a particular view depending on the end user
(or task)'s requirements.
I describe shallow techniques to compare entities and enrich the model by discovering four broad categories of relationships between entities in the model: equivalence,
inclusion, aggregation and alternative. Quantitatively, answer candidate modeling improves
answer extraction accuracy. It also proves to be more robust to incorrect answer
candidates than traditional techniques. Qualitatively, models provide meta-information
encoded by relationships that allow shallow reasoning to help organize and generate
the final output.
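The "answer model" described above can be pictured with a small sketch: answer candidates become nodes of a directed graph, and shallow comparisons add labeled edges. The code below is only an illustration under stated assumptions. It covers two of the four relationship categories (equivalence, handled by merging identical normalized candidates, and inclusion, approximated by token-subset tests), and the names `normalize`, `build_answer_model` and `score` are hypothetical, not taken from the thesis.

```python
from collections import Counter


def normalize(text):
    """Lowercase, drop commas, and split into a tuple of tokens."""
    return tuple(text.lower().replace(",", " ").split())


def build_answer_model(candidates):
    """Return (node frequencies, labeled directed edges).

    Equivalent candidates (same normalized form) merge into one node,
    so a node's count is the frequency of that answer. An edge
    (a, b, "inclusion") means the tokens of a are a proper subset of
    the tokens of b, which is an assumed shallow test for inclusion.
    """
    freq = Counter(normalize(c) for c in candidates)
    edges = [
        (a, b, "inclusion")
        for a in freq
        for b in freq
        if a != b and set(a) < set(b)
    ]
    return freq, edges


def score(freq, edges):
    """Frequency plus support from more specific, related candidates."""
    scores = dict(freq)
    for a, b, _label in edges:
        scores[a] += freq[b]
    return scores


freq, edges = build_answer_model(["Paris", "paris", "Paris, France", "Lyon"])
scores = score(freq, edges)
best = max(scores, key=scores.get)  # ("paris",)
```

Here frequency alone gives "Paris" a count of 2, and the inclusion edge from "Paris" to "Paris, France" raises its score to 3, which illustrates how relationships between candidates can complement raw frequency.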
MVP: Multi-task Supervised Pre-training for Natural Language Generation
Pre-trained language models (PLMs) have achieved remarkable success in
natural language generation (NLG) tasks. Up to now, most NLG-oriented PLMs are
pre-trained in an unsupervised manner using large-scale general corpora.
Meanwhile, an increasing number of models pre-trained with labeled data
(i.e. "supervised pre-training") showcase superior performance compared to
unsupervised pre-trained models. Motivated by the success of supervised
pre-training, we propose Multi-task superVised Pre-training (MVP) for natural
language generation. We collect a large-scale natural language generation
corpus, MVPCorpus, from datasets over diverse NLG tasks. Then we
unify these examples into a general text-to-text format to pre-train the text
generation model MVP in a supervised manner. For each task, we further
pre-train specific soft prompts to stimulate the model's capacity to perform a
specific task. Our MVP model can be seen as a practice that utilizes recent
instruction tuning on relatively small PLMs. Extensive experiments have
demonstrated the effectiveness and generality of our MVP model in a number of
NLG tasks, achieving state-of-the-art performance on several of the evaluated
datasets and outperforming both BART and Flan-T5.
Comment: Accepted by ACL 202
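To make the idea of unifying labeled examples "into a general text-to-text format" concrete, here is a minimal sketch. The task names, field names, and instruction prefixes are hypothetical choices for this example and are not claimed to match the templates used to build MVPCorpus.

```python
def to_text_to_text(task, example):
    """Map a task-specific labeled example to an (input, output) string pair.

    The instruction prefixes below are illustrative assumptions only.
    """
    if task == "summarization":
        return f"Summarize: {example['document']}", example["summary"]
    if task == "data-to-text":
        # Flatten (subject, relation, object) triples into one string.
        triples = " | ".join(" ".join(t) for t in example["triples"])
        return f"Describe the following data: {triples}", example["text"]
    if task == "question-generation":
        source = f"{example['passage']} [SEP] {example['answer']}"
        return f"Generate the question: {source}", example["question"]
    raise ValueError(f"unknown task: {task}")


pair = to_text_to_text(
    "data-to-text",
    {"triples": [("Lisbon", "capital_of", "Portugal")],
     "text": "Lisbon is the capital of Portugal."},
)
```

Once every dataset is mapped to such pairs, a single sequence-to-sequence model can be trained on the mixture with an ordinary supervised objective.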
Acquiring and Harnessing Verb Knowledge for Multilingual Natural Language Processing
Advances in representation learning have enabled natural language processing models to derive non-negligible linguistic information directly from text corpora in an unsupervised fashion. However, this signal is underused in downstream tasks, where models tend to fall back on superficial cues and heuristics to solve the problem at hand. Further progress relies on identifying and filling the gaps in the linguistic knowledge captured in model parameters. The objective of this thesis is to address these challenges, focusing on the issues of resource scarcity, interpretability, and lexical knowledge injection, with an emphasis on the category of verbs.
To this end, I propose a novel paradigm for efficient acquisition of lexical knowledge leveraging native speakers’ intuitions about verb meaning to support development and downstream performance of NLP models across languages. First, I investigate the potential of acquiring semantic verb classes from non-experts through manual clustering. This subsequently informs the development of a two-phase semantic dataset creation methodology, which combines semantic clustering with fine-grained semantic similarity judgments collected through spatial arrangements of lexical stimuli. The method is tested on English and then applied to a typologically diverse sample of languages to produce the first large-scale multilingual verb dataset of this kind. I demonstrate its utility as a diagnostic tool by carrying out a comprehensive evaluation of state-of-the-art NLP models, probing representation quality across languages and domains of verb meaning, and shedding light on their deficiencies. Subsequently, I directly address these shortcomings by injecting lexical knowledge into large pretrained language models. I demonstrate that external manually curated information about verbs’ lexical properties can support data-driven models in tasks where accurate verb processing is key. Moreover, I examine the potential of extending these benefits from resource-rich to resource-poor languages through translation-based transfer. The results emphasise the usefulness of human-generated lexical knowledge in supporting NLP models and suggest that time-efficient construction of lexicons similar to those developed in this work, especially in under-resourced languages, can play an important role in boosting their linguistic capacity.
ESRC Doctoral Fellowship [ES/J500033/1], ERC Consolidator Grant LEXICAL [648909