Semantic Relation Extraction. Resources, Tools and Strategies
[Abstract] Relation extraction is a subtask of information extraction that aims at obtaining instances of semantic relations present in texts. This information can be arranged in machine-readable formats, useful for several applications that need structured semantic knowledge. The work presented in this paper explores different strategies to automate the extraction of semantic relations from texts in Portuguese, Galician and Spanish. Both machine learning (distant-supervised and supervised) and rule-based techniques are investigated, and the impact of the different levels of linguistic knowledge is analyzed for the various approaches. Regarding domains, the experiments are focused on the extraction of encyclopedic knowledge, by means of the development of biographical relations classifiers (in a closed domain) and the evaluation of an open information extraction tool. To implement the extraction systems, several natural language processing tools have been built for the three research languages: from sentence splitting and tokenization modules to part-of-speech taggers, named entity recognizers and coreference resolution systems. Furthermore, several lexica and corpora have been compiled and enriched with different levels of linguistic annotation, which are useful for both training and testing probabilistic and symbolic models. As a result of the performed work, new resources and tools are available for automated processing of texts in Portuguese, Galician and Spanish.
Ministerio de Economía y Competitividad; FFI2014-51978-C2-1-R. Ministerio de Economía y Competitividad; FJCI-2014-2285.
Linguistics parameters for zero anaphora resolution
Master's dissertation, Natural Language Processing and Human Language Technology, Univ. do Algarve, 2009.
This dissertation describes and proposes a set of linguistically motivated rules for zero
anaphora resolution in the context of a natural language processing chain developed for
Portuguese. Some languages, like Portuguese, allow noun phrase (NP) deletion (or zeroing)
in several syntactic contexts in order to avoid the redundancy that would result from
repetition of previously mentioned words. The co-reference relation between the zeroed
element and its antecedent (or previous mention) in the discourse is here called zero
anaphora (Mitkov, 2002). In Computational Linguistics, zero anaphora resolution may be
viewed as a subtask of anaphora resolution and has an essential role in various Natural
Language Processing applications such as information extraction, automatic abstracting,
dialog systems, machine translation and question answering. The main goal of this
dissertation is to describe the grammatical rules imposing subject NP deletion and referential
constraints in Brazilian Portuguese, in order to allow a correct identification of the
antecedent of the deleted subject NP. Some of these rules were then formalized into the
Xerox Incremental Parser or XIP (Ait-Mokhtar et al., 2002: 121-144) in order to constitute a
module of the Portuguese grammar (Mamede et al. 2010) developed at Spoken Language
Laboratory (L2F). Using this rule-based approach we expected to improve the performance
of the Portuguese grammar namely by producing better dependency structures with
(reconstructed) zeroed NPs for the syntactic-semantic interface. Because of the complexity
of the task, the scope of this dissertation had to be limited: (a) subject NP deletion; (b) within
sentence boundaries and (c) with an explicit antecedent; besides, (d) rules were formalized
based solely on the results of the shallow parser (or chunks), that is, with minimal syntactic
(and no semantic) knowledge. A corpus of different text genres was manually annotated for
zero anaphors and other zero-shaped, usually indefinite, subjects. The rule-based
approach is evaluated, and its results are presented and discussed.
Towards Multilingual Coreference Resolution
The current work investigates the problems that occur when coreference resolution is considered as a multilingual task. We assess the issues that arise in a framework that uses the mention-pair coreference resolution model and memory-based learning for the resolution process. Along the way, we revise three essential subtasks of coreference resolution: mention detection, mention head detection and feature selection. For each of these aspects we propose various multilingual solutions, including heuristic, rule-based and machine learning methods. We carry out a detailed analysis that includes eight different languages (Arabic, Catalan, Chinese, Dutch, English, German, Italian and Spanish) for which datasets were provided by the only two multilingual shared tasks on coreference resolution held so far: SemEval-2 and CoNLL-2012. Our investigation shows that, although complex, the coreference resolution task can be targeted in a multilingual and even language-independent way. We propose machine learning methods for each of the subtasks that are affected by the transition, and evaluate and compare them to the performance of rule-based and heuristic approaches. Our results confirm that machine learning provides the needed flexibility for the multilingual task and that the minimal requirement for a language-independent system is a part-of-speech annotation layer provided for each of the approached languages. We also show that the performance of the system can be improved by introducing other layers of linguistic annotation, such as syntactic parses (in the form of either constituency or dependency parses), named entity information, predicate-argument structure, etc. Additionally, we discuss the problems occurring in the proposed approaches and suggest possibilities for their improvement.
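The mention-pair model described above can be illustrated with a small sketch: every mention is paired with each preceding mention as a candidate antecedent, and a feature vector is computed per pair for a binary classifier. The data structures and features below are hypothetical simplifications for illustration, not the thesis's actual feature set (which relies on memory-based learning).

```python
# Hypothetical mention-pair candidate generation and toy features.
# Mention dicts and feature names are assumptions for this sketch.

def generate_pairs(mentions):
    """Pair each mention with every preceding mention as a candidate antecedent."""
    pairs = []
    for j in range(1, len(mentions)):
        for i in range(j):
            pairs.append((mentions[i], mentions[j]))
    return pairs

def features(antecedent, anaphor):
    """Toy feature vector: string match, head match, distance in mentions."""
    return {
        "exact_match": antecedent["text"].lower() == anaphor["text"].lower(),
        "head_match": antecedent["head"] == anaphor["head"],
        "distance": anaphor["index"] - antecedent["index"],
    }

mentions = [
    {"index": 0, "text": "Barack Obama", "head": "Obama"},
    {"index": 1, "text": "the president", "head": "president"},
    {"index": 2, "text": "Obama", "head": "Obama"},
]
pairs = generate_pairs(mentions)  # 3 mentions yield 3 candidate pairs
```

A classifier (memory-based or otherwise) would then label each pair as coreferent or not, after which the pairwise decisions are clustered into entities.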
Gender-specific Machine Translation with Large Language Models
Decoder-only Large Language Models (LLMs) have demonstrated potential in
machine translation (MT), albeit with performance slightly lagging behind
traditional encoder-decoder Neural Machine Translation (NMT) systems. However,
LLMs offer a unique advantage: the ability to control the properties of the
output through prompts. In this study, we harness this flexibility to explore
LLaMa's capability to produce gender-specific translations for languages with
grammatical gender. Our results indicate that LLaMa can generate
gender-specific translations with competitive accuracy and gender bias
mitigation when compared to NLLB, a state-of-the-art multilingual NMT system.
Furthermore, our experiments reveal that LLaMa's translations are robust,
showing significant performance drops when evaluated against opposite-gender
references in gender-ambiguous datasets but maintaining consistency in less
ambiguous contexts. This research provides insights into the potential and
challenges of using LLMs for gender-specific translations and highlights the
importance of in-context learning to elicit new tasks in LLMs.
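As a rough illustration of the prompting idea, the sketch below composes a gender-controlled translation instruction for a decoder-only LLM. The prompt wording and function name are assumptions for illustration, not the paper's actual template.

```python
# Hypothetical prompt builder for gender-specific translation with an LLM.
# The template is an assumed example, not the template used in the study.

def gender_prompt(source, src_lang, tgt_lang, gender):
    """Compose an instruction asking the model to translate using a
    specific grammatical gender wherever the source is ambiguous."""
    return (
        f"Translate the following {src_lang} sentence into {tgt_lang}, "
        f"using {gender} grammatical forms where the gender is ambiguous.\n"
        f"Sentence: {source}\n"
        f"Translation:"
    )

# "I am tired." is gender-ambiguous in English but gendered in Portuguese
# ("cansado" vs. "cansada"), so the instruction disambiguates the output.
prompt = gender_prompt("I am tired.", "English", "Portuguese", "feminine")
```

The resulting string would be sent to the model as a completion or chat prompt; evaluating against same-gender and opposite-gender references then measures how well the model follows the constraint.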
Collaborative relation annotation and quality analysis in Markyt environment
Text mining is showing potential to help in biomedical knowledge integration and discovery at various levels. However, results depend largely on the specifics of the knowledge problem and, in particular, on the ability to produce high-quality benchmarking corpora that may support the training and evaluation of automatic prediction systems. Annotation tools enabling the flexible and customizable production of such corpora are thus pivotal. The open-source Markyt annotation environment brings together the latest web technologies to offer a wide range of annotation capabilities in a domain-agnostic way. It enables the management of multi-user and multi-round annotation projects, including inter-annotator agreement and consensus assessments. Also, Markyt supports the description of entity and relation annotation guidelines on a project basis, being flexible to partial word tagging and the occurrence of annotation overlaps. This paper describes the current release of Markyt, namely new annotation perspectives, which enable the annotation of relations among entities, and enhanced analysis capabilities. Several demos, inspired by public biomedical corpora, are presented as means to better illustrate such functionalities. Markyt aims to bring together annotation capabilities of broad interest to those producing annotated corpora. Markyt demonstration projects describe 20 different annotation tasks of varied document sources (e.g. abstracts, tweets or drug labels) and languages (e.g. English, Spanish or Chinese). Continuous development is based on feedback from practical applications as well as community reports on short- and medium-term mining challenges. Markyt is freely available for non-commercial use at http://markyt.org.
This work was partially supported by the Portuguese Foundation for Science and Technology (FCT) under the scope of the strategic funding of the UID/BIO/04469/2013 unit and COMPETE 2020 (POCI-01-0145-FEDER-006684).
The authors also acknowledge the PhD grants of M.P.-P. and G.P.-R., funded by the Xunta de Galicia.
Rapport: a fact-based question answering system for Portuguese
Question answering is one of the longest-standing problems in natural language processing. Although natural language interfaces for computer systems can be considered
more common these days, the same still does not happen regarding access to specific
textual information. Any full text search engine can easily retrieve documents containing user-specified or closely related terms; however, it is typically unable to answer user
questions with small passages or short answers.
The problem with question answering is that text is hard to process, due to its syntactic structure and, to a higher degree, to its semantic contents. At the sentence level,
although the syntactic aspects of natural language have well known rules, the size and
complexity of a sentence may make it difficult to analyze its structure. Furthermore, semantic aspects are still arduous to address, with text ambiguity being one of the hardest
tasks to handle. There is also the need to correctly process the question in order to define its target, and then select and process the answers found in a text. Additionally, the
selected text that may yield the answer to a given question must be further processed
in order to present just a passage instead of the full text. These issues also take longer
to address in languages other than English, such as Portuguese, which have far
fewer people working on them.
This work focuses on question answering for Portuguese. In other words, our field
of interest is in the presentation of short answers, passages, and possibly full sentences,
but not whole documents, to questions formulated using natural language. For that purpose, we have developed a system, RAPPORT, built upon the use of open information
extraction techniques for extracting triples, so-called facts, characterizing information
in text files, and then storing and using them for answering user queries posed in natural language. These facts, in the form of subject, predicate and object, alongside other
metadata, constitute the basis of the answers presented by the system. Facts work both
by storing short and direct information found in a text, typically entity related information, and by containing in themselves the answers to the questions already in the
form of small passages. As for the results, although there is margin for improvement,
they are a tangible proof of the adequacy of our approach and its different modules for
storing information and retrieving answers in question answering systems.
In the process, in addition to contributing a new approach to question answering for Portuguese, and validating the application of open information extraction to
question answering, we have developed a set of tools that has been used in other natural language processing works, such as the lemmatizer LEMPORT,
which was built from scratch and achieves high accuracy. Many of these tools result from
the improvement of those found in the Apache OpenNLP toolkit, by pre-processing their
input, post-processing their output, or both, and by training models for use in those
tools or in others, such as MaltParser. Other tools include interfaces for
other resources containing, for example, synonyms, hypernyms and hyponyms, and the creation of lists of, for instance, relations between verbs and agents, using rules.
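The fact-based approach can be sketched as a minimal triple store: facts are (subject, predicate, object) tuples, and a question is answered by matching its target subject and predicate. The class name and the exact-match lookup below are illustrative assumptions, not RAPPORT's implementation.

```python
# Illustrative fact store in the spirit of (subject, predicate, object)
# triples; the matching strategy is a deliberate simplification.

class FactStore:
    def __init__(self):
        self.facts = []  # list of (subject, predicate, object) triples

    def add(self, subject, predicate, obj):
        self.facts.append((subject, predicate, obj))

    def answer(self, subject, predicate):
        """Return the objects of all facts matching the question's target."""
        return [o for s, p, o in self.facts
                if s.lower() == subject.lower()
                and p.lower() == predicate.lower()]

store = FactStore()
store.add("Lisbon", "is the capital of", "Portugal")
store.add("Camões", "wrote", "Os Lusíadas")

# "What did Camões write?" maps to the target pair (Camões, wrote).
answers = store.answer("Camões", "wrote")  # ["Os Lusíadas"]
```

In a real system, the question would first be parsed to identify its target subject and predicate, and matching would tolerate lemmatized or paraphrased predicates rather than requiring exact strings.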
Robustness in Coreference Resolution
Coreference resolution is the task of determining different expressions of a text that refer to the same entity. The resolution of coreferring expressions is an essential step for automatic interpretation of the text. While coreference information is beneficial for various NLP tasks like summarization, question answering, and information extraction, state-of-the-art coreference resolvers are barely used in any of these tasks. The problem is the lack of robustness in coreference resolution systems. A coreference resolver that gets higher scores on the standard
evaluation set does not necessarily perform better than the others on a new test set.
In this thesis, we introduce robustness in coreference resolution by (1) introducing a reliable evaluation framework for recognizing robust improvements, and (2) proposing a solution that results in robust coreference resolvers.
As the first step of setting up the evaluation framework, we introduce a reliable evaluation metric, called LEA, that overcomes the drawbacks of the existing metrics. We analyze LEA based on various types of errors in coreference outputs and show that it results in reliable scores. In addition to an evaluation metric, we also introduce an evaluation setting in which we disentangle coreference evaluations from parsing complexities. Coreference resolution is affected by parsing complexities for detecting the boundaries of expressions that have complex syntactic structures. We reduce the effect of parsing errors in coreference evaluation by automatically extracting a minimum span for each expression. We then emphasize the importance of out-of-domain evaluations and generalization in coreference resolution and discuss the reasons behind the poor generalization of state-of-the-art coreference resolvers.
Finally, we show that enhancing state-of-the-art coreference resolvers with linguistic features is a promising approach for making coreference resolvers robust across domains. The
incorporation of linguistic features with all their values does not improve the performance.
However, we introduce an efficient pattern mining approach, called EPM, that mines all feature-value combinations that are discriminative for coreference relations. We then only
incorporate feature-values that are discriminative for coreference relations. By employing EPM feature-values, performance improves significantly across various domains.
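The LEA idea referred to above can be sketched as a link-based, entity-aware score: each gold entity contributes the fraction of its coreference links that are resolved in the system output, weighted by the entity's size. The sketch below covers only the recall side and omits singleton handling; it is a simplified reading of the metric, not its full definition.

```python
# Simplified recall-side sketch of a link-based entity-aware score (LEA-style).
# Entities are represented as sets of mention identifiers; singleton
# handling and the precision side are omitted in this sketch.

def links(n):
    """Number of coreference links in an entity with n mentions."""
    return n * (n - 1) // 2

def lea_recall(key_entities, response_entities):
    num, den = 0.0, 0
    for k in key_entities:
        if len(k) < 2:
            continue  # singleton handling omitted in this sketch
        # Links of k that survive in some response entity.
        resolved = sum(links(len(k & r)) for r in response_entities)
        num += len(k) * resolved / links(len(k))
        den += len(k)
    return num / den if den else 0.0

key = [{"m1", "m2", "m3"}, {"m4", "m5"}]
response = [{"m1", "m2"}, {"m3"}, {"m4", "m5"}]
# Entity 1: 1 of 3 links resolved; entity 2: 1 of 1 link resolved.
score = lea_recall(key, response)  # (3 * 1/3 + 2 * 1/1) / 5 = 0.6
```

Weighting by entity size is what makes the score "entity aware": resolving one link in a large entity counts for less than fully resolving a small one, which is one of the drawbacks of purely link-based metrics that LEA addresses.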
Semantic relation extraction. Resources, tools and strategies
Relation extraction, framed within the tasks of information extraction, aims to automatically obtain examples of semantic relations present in texts. This information can later be organized in machine-readable formats, being useful for several applications that require structured semantic knowledge. This thesis evaluates different strategies for the automatic extraction of semantic relations from texts in Portuguese, Spanish and Galician. To this end, both machine learning techniques (distant-supervised and supervised) and rule-based systems are used, and the impact of different levels of linguistic knowledge on the various evaluated approaches is analyzed. Regarding the domain, the extractions deal with encyclopedic knowledge, through the creation of biographical relation classifiers (in a closed domain) and the evaluation of open information extraction systems.
In order to implement the extraction systems, several natural language processing tools were also built for the three languages mentioned: from sentence segmentation and tokenization modules to part-of-speech tagging, named entity recognition and coreference resolution systems. In addition, lexica and corpora with different levels of linguistic annotation were compiled and adapted, useful for training and evaluating probabilistic and rule-based models. As a result of the work carried out in this thesis, new tools and resources are made available for the automatic processing of texts in Portuguese, Spanish and Galician.