Automatic semantic role labeling for European Portuguese
Master's dissertation, Language Sciences, Faculdade de Ciências Humanas e Sociais, Universidade do Algarve, 2014.
This thesis addresses the task of Semantic Role Labeling (SRL) in European Portuguese. SRL can be
used in a number of NLP applications, namely Anaphora Resolution, Question Answering, Summarization,
etc.
A general-purpose, consensual set of 37 semantic roles was defined, based on a survey of the relevant
related work, and using highly reproducible properties. A set of annotation guidelines was also built, in
order to clarify how semantic roles should be assigned to verbal arguments in context.
An SRL module was built and integrated into a fully-fledged Natural Language Processing (NLP) chain,
named STRING, developed at INESC-ID Lisboa.
For this module, information from ViPEr, a lexicon-syntax database containing the relevant
linguistic information for more than 6,000 European Portuguese full (or lexical, or distributional) verbs,
was used; the database was manually enriched with the semantic roles of all verbal arguments
(subject and essential complements).
The SRL module is composed of 183 pattern-matching rules for labeling the subject (N0) and the first (N1) and
second (N2) essential complements of verbal constructions; it also allows the assignment of semantic roles to other
syntactic slots, such as time, locative, manner, instrumental, comitative and other complements
(both essential and circumstantial).
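As a rough illustration of how such pattern-matching rules operate, the sketch below maps (verb class, syntactic slot) pairs to semantic roles. The class names, slot inventory and role labels are hypothetical simplifications for illustration, not ViPEr's or STRING's actual rule format:

```python
# Minimal sketch of pattern-matching semantic role labeling:
# each rule maps a (verb class, syntactic slot) pair to a semantic role.
# Verb classes and role names here are illustrative, not ViPEr's actual codes.

RULES = {
    ("transfer", "N0"): "Agent",
    ("transfer", "N1"): "Object",
    ("transfer", "N2"): "Recipient",
    ("movement", "N0"): "Agent",
    ("movement", "N1"): "Location",
}

def label_arguments(verb_class, slots):
    """Assign a semantic role to each filled syntactic slot, if a rule matches."""
    return {slot: RULES.get((verb_class, slot)) for slot in slots}

# Example: "O João deu um livro à Maria" (John gave a book to Mary)
roles = label_arguments("transfer", ["N0", "N1", "N2"])
print(roles)  # {'N0': 'Agent', 'N1': 'Object', 'N2': 'Recipient'}
```

A real system would, of course, first need the parser to identify the verb's ViPEr class and fill the syntactic slots before any rule can fire.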
This module was tested on a small corpus of real texts of varied typology and topics, manually annotated
by two linguists specifically for this purpose. After this manual annotation, the corpus, containing 655
semantic roles, was used as a gold standard for automatic comparison with the system's output.
Considering that the SRL module operates at the last stages of the processing chain, a relatively
high precision was achieved (69.9% in a strict evaluation and 77.7% when the evaluation included partial
matches), though recall was low (17.9%), which calls for future improvements.
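The strict versus partial-match evaluation reported above can be sketched as follows; the span representation and matching policy are illustrative assumptions, not the thesis's actual scoring procedure:

```python
# Sketch of strict vs. partial-match evaluation for SRL output.
# A prediction is a (span, role) pair; strict scoring requires an exact span
# match, while partial scoring accepts overlapping spans with the same role.

def overlaps(a, b):
    """True if half-open token spans a and b share at least one token."""
    return a[0] < b[1] and b[0] < a[1]

def evaluate(gold, predicted, partial=False):
    matched = 0
    for span_p, role_p in predicted:
        for span_g, role_g in gold:
            if role_p == role_g and (span_p == span_g or
                                     (partial and overlaps(span_p, span_g))):
                matched += 1
                break
    precision = matched / len(predicted) if predicted else 0.0
    recall = matched / len(gold) if gold else 0.0
    return precision, recall

gold = [((0, 2), "Agent"), ((3, 5), "Object"), ((6, 9), "Recipient")]
pred = [((0, 2), "Agent"), ((3, 4), "Object")]
print(evaluate(gold, pred))                # strict:  (0.5, 0.333...)
print(evaluate(gold, pred, partial=True))  # partial: (1.0, 0.666...)
```

The gap between the two precision figures in the thesis (69.9% vs. 77.7%) corresponds exactly to predictions of this second kind: correct role, imperfect span.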
Evaluating the semantic web: a task-based approach
The increased availability of online knowledge has led to the design of several algorithms that solve a variety of tasks by harvesting the Semantic Web, i.e. by dynamically selecting and exploring a multitude of online ontologies. Our hypothesis is that the performance of such novel algorithms implicitly provides an insight into the quality of the used ontologies and thus opens the way to a task-based evaluation of the Semantic Web. We have investigated this hypothesis by studying the lessons learnt about online ontologies when used to solve three tasks: ontology matching, folksonomy enrichment, and word sense disambiguation. Our analysis leads to a suite of conclusions about the status of the Semantic Web, which highlight a number of strengths and weaknesses of the semantic information available online and complement the findings of other analyses of the Semantic Web landscape.
VerbAtlas: a novel large-scale verbal semantic resource and its application to semantic role labeling
We present VerbAtlas, a new, hand-crafted lexical-semantic resource whose goal is to bring together all verbal synsets from WordNet into semantically-coherent frames. The frames define a common, prototypical argument structure while at the same time providing new concept-specific information. In contrast to PropBank, which defines enumerative semantic roles, VerbAtlas comes with an explicit, cross-frame set of semantic roles linked to selectional preferences expressed in terms of WordNet synsets, and is the first resource enriched with semantic information about implicit, shadow, and default arguments.
We demonstrate the effectiveness of VerbAtlas in the task of dependency-based Semantic Role Labeling and show how its integration into a high-performance system leads to improvements on both the in-domain and out-of-domain test sets of CoNLL-2009. VerbAtlas is available at http://verbatlas.org.
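As a rough sketch of the kind of lookup such a resource supports, the toy structure below maps verbal synsets to frames carrying a prototypical argument structure with cross-frame roles and selectional preferences. All frame names, roles and synsets here are invented for illustration, not VerbAtlas's actual inventory:

```python
# Toy frame inventory: each frame bundles verbal synsets and a prototypical
# argument structure whose roles carry WordNet-style selectional preferences.
# Entirely illustrative; VerbAtlas's real frames and roles differ.

FRAMES = {
    "GIVE": {
        "roles": [("Agent", "person.n.01"), ("Theme", "entity.n.01"),
                  ("Recipient", "person.n.01")],
        "synsets": {"give.v.01", "hand.v.01", "donate.v.01"},
    },
    "EAT": {
        "roles": [("Agent", "animal.n.01"), ("Patient", "food.n.01")],
        "synsets": {"eat.v.01", "devour.v.01"},
    },
}

def frame_of(synset):
    """Return the frame a verbal synset belongs to, if any."""
    for name, frame in FRAMES.items():
        if synset in frame["synsets"]:
            return name
    return None

print(frame_of("donate.v.01"))  # GIVE
```

The key design point the abstract highlights is visible even in this toy: roles like "Agent" are shared across frames rather than enumerated per-predicate as in PropBank's A0/A1 scheme.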
Cross-Lingual Semantic Role Labeling with High-Quality Translated Training Corpus
Much research effort has been devoted to semantic role labeling (SRL), which is
crucial for natural language understanding. Supervised approaches have achieved
impressive performance when large-scale corpora are available for
resource-rich languages such as English. For low-resource languages
with no annotated SRL dataset, however, it is still challenging to obtain competitive
performance. Cross-lingual SRL is one promising way to address the problem,
and has achieved great advances with the help of model transfer and
annotation projection. In this paper, we propose a novel alternative based on
corpus translation, constructing high-quality training datasets for the target
languages from the source gold-standard SRL annotations. Experimental results
on Universal Proposition Bank show that the translation-based method is highly
effective, and the automatic pseudo datasets can improve the target-language
SRL performance significantly. Comment: Accepted at ACL 2020.
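Annotation projection, one of the transfer strategies mentioned above, can be sketched minimally: roles assigned to source-language tokens are carried over to target-language tokens through word alignments. The index-based representation below is a simplifying assumption, not the paper's actual pipeline:

```python
# Minimal sketch of annotation projection for cross-lingual SRL:
# semantic roles on source tokens are mapped onto target tokens via
# word alignments produced by a (hypothetical) translation/alignment step.

def project_roles(source_roles, alignment):
    """source_roles: {src_token_index: role}; alignment: {src_index: tgt_index}.

    Returns target-side role annotations; unaligned roles are dropped."""
    return {alignment[i]: role
            for i, role in source_roles.items() if i in alignment}

# English: "Mary(A0) sold the car(A1)" -> Portuguese: "Maria vendeu o carro"
source_roles = {0: "A0", 3: "A1"}
alignment = {0: 0, 1: 1, 2: 2, 3: 3}
print(project_roles(source_roles, alignment))  # {0: 'A0', 3: 'A1'}
```

The paper's translation-based alternative sidesteps the noisy-alignment problem this sketch exposes (unaligned or misaligned roles are simply lost) by translating whole annotated corpora instead.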
Knowledge Representation of Crime-Related Events: a Preliminary Approach
Crime is reported in every daily newspaper and, in particular, in criminal investigation reports produced by several police departments, creating a large amount of data to be processed by humans. Other research studies related to relation extraction (a branch of information retrieval) in Portuguese have arisen over the years, but with few extracted relations and a variety of computational approaches that could be improved with recent techniques to achieve better performance.
This paper presents ongoing work on populating the SEM (Simple Event Model) ontology with instances retrieved from crime-related documents, supported by an SVO (Subject, Verb, Object) algorithm that uses hand-crafted rules to extract events, achieving a performance of 0.86 (F-measure).
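A minimal sketch of rule-based SVO extraction over a dependency parse is given below; the token representation and the Universal Dependencies-style labels are assumptions for illustration, not the paper's actual algorithm:

```python
# Sketch of rule-based SVO (Subject, Verb, Object) event extraction over a
# dependency-parsed sentence. The parse format is hypothetical: each token is
# (word, dependency_label, head_index), with UD-style labels assumed.

def extract_svo(tokens):
    """Return an (S, V, O) triple if the sentence yields one, else None."""
    subj = verb = obj = None
    for word, dep, head in tokens:
        if dep == "root":      # main verb of the clause
            verb = word
        elif dep == "nsubj":   # nominal subject
            subj = word
        elif dep == "obj":     # direct object
            obj = word
    return (subj, verb, obj) if subj and verb and obj else None

sentence = [("police", "nsubj", 1), ("arrested", "root", -1),
            ("suspect", "obj", 1)]
print(extract_svo(sentence))  # ('police', 'arrested', 'suspect')
```

Triples of this form can then populate SEM-style event instances (actor, event, object), which is the ontology-population step the paper describes.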
Multilingual Twitter Sentiment Classification: The Role of Human Annotators
What are the limits of automated Twitter sentiment classification? We analyze
a large set of manually labeled tweets in different languages, use them as
training data, and construct automated classification models. It turns out that
the quality of classification models depends much more on the quality and size
of training data than on the type of the model trained. Experimental results
indicate that there is no statistically significant difference between the
performance of the top classification models. We quantify the quality of
training data by applying various annotator agreement measures, and identify
the weakest points of different datasets. We show that the model performance
approaches the inter-annotator agreement when the size of the training set is
sufficiently large. However, it is crucial to regularly monitor the self- and
inter-annotator agreements since this improves the training datasets and
consequently the model performance. Finally, we show that there is strong
evidence that humans perceive the sentiment classes (negative, neutral, and
positive) as ordered.
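One standard agreement measure of the kind used above is Cohen's kappa, which corrects raw agreement for chance. The sketch below uses invented toy labels; the paper applies several agreement measures, of which this is only one:

```python
# Cohen's kappa: chance-corrected agreement between two annotators.
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    n = len(labels_a)
    # Observed agreement: fraction of items both annotators labeled the same.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement from each annotator's label distribution.
    ca, cb = Counter(labels_a), Counter(labels_b)
    expected = sum(ca[k] * cb[k] for k in set(ca) | set(cb)) / (n * n)
    return (observed - expected) / (1 - expected)

# Toy sentiment labels from two hypothetical annotators.
a = ["pos", "neg", "neu", "pos", "neg", "pos"]
b = ["pos", "neg", "neu", "neg", "neg", "pos"]
print(round(cohens_kappa(a, b), 3))  # 0.739
```

The paper's observation that model performance plateaus near the inter-annotator agreement means a score like this effectively bounds what any classifier trained on the data can achieve.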
PRIVAFRAME: A Frame-Based Knowledge Graph for Sensitive Personal Data
The pervasiveness of dialogue systems and virtual conversation applications raises an important theme: the potential sharing of sensitive information, and the consequent need for protection. To guarantee the subject's right to privacy and avoid the leakage of private content, it is important to treat sensitive information. However, any such treatment first requires identifying sensitive text, and appropriate techniques to do so automatically. The Sensitive Information Detection (SID) task has been explored in the literature in different domains and languages, but there is no common benchmark. Current approaches are mostly based on artificial neural networks (ANNs) or transformers built on them. Our research focuses on identifying categories of personal data in informal English sentences by adopting a new logical-symbolic approach, and eventually hybridising it with ANN models. We present a frame-based knowledge graph built for the personal data categories defined in the Data Privacy Vocabulary (DPV). The knowledge graph is designed through the logical composition of already existing frames, and has been evaluated as background knowledge for a SID system against a labeled sensitive-information dataset. The accuracy of PRIVAFRAME reached 78%. By comparison, a transformer-based model achieved 12% lower performance on the same dataset. The top-down logical-symbolic frame-based model allows a granular analysis and does not require a training dataset. These advantages lead us to use it as a layer in a hybrid model, where the logical SID is combined with an ANN-based SID tested in a previous study by the authors.
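To illustrate the flavour of a frame-based, training-free SID layer, the sketch below flags personal-data categories when a sentence contains one of a frame's trigger expressions. The categories and triggers are invented for illustration; they are not DPV's or PRIVAFRAME's actual frames:

```python
# Toy frame-style matcher for Sensitive Information Detection (SID):
# a personal-data category is reported when the sentence evokes one of the
# frame's lexical triggers. Categories and triggers are illustrative only.

CATEGORY_FRAMES = {
    "Health": {"diagnosed", "illness", "medication", "therapy"},
    "Location": {"live in", "address", "moved to"},
    "FinancialAccount": {"bank account", "iban", "credit card"},
}

def detect_categories(sentence):
    """Return the sorted list of categories evoked by the sentence."""
    text = sentence.lower()
    return sorted(cat for cat, triggers in CATEGORY_FRAMES.items()
                  if any(t in text for t in triggers))

print(detect_categories("I was diagnosed last year and moved to Lisbon."))
# ['Health', 'Location']
```

As the abstract notes, the appeal of this style of model is that it needs no training data and its decisions are inspectable: each detection traces back to a specific trigger in a specific frame.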
Cross-Lingual Induction and Transfer of Verb Classes Based on Word Vector Space Specialisation
Existing approaches to automatic VerbNet-style verb classification are
heavily dependent on feature engineering and therefore limited to languages
with mature NLP pipelines. In this work, we propose a novel cross-lingual
transfer method for inducing VerbNets for multiple languages. To the best of
our knowledge, this is the first study which demonstrates how the architectures
for learning word embeddings can be applied to this challenging
syntactic-semantic task. Our method uses cross-lingual translation pairs to tie
each of the six target languages into a bilingual vector space with English,
jointly specialising the representations to encode the relational information
from English VerbNet. A standard clustering algorithm is then run on top of the
VerbNet-specialised representations, using vector dimensions as features for
learning verb classes. Our results show that the proposed cross-lingual
transfer approach sets new state-of-the-art verb classification performance
across all six target languages explored in this work. Comment: EMNLP 2017 (long paper).
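The final clustering step can be illustrated minimally: verbs, represented as vectors in the specialised space, are grouped by similarity. In this sketch each verb is assigned to its most similar seed verb by cosine similarity over toy two-dimensional vectors; the paper runs a standard clustering algorithm over real high-dimensional embeddings, so nearest-seed assignment is a deliberate simplification:

```python
# Toy similarity-based grouping of verbs in a "VerbNet-specialised" space:
# each non-seed verb joins the seed verb whose vector is most similar.
import math

def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) *
                  math.sqrt(sum(y * y for y in v)))

def cluster(vectors, seeds):
    """Assign each non-seed verb to its most similar seed by cosine."""
    return {v: max(seeds, key=lambda s: cosine(vectors[v], vectors[s]))
            for v in vectors if v not in seeds}

# Invented 2-D vectors; real embeddings have hundreds of dimensions.
vectors = {
    "give": [0.9, 0.1], "send": [0.8, 0.2],   # transfer-like verbs
    "run":  [0.1, 0.9], "walk": [0.2, 0.8],   # motion-like verbs
}
print(cluster(vectors, seeds=["give", "run"]))
# {'send': 'give', 'walk': 'run'}
```

The point the paper makes is that after cross-lingual specialisation, plain geometric proximity like this already encodes VerbNet-class membership in the target language.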
The precedence of syntax in the rapid emergence of human language in evolution as defined by the integration hypothesis
Our core hypothesis is that the emergence of human language arose very rapidly from the linking of two pre-adapted systems found elsewhere in the animal world: an expression system, found, for example, in birdsong, and a lexical system, suggestively found in non-human primate calls (Miyagawa et al., 2013, 2014). We challenge the view that language has undergone a series of gradual changes, or a single preliminary protolinguistic stage, before achieving its full character. We argue that a full-fledged combinatorial operation Merge triggered the integration of these two pre-adapted systems, giving rise to a fully developed language. This goes against the gradualist view that there existed a structureless, protolinguistic stage, in which a rudimentary proto-Merge operation generated internally flat words. It is argued that compounds in present-day language are a fossilized form of this prior stage, a point which we will question.