
    Automatic semantic role labeling for European Portuguese

    Master's dissertation, Ciências da Linguagem, Faculdade de Ciências Humanas e Sociais, Universidade do Algarve, 2014. This thesis addresses the task of Semantic Role Labeling (SRL) in European Portuguese. SRL can be used in a number of NLP applications, namely Anaphora Resolution, Question Answering, Information Retrieval/Extraction, Summarization, etc. A general-purpose, consensual set of 37 semantic roles was defined, based on a survey of the relevant related work and using highly reproducible properties. A set of annotation guidelines was also built, in order to clarify how semantic roles should be assigned to verbal arguments in context. An SRL module was built and integrated into a fully-fledged Natural Language Processing (NLP) chain, named STRING, developed at INESC-ID Lisboa. This module uses the information from ViPEr, a lexicon-syntactic database that contains the relevant linguistic information for more than 6,000 European Portuguese full (or lexical, or distributional) verbs; the database was manually enriched with the information pertaining to the semantic roles of all verbal arguments (subject and essential complements). The SRL module is composed of 183 pattern-matching rules for labeling the subject (N0) and the first (N1) and second (N2) essential complements of verbal constructions, and it also allows the attribution of semantic roles to other syntactic slots, such as time, locative, manner, instrumental, comitative and other complements (both essential and circumstantial). The module was tested on a small corpus of real texts of varied types and topics, which was manually annotated by two linguists specifically for this purpose. After this manual annotation, the corpus, containing 655 semantic roles, was used as a gold standard for automatic comparison with the system's output. Considering that the SRL module operates at the last stages of the processing chain, a relatively high precision was achieved (69.9% in a strict evaluation and 77.7% when the evaluation included partial matches), though recall was low (17.9%), which calls for future improvements.
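
The lexicon-driven, rule-based design described above can be illustrated with a small sketch. The entry format, verb entries, and role names below are invented stand-ins for ViPEr's actual content and STRING's rule syntax; the sketch only shows the general idea of mapping parsed N0/N1/N2 slots to semantic roles via a verb lexicon.

```python
# Minimal sketch of lexicon-driven semantic role labeling (Python 3.8+).
# The lexicon below is a hypothetical, simplified stand-in for ViPEr entries.
VIPER_LIKE_LEXICON = {
    "dar":    {"N0": "Agent", "N1": "Object", "N2": "Recipient"},  # "to give"
    "gostar": {"N0": "Experiencer", "N1": "Stimulus"},             # "to like"
}

def label_arguments(verb_lemma, slots):
    """Assign semantic roles to parsed argument slots,
    e.g. slots = {'N0': 'o João', 'N1': 'um livro', 'N2': 'à Maria'}."""
    entry = VIPER_LIKE_LEXICON.get(verb_lemma)
    if entry is None:
        return {}  # unknown verb: no roles assigned (one source of the low recall noted above)
    return {phrase: role
            for slot, role in entry.items()
            if (phrase := slots.get(slot)) is not None}

print(label_arguments("dar", {"N0": "o João", "N1": "um livro", "N2": "à Maria"}))
# {'o João': 'Agent', 'um livro': 'Object', 'à Maria': 'Recipient'}
```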

    Evaluating the semantic web: a task-based approach

    The increased availability of online knowledge has led to the design of several algorithms that solve a variety of tasks by harvesting the Semantic Web, i.e. by dynamically selecting and exploring a multitude of online ontologies. Our hypothesis is that the performance of such novel algorithms implicitly provides an insight into the quality of the ontologies used, and thus opens the way to a task-based evaluation of the Semantic Web. We have investigated this hypothesis by studying the lessons learnt about online ontologies when used to solve three tasks: ontology matching, folksonomy enrichment, and word sense disambiguation. Our analysis leads to a suite of conclusions about the status of the Semantic Web, which highlight a number of strengths and weaknesses of the semantic information available online and complement the findings of other analyses of the Semantic Web landscape.

    VerbAtlas: a novel large-scale verbal semantic resource and its application to semantic role labeling

    We present VerbAtlas, a new, hand-crafted lexical-semantic resource whose goal is to bring together all verbal synsets from WordNet into semantically-coherent frames. The frames define a common, prototypical argument structure while at the same time providing new concept-specific information. In contrast to PropBank, which defines enumerative semantic roles, VerbAtlas comes with an explicit, cross-frame set of semantic roles linked to selectional preferences expressed in terms of WordNet synsets, and is the first resource enriched with semantic information about implicit, shadow, and default arguments. We demonstrate the effectiveness of VerbAtlas in the task of dependency-based Semantic Role Labeling and show how its integration into a high-performance system leads to improvements on both the in-domain and out-of-domain test sets of CoNLL-2009. VerbAtlas is available at http://verbatlas.org
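
As an illustration of the kind of lookup such a resource enables, the sketch below maps a verbal WordNet synset to its frame and to the frame's cross-frame roles with selectional preferences. The frame name, role inventory, and preferences are invented for the example; the real data is available at the URL above.

```python
# Illustrative VerbAtlas-style lookup: verbal WordNet synsets grouped into frames
# that share a cross-frame role set. All entries below are made up for the sketch.
FRAMES = {
    "EAT_CONSUME": {
        "synsets": {"eat.v.01", "devour.v.01"},
        # role -> selectional preference, expressed as a WordNet synset
        "roles": {"Agent": "animal.n.01", "Patient": "food.n.01"},
    },
}

SYNSET_TO_FRAME = {s: name for name, frame in FRAMES.items() for s in frame["synsets"]}

def frame_for_synset(synset_id):
    """Return (frame name, role -> preferred synset) for a verbal synset, or None."""
    name = SYNSET_TO_FRAME.get(synset_id)
    return (name, FRAMES[name]["roles"]) if name else None

print(frame_for_synset("devour.v.01"))
# ('EAT_CONSUME', {'Agent': 'animal.n.01', 'Patient': 'food.n.01'})
```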

    Cross-Lingual Semantic Role Labeling with High-Quality Translated Training Corpus

    Much research effort has been devoted to semantic role labeling (SRL), which is crucial for natural language understanding. Supervised approaches have achieved impressive performance when large-scale corpora are available for resource-rich languages such as English, while for low-resource languages with no annotated SRL dataset it remains challenging to obtain competitive performance. Cross-lingual SRL is one promising way to address the problem, and it has achieved great advances with the help of model transfer and annotation projection. In this paper, we propose a novel alternative based on corpus translation, constructing high-quality training datasets for the target languages from the source gold-standard SRL annotations. Experimental results on the Universal Proposition Bank show that the translation-based method is highly effective, and the automatic pseudo datasets can improve target-language SRL performance significantly. Comment: Accepted at ACL 2020.
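
A minimal sketch of the annotation-projection step that such corpus-translation approaches build on: given per-token role labels on the source sentence and a word alignment into its translation, copy each label onto the aligned target tokens. The function and the toy alignment are assumptions for illustration, not the paper's implementation.

```python
# Sketch of projecting SRL labels from a source sentence onto its translation
# through word alignments; a simplification of the corpus-translation idea.
def project_roles(src_labels, alignment, tgt_len):
    """src_labels: per-token labels on the source sentence ('O' = no role).
    alignment: (src_index, tgt_index) word-alignment pairs.
    Returns per-token labels for the target sentence."""
    tgt_labels = ["O"] * tgt_len
    for src_i, tgt_i in alignment:
        if src_labels[src_i] != "O":
            tgt_labels[tgt_i] = src_labels[src_i]
    return tgt_labels

# "The cat ate the fish" -> "Die Katze frass den Fisch" (1-to-1 toy alignment)
src_labels = ["B-A0", "I-A0", "V", "B-A1", "I-A1"]
alignment = [(0, 0), (1, 1), (2, 2), (3, 3), (4, 4)]
print(project_roles(src_labels, alignment, 5))
# ['B-A0', 'I-A0', 'V', 'B-A1', 'I-A1']
```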

    Knowledge Representation of Crime-Related Events: a Preliminary Approach

    Crime is reported in every daily newspaper and, in particular, in the criminal investigation reports produced by several police departments, creating a large amount of data to be processed by humans. Other research on relation extraction (a branch of information extraction) in Portuguese has arisen over the years, but with few extracted relations and a variety of computational approaches that could be improved with recent techniques to achieve better performance. This paper presents the ongoing work on populating the SEM (Simple Event Model) ontology with instances retrieved from crime-related documents, supported by an SVO (Subject, Verb, Object) algorithm that uses hand-crafted rules to extract events, achieving a performance of 0.86 (F-measure).
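
The SVO extraction step can be approximated with off-the-shelf tooling. The sketch below walks an English dependency parse with spaCy purely for illustration; the paper targets Portuguese crime reports and uses its own hand-crafted rules rather than this generic heuristic.

```python
# Rough SVO triple extraction over a dependency parse; a generic approximation,
# not the paper's rule set. Requires: pip install spacy
# and: python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")

def extract_svo(text):
    """Yield (subject, verb lemma, object) triples found in the text."""
    for sent in nlp(text).sents:
        for token in sent:
            if token.pos_ == "VERB":
                subjects = [c for c in token.children if c.dep_ in ("nsubj", "nsubjpass")]
                objects = [c for c in token.children if c.dep_ in ("dobj", "obj")]
                for subj in subjects:
                    for obj in objects:
                        yield (subj.text, token.lemma_, obj.text)

print(list(extract_svo("The suspect stole a car near the station.")))
# typically: [('suspect', 'steal', 'car')]
```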

    Multilingual Twitter Sentiment Classification: The Role of Human Annotators

    What are the limits of automated Twitter sentiment classification? We analyze a large set of manually labeled tweets in different languages, use them as training data, and construct automated classification models. It turns out that the quality of classification models depends much more on the quality and size of training data than on the type of the model trained. Experimental results indicate that there is no statistically significant difference between the performance of the top classification models. We quantify the quality of training data by applying various annotator agreement measures, and identify the weakest points of different datasets. We show that the model performance approaches the inter-annotator agreement when the size of the training set is sufficiently large. However, it is crucial to regularly monitor the self- and inter-annotator agreements since this improves the training datasets and consequently the model performance. Finally, we show that there is strong evidence that humans perceive the sentiment classes (negative, neutral, and positive) as ordered
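
The agreement measures mentioned above are standard; a minimal sketch with scikit-learn's Cohen's kappa (on invented labels, not the paper's Twitter data) shows how inter-annotator agreement can be tracked as an effective upper bound for model performance.

```python
# Toy inter-annotator agreement check with Cohen's kappa.
# Requires: pip install scikit-learn
from sklearn.metrics import cohen_kappa_score

annotator_a = ["positive", "neutral", "negative", "neutral", "positive", "negative"]
annotator_b = ["positive", "neutral", "negative", "positive", "positive", "negative"]

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Inter-annotator agreement (Cohen's kappa): {kappa:.2f}")
# A model's agreement with gold labels can be compared against this value:
# once it approaches the inter-annotator kappa, more training data helps little.
```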

    PRIVAFRAME: A Frame-Based Knowledge Graph for Sensitive Personal Data

    The pervasiveness of dialogue systems and virtual conversation applications raises an important theme: the potential of sharing sensitive information, and the consequent need for protection. To guarantee the subject’s right to privacy and avoid the leakage of private content, it is important to treat sensitive information. However, any treatment first requires identifying sensitive text, and appropriate techniques to do so automatically. The Sensitive Information Detection (SID) task has been explored in the literature in different domains and languages, but there is no common benchmark. Current approaches are mostly based on artificial neural networks (ANNs), in particular transformer models. Our research focuses on identifying categories of personal data in informal English sentences by adopting a new logical-symbolic approach, with a view to hybridising it with ANN models. We present a frame-based knowledge graph built for the personal data categories defined in the Data Privacy Vocabulary (DPV). The knowledge graph is designed through the logical composition of already existing frames, and has been evaluated as background knowledge for a SID system against a labeled sensitive-information dataset. The accuracy of PRIVAFRAME reached 78%. By comparison, a transformer-based model achieved 12% lower performance on the same dataset. The top-down logical-symbolic frame-based model allows a granular analysis and does not require a training dataset. These advantages lead us to use it as a layer in a hybrid model, where the logical SID is combined with an ANN-based SID tested in a previous study by the authors.
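
A toy version of the top-down, training-free matching idea: each personal-data category is tied to lexical triggers whose presence in a sentence flags that category. The category names and triggers below are invented; the actual PRIVAFRAME graph composes existing frames and uses the DPV category inventory.

```python
# Toy top-down detector for personal-data categories; only a sketch of the
# frame-based idea, with invented category names and trigger words.
import re

CATEGORY_TRIGGERS = {
    "HealthRecord": {"diagnosed", "prescription", "therapy"},
    "Location":     {"address", "moved to", "live in"},
    "Religion":     {"church", "mosque", "pray"},
}

def detect_categories(sentence):
    """Return the personal-data categories whose triggers appear in the sentence."""
    text = sentence.lower()
    return {category for category, triggers in CATEGORY_TRIGGERS.items()
            if any(re.search(r"\b" + re.escape(t) + r"\b", text) for t in triggers)}

print(detect_categories("I was diagnosed with asthma after we moved to Lisbon."))
# {'HealthRecord', 'Location'} (set order may vary)
```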

    Cross-Lingual Induction and Transfer of Verb Classes Based on Word Vector Space Specialisation

    Existing approaches to automatic VerbNet-style verb classification are heavily dependent on feature engineering and therefore limited to languages with mature NLP pipelines. In this work, we propose a novel cross-lingual transfer method for inducing VerbNets for multiple languages. To the best of our knowledge, this is the first study to demonstrate how architectures for learning word embeddings can be applied to this challenging syntactic-semantic task. Our method uses cross-lingual translation pairs to tie each of the six target languages into a bilingual vector space with English, jointly specialising the representations to encode the relational information from English VerbNet. A standard clustering algorithm is then run on top of the VerbNet-specialised representations, using vector dimensions as features for learning verb classes. Our results show that the proposed cross-lingual transfer approach sets new state-of-the-art verb classification performance across all six target languages explored in this work. Comment: EMNLP 2017 (long paper).
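
The final clustering step is standard. The sketch below runs k-means over randomly generated stand-in vectors with scikit-learn; in the paper, the input would instead be the VerbNet-specialised cross-lingual embeddings of the target-language verbs.

```python
# Clustering verb vectors into classes with k-means; the vectors are random
# stand-ins for the specialised embeddings described above.
# Requires: pip install scikit-learn numpy
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(seed=0)
verbs = ["geben", "schenken", "essen", "trinken", "laufen", "rennen"]
vectors = rng.normal(size=(len(verbs), 300))  # one 300-dimensional vector per verb

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(vectors)
for verb, cluster_id in zip(verbs, kmeans.labels_):
    print(f"{verb}\tclass {cluster_id}")
```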

    The precedence of syntax in the rapid emergence of human language in evolution as defined by the integration hypothesis

    Our core hypothesis is that the emergence of human language arose very rapidly from the linking of two pre-adapted systems found elsewhere in the animal world: an expression system, found, for example, in birdsong, and a lexical system, suggestively found in non-human primate calls (Miyagawa et al., 2013, 2014). We challenge the view that language has undergone a series of gradual changes, or a single preliminary protolinguistic stage, before achieving its full character. We argue that a full-fledged combinatorial operation, Merge, triggered the integration of these two pre-adapted systems, giving rise to a fully developed language. This goes against the gradualist view that there existed a structureless, protolinguistic stage in which a rudimentary proto-Merge operation generated internally flat words. Proponents of that view take compounds in present-day language to be a fossilized form of this prior stage, a point which we question.