515 research outputs found

    Mapping WordNet Instances to Wikipedia

    Lexical resources differ from encyclopaedic resources: the two represent distinct types of resource, covering general language and named entities respectively. However, many lexical resources, including Princeton WordNet (PWN), contain proper nouns referring to named entities in the world, yet it is neither possible nor desirable for a lexical resource to cover all named entities that may reasonably occur in a text. In this paper, we propose that, instead of including synsets for instance concepts, PWN should provide links to Wikipedia articles describing those concepts. To enable this, we have created a gold-quality mapping between all 7,742 instances in PWN and Wikipedia (where such a mapping is possible). This resource aims to provide a gold standard for link discovery, while also allowing PWN to distinguish itself from resources such as DBpedia and BabelNet. Moreover, this linking connects PWN to the Linguistic Linked Open Data cloud, creating a richer, more usable resource for natural language processing.
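    The naive first step of such a mapping, turning a PWN instance lemma into a candidate Wikipedia article title, can be sketched as follows (the gold mapping in the paper was curated manually; the helper below is a hypothetical illustration of candidate generation only):

```python
# Hypothetical sketch: the paper's mapping is hand-curated; this only
# shows the obvious candidate-generation step from lemma to title.

def lemma_to_wikipedia_title(lemma: str) -> str:
    """Convert a PWN lemma such as 'new_york_city' into the
    Wikipedia title convention 'New_York_City'."""
    return "_".join(part.capitalize() for part in lemma.split("_"))

# A few PWN instance lemmas and their candidate article titles.
candidates = {l: lemma_to_wikipedia_title(l)
              for l in ["einstein", "new_york_city", "atlantic_ocean"]}
```

    A real pipeline would then verify each candidate against Wikipedia, since naive title generation fails for ambiguous or redirected articles.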

    Encoder-Attention-Based Automatic Term Recognition (EA-ATR)

    Automatic Term Recognition (ATR) is the task of finding terminology in raw text. It involves mining candidate terms from the text, filtering the candidates by scores computed with methods such as frequency of occurrence, and then ranking the surviving terms. Current approaches often rely on statistics and regular expressions over part-of-speech tags to identify terms, but this is error-prone. We propose a deep learning technique to improve the identification of candidate term sequences, using Bidirectional Encoder Representations from Transformers (BERT) embeddings to decide which sequences of words are terms. The model is trained on Wikipedia titles: we treat all Wikipedia titles as the positive set and random n-grams generated from raw text as a weak negative set. These sets are used to train a model in the Embed, Encode, Attend and Predict (EEAP) formulation, with BERT providing the embeddings. The model is then evaluated against domain-specific corpora such as GENIA (annotated biological terms) and Krapivin (scientific papers from the computer science domain).
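    The weak negative set described above, random n-grams drawn from raw text, can be sketched as follows (a minimal illustration, not the authors' code; the function name and sampling details are assumptions):

```python
import random

def random_ngrams(text: str, n: int, k: int, seed: int = 0) -> list:
    """Sample k random word n-grams from text to use as weak negatives."""
    rng = random.Random(seed)
    words = text.split()
    starts = [rng.randrange(len(words) - n + 1) for _ in range(k)]
    return [" ".join(words[s:s + n]) for s in starts]

# Wikipedia titles act as positives; random n-grams as weak negatives.
positives = ["machine translation", "neural network"]
negatives = random_ngrams("the model is trained on raw text from wikipedia", 2, 3)
```

    Because the negatives are sampled at random, some will coincidentally be valid terms, which is why the paper calls this a weak negative set.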

    An Introduction to the Five-Factor Model and Its Applications

    The five-factor model of personality is a hierarchical organization of personality traits in terms of five basic dimensions: Extraversion, Agreeableness, Conscientiousness, Neuroticism, and Openness to Experience. Research using both natural-language adjectives and theoretically based personality questionnaires supports the comprehensiveness of the model and its applicability across observers and cultures. This article summarizes the history of the model and its supporting evidence, discusses conceptions of the nature of the factors, and outlines an agenda for theorizing about the origins and operation of the factors. We argue that the model should prove useful both for individual assessment and for the elucidation of a number of topics of interest to personality psychologists.

    COST Action “European network for Web-centred linguistic data science” (NexusLinguarum)

    We present the current state of the large “European network for Web-centred linguistic data science”. In its first phase, the network has put in place several working groups to deal with specific topics, and has also implemented a first round of Short Term Scientific Missions (STSMs). The work presented here was supported in part by the COST Action CA18209 – NexusLinguarum “European network for Web-centred linguistic data science”, the project Prêt-à-LLOD under grant agreement no. 825182, and the ELEXIS project under grant agreement no. 731015.

    Crowd-Sourcing A High-Quality Dataset for Metaphor Identification in Tweets

    Metaphor is one of the most important elements of human communication, especially in informal settings such as social media. A number of datasets have been created for metaphor identification; however, the task has proven difficult due to the nebulous nature of metaphoricity. In this paper, we present a crowd-sourcing approach for creating a metaphor-identification dataset that rapidly achieves wide coverage of the different usages of metaphor in a given corpus while maintaining high accuracy. We validate this methodology by creating a set of 2,500 manually annotated tweets in English, for which we achieve inter-annotator agreement scores over 0.8, higher than other reported results that did not limit the task. The methodology uses an existing metaphor classifier to assist in identifying and selecting examples for annotation, reducing the cognitive load on annotators and enabling quick and accurate annotation. We selected a corpus of both general-language tweets and political tweets relating to Brexit, and we compare the resulting corpus across these two domains. As a result of this work, we have published the first dataset of tweets annotated for metaphors, which we believe will be invaluable for the development, training and evaluation of approaches to metaphor identification in tweets.
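    Classifier-assisted example selection of the kind described above can be sketched roughly as follows (a hypothetical illustration: the paper's actual selection criteria and classifier are not reproduced here):

```python
# Hypothetical sketch: an existing metaphor classifier scores each
# tweet, and tweets are ranked so that likely-metaphorical examples
# reach annotators first, reducing their cognitive load.

def select_for_annotation(tweets, score, k):
    """Rank tweets by classifier score and return the top k."""
    return sorted(tweets, key=score, reverse=True)[:k]

scores = {"a": 0.9, "b": 0.2, "c": 0.7}          # classifier outputs
chosen = select_for_annotation(list(scores), scores.get, 2)  # ['a', 'c']
```

    In practice the selection would be balanced against coverage, so that low-scoring but informative examples are not excluded entirely.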

    Comparison of Different Orthographies for Machine Translation of Under-Resourced Dravidian Languages

    Under-resourced languages are a significant challenge for statistical approaches to machine translation, and it has recently been shown that using training data from closely related languages can improve translation quality for these languages. While languages within the same family share many properties, many under-resourced languages are written in their own native scripts, which makes exploiting these similarities difficult. In this paper, we propose to alleviate the problem of differing scripts by transcribing the native script into a common representation, namely the Latin script or the International Phonetic Alphabet (IPA). In particular, we compare coarse-grained transliteration into the Latin script with fine-grained IPA transcription. We performed experiments on the English–Tamil, English–Telugu, and English–Kannada translation tasks. Our results show improvements in BLEU, METEOR and chrF scores from transliteration, and we find that transliteration into the Latin script outperforms the fine-grained IPA transcription.
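    Coarse-grained transliteration into the Latin script can be sketched as a character-level mapping (a toy sketch only: the table below covers two Kannada characters and ignores vowel signs, viramas and conjunct consonants, all of which a real transliterator must handle):

```python
# Toy character table; a real system needs the full script inventory
# and proper handling of combining marks, not a per-character lookup.
KANNADA_TO_LATIN = {"ಕ": "ka", "ನ": "na"}

def transliterate(text: str, table: dict) -> str:
    """Map each character through the table, passing unknowns through."""
    return "".join(table.get(ch, ch) for ch in text)

latin = transliterate("ಕನ", KANNADA_TO_LATIN)  # "kana"
```

    Mapping into a shared representation like this is what lets the related languages share vocabulary and subword units at training time.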

    Towards a Crowd-Sourced WordNet for Colloquial English

    Princeton WordNet is one of the most widely used resources for natural language processing, but it is updated only infrequently and cannot keep up with the fast-changing usage of English on social media platforms such as Twitter. The Colloquial WordNet aims to provide an open platform to which anyone can contribute while still following the structure of WordNet. Crowdsourced lexical resources often have significant quality issues, so care must be taken in the design of the interface to ensure quality. In this paper, we present the development of a platform that can be opened on the Web to any lexicographer who wishes to contribute to this resource, and the lexicographic methodology applied by this interface.

    A supervised approach to taxonomy extraction using word embeddings

    Large organizations commonly generate large collections of texts, and making sense of these collections is a significant challenge. One approach is to organize the concepts into a hierarchical structure so that similar concepts can be discovered and easily browsed. This approach was the subject of a recent evaluation campaign, TExEval; however, the results of that task showed that none of the systems consistently outperformed a relatively simple baseline. To address this, we propose a new method that uses supervised learning to combine multiple features, including the baseline features, with a support vector machine classifier. We show that this outperforms the baseline and thus provides a stronger method for identifying taxonomic relations than previous methods.
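    The supervised set-up can be sketched as follows (illustrative only: the feature trio, fixed weights, and threshold are assumptions standing in for a trained SVM such as scikit-learn's SVC, and the substring feature stands in for the simple baseline):

```python
# Sketch: each candidate (parent, child) pair gets several features,
# which a linear decision function combines. A real implementation
# would learn the weights by training an SVM on labelled pairs.

def features(parent: str, child: str, sim: float) -> list:
    """Hypothetical feature trio: embedding similarity, the baseline
    substring feature, and character-set Jaccard overlap."""
    substring = 1.0 if parent in child else 0.0
    overlap = len(set(parent) & set(child)) / len(set(parent) | set(child))
    return [sim, substring, overlap]

def is_taxonomic(feats, weights=(1.0, 2.0, 0.5), bias=-1.5) -> bool:
    """Linear decision function with illustrative fixed weights."""
    score = sum(w * f for w, f in zip(weights, feats)) + bias
    return score > 0

feats = features("science", "computer science", sim=0.8)
```

    Combining the baseline as just one feature among several is what lets the classifier recover the baseline's behaviour when it is right and override it when other evidence disagrees.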