13 research outputs found

    Crowd-Sourcing A High-Quality Dataset for Metaphor Identification in Tweets

    Get PDF
    Metaphor is one of the most important elements of human communication, especially in informal settings such as social media. A number of datasets have been created for metaphor identification; however, the task has proven difficult due to the nebulous nature of metaphoricity. In this paper, we present a crowd-sourcing approach for creating a metaphor identification dataset that rapidly achieves large coverage of the different usages of metaphor in a given corpus while maintaining high accuracy. We validate this methodology by creating a set of 2,500 manually annotated tweets in English, for which we achieve inter-annotator agreement scores over 0.8, higher than other reported results that did not limit the task. The methodology uses an existing metaphor classifier to assist in identifying and selecting examples for annotation, reducing the cognitive load on annotators and enabling quick and accurate annotation. We selected a corpus of both general-language tweets and political tweets relating to Brexit, and we compare the resulting corpus across these two domains. As a result of this work, we have published the first dataset of tweets annotated for metaphor, which we believe will be invaluable for the development, training and evaluation of approaches to metaphor identification in tweets.
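    The inter-annotator agreement of over 0.8 reported above is presumably a chance-corrected measure such as Cohen's kappa, though the abstract does not name the metric. As an illustration only, here is a minimal sketch of how Cohen's kappa is computed for two annotators labelling the same items:

    ```python
    from collections import Counter

    def cohens_kappa(labels_a, labels_b):
        """Cohen's kappa: agreement between two annotators, corrected for chance."""
        assert len(labels_a) == len(labels_b)
        n = len(labels_a)
        # Observed agreement: fraction of items both annotators labelled the same.
        observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
        # Expected agreement if each annotator labelled at random according
        # to their own marginal label distribution.
        freq_a, freq_b = Counter(labels_a), Counter(labels_b)
        expected = sum(
            (freq_a[label] / n) * (freq_b[label] / n)
            for label in set(labels_a) | set(labels_b)
        )
        return (observed - expected) / (1 - expected)

    # Toy example: two annotators marking tokens as metaphoric (1) or literal (0).
    a = [1, 0, 1, 1, 0, 0, 1, 0]
    b = [1, 0, 1, 0, 0, 0, 1, 0]
    print(cohens_kappa(a, b))  # 0.75: they agree on 7 of 8 items, 0.5 expected by chance
    ```

    A kappa above 0.8 is conventionally read as very strong agreement, which is why the abstract singles out that threshold.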

    Metaphor processing in tweets

    Get PDF
    Metaphor plays an important role in defining the interplay between cognition and language. Despite its fuzziness, this ubiquitous figurative device is an essential element of human communication that allows us, as humans, to better understand and thus communicate unfamiliar experiences and concepts in terms of familiar ones. Metaphor comprehension is a complex cognitive task that includes grasping the interaction between the underlying concepts; it is very challenging for humans, let alone computers. The last few decades have witnessed a growing interest in automating this cognitive process, with a wealth of ideas for modelling the computational recognition and comprehension of metaphors in text. Many approaches and techniques have been introduced to explore the automatic processing of different types of metaphor and the preparation of metaphor-related resources. Despite the attention that metaphor processing has gained recently, the majority of existing approaches do not process metaphors in informal settings such as social media. Twitter offers a novel way of communicating that enables users all over the world to share their thoughts and experiences. The content circulated on this platform through short, informal tweets poses a challenge for automatic language processing due to the unstructured nature and brevity of the text as well as the vagueness of topics. These unique characteristics of tweets, coupled with the importance of studying metaphoric usage on social media, motivated me to study metaphor processing in this context. Metaphor processing in tweets can benefit many social media analysis applications, including political discourse analysis and health communication analysis. In this thesis, I investigate the automatic processing of metaphors in tweets, focusing on two main tasks: metaphor identification and interpretation.
My aim is to improve metaphor identification in order to study the usage of metaphoric language in healthcare communication and political discourse on social media, and to improve metaphor interpretation in order to aid language learners and to enrich lexical resources. I therefore study various NLP and deep learning techniques to automatically identify and interpret metaphors in tweets. To the best of my knowledge, there has been no previous attempt to process metaphors in tweets, in part due to the lack of tweet datasets annotated for linguistic metaphor. The focus of the work presented here is thus not only introducing models to process metaphors in tweets but also developing the necessary datasets. Overall, the work is divided into three main research themes: the first focuses on the development of metaphor annotation schemes and the preparation of datasets for both tasks. The second is concerned with the automatic identification of linguistic metaphors in tweets under a relational paradigm, which explores three main ideas: distributional semantics, meta-embedding learning and contextual modulation. The last theme focuses on metaphor interpretation via the more complex "definition generation" approach, which provides a full explanation of a given metaphoric expression. Experiments are conducted on the introduced tweet datasets as well as benchmark metaphor datasets to show the effectiveness of the proposed approaches. Furthermore, the proposed datasets and the best models from this thesis will be made publicly available to facilitate research on metaphor processing in general and in tweets specifically.

    C4Corpus (CC BY-SA part)

    No full text
    A large web corpus (over 10 billion tokens) in 50+ languages, licensed under the Creative Commons license family, extracted from CommonCrawl, the largest publicly available general web crawl to date with about 2 billion crawled URLs.

    C4Corpus (CC BY-NC-SA part)

    No full text
    A large web corpus (over 10 billion tokens) in 50+ languages, licensed under the Creative Commons license family, extracted from CommonCrawl, the largest publicly available general web crawl to date with about 2 billion crawled URLs.

    C4Corpus (CC BY-NC part)

    No full text
    A large web corpus (over 10 billion tokens) in 50+ languages, licensed under the Creative Commons license family, extracted from CommonCrawl, the largest publicly available general web crawl to date with about 2 billion crawled URLs.

    C4Corpus (CC-BY part)

    No full text
    A large web corpus (over 10 billion tokens) in 50+ languages, licensed under the Creative Commons license family, extracted from CommonCrawl, the largest publicly available general web crawl to date with about 2 billion crawled URLs.

    C4Corpus: Multilingual Web-size corpus with free license

    No full text
    Large web corpora containing full documents with permissive licenses are crucial for many NLP tasks. In this article we present the construction of a 12-million-page web corpus (over 10 billion tokens) in 50+ languages, licensed under the Creative Commons license family, extracted from CommonCrawl, the largest publicly available general web crawl to date with about 2 billion crawled URLs. Our highly scalable Hadoop-based framework is able to process the full CommonCrawl corpus on a 2000+ CPU cluster on the Amazon Elastic MapReduce infrastructure. The processing pipeline includes license identification, state-of-the-art boilerplate removal, exact- and near-duplicate document removal, and language detection. The construction of the corpus is highly configurable and fully reproducible, and we provide both the framework (DKPro C4CorpusTools) and the resulting data (C4Corpus) to the research community.
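    The near-duplicate removal stage of a pipeline like this can be illustrated with a toy check. The sketch below uses word shingles and Jaccard similarity, a common formulation for web-scale deduplication; the actual DKPro C4CorpusTools implementation may use a different (e.g. hashing-based) algorithm, so the function names and the 0.8 threshold here are illustrative assumptions only:

    ```python
    def shingles(text, k=5):
        """Set of overlapping k-word shingles (word n-grams) for a document."""
        words = text.lower().split()
        return {tuple(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}

    def jaccard(a, b):
        """Jaccard similarity of two shingle sets: |intersection| / |union|."""
        if not a and not b:
            return 1.0
        return len(a & b) / len(a | b)

    def is_near_duplicate(doc1, doc2, threshold=0.8, k=5):
        """Flag two documents as near-duplicates if their shingle overlap is high."""
        return jaccard(shingles(doc1, k), shingles(doc2, k)) >= threshold

    d1 = "the quick brown fox jumps over the lazy dog near the river bank"
    d2 = "the quick brown fox jumps over the lazy dog near the river shore"
    print(is_near_duplicate(d1, d2))  # True: the two docs share 8 of 10 distinct shingles
    ```

    At CommonCrawl scale, pairwise comparison is infeasible, which is why production systems typically hash shingle sets (MinHash, SimHash) so candidate duplicate pairs can be found without comparing every document to every other.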

    C4Corpus (CC BY-ND part)

    No full text
    A large web corpus (over 10 billion tokens) in 50+ languages, licensed under the Creative Commons license family, extracted from CommonCrawl, the largest publicly available general web crawl to date with about 2 billion crawled URLs.

    C4Corpus (publicdomain part)

    No full text
    A large web corpus (over 10 billion tokens) in 50+ languages, licensed under the Creative Commons license family, extracted from CommonCrawl, the largest publicly available general web crawl to date with about 2 billion crawled URLs.

    C4Corpus (CC BY-NC-ND part)

    No full text
    A large web corpus (over 10 billion tokens) in 50+ languages, licensed under the Creative Commons license family, extracted from CommonCrawl, the largest publicly available general web crawl to date with about 2 billion crawled URLs.