11 research outputs found

    Incorporating Emoji Descriptions Improves Tweet Classification

    Article presenting a simple strategy for processing emojis in tweets: replace them with their natural-language descriptions and use pretrained word embeddings, as is normally done with standard words. Results show that this strategy is more effective than using pretrained emoji embeddings for tweet classification.
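    As a rough illustration of this preprocessing strategy, the sketch below uses the third-party `emoji` package to replace each emoji with its short-name description; the paper's own mapping and embedding pipeline may differ.

```python
# Minimal sketch: turn emojis into plain words so that ordinary pretrained
# word embeddings can cover them. Uses the `emoji` package (an assumption;
# the authors may rely on a different emoji-to-description mapping).
import emoji

def replace_emojis_with_descriptions(tweet: str) -> str:
    # demojize turns each emoji into its CLDR short name, e.g. "😂" ->
    # "face_with_tears_of_joy"; replacing underscores yields standard words.
    text = emoji.demojize(tweet, delimiters=(" ", " "))
    return text.replace("_", " ")

print(replace_emojis_with_descriptions("great game 😂🔥"))
# -> "great game  face with tears of joy  fire " (up to spacing)
```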

    Time Expressions Recognition with Word Vectors and Neural Networks

    This work re-examines the widely addressed problem of recognizing and interpreting time expressions, and proposes an approach based on distributed representations and artificial neural networks. Artificial neural networks allow us to build highly generic models, but the large variety of hyperparameters makes it difficult to determine the best configuration. In this work we study the behavior of different models by varying the number of layers, their sizes, and the normalization techniques used. We also analyze the behavior of distributed representations in the temporal domain, where we find interesting properties regarding order and granularity. The experiments were conducted mainly on Spanish, although the approach is generic and not tied to any particular language. This work aims to be a starting point towards processing temporality in text via word vectors and neural networks, without the need for any kind of feature engineering.
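    The general shape of such a model can be sketched as a sequence tagger over word vectors, whose depth, hidden sizes, and normalization are precisely the hyperparameters the paper varies. The PyTorch code below is an illustrative placeholder, not the authors' configuration.

```python
# Illustrative sketch (not the authors' exact architecture): a BiLSTM tagger
# over word vectors that labels each token of a sentence as part of a time
# expression or not. All sizes below are placeholder hyperparameters.
import torch
import torch.nn as nn

class TimexTagger(nn.Module):
    def __init__(self, vocab_size=10000, emb_dim=100, hidden=64, num_tags=3):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)  # init from pretrained vectors
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.norm = nn.LayerNorm(2 * hidden)          # one normalization choice to vary
        self.out = nn.Linear(2 * hidden, num_tags)    # e.g. B-TIMEX / I-TIMEX / O

    def forward(self, token_ids):
        h, _ = self.lstm(self.emb(token_ids))
        return self.out(self.norm(h))

tagger = TimexTagger()
scores = tagger(torch.randint(0, 10000, (1, 8)))  # batch of one 8-token sentence
print(scores.shape)  # torch.Size([1, 8, 3]): one tag distribution per token
```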

    Conception: Multilingually-Enhanced, Human-Readable Concept Vector Representations

    To date, the most successful word, word sense, and concept modelling techniques have used large corpora and knowledge resources to produce dense vector representations that capture semantic similarities in a relatively low-dimensional space. Most current approaches, however, suffer from a monolingual bias, with their strength depending on the amount of data available across languages. In this paper we address this issue and propose Conception, a novel technique for building language-independent vector representations of concepts which places multilinguality at its core while retaining explicit relationships between concepts. Our approach results in high-coverage representations that outperform the state of the art in multilingual and cross-lingual Semantic Word Similarity and Word Sense Disambiguation, proving particularly robust on low-resource languages. Conception – its software and the complete set of representations – is available at https://github.com/SapienzaNLP/conception.
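    As a hedged sketch of how such concept vectors are typically consumed downstream, word similarity can be scored as the cosine between sparse, concept-keyed vectors, which also works across languages when both words map to the same concept inventory. The toy vectors and concept identifiers below are invented; real representations come from the linked repository.

```python
# Generic consumption pattern for sparse, human-readable concept vectors:
# cross-lingual similarity as cosine over shared concept dimensions.
import math

def cosine(u, v):
    shared = set(u) & set(v)
    num = sum(u[k] * v[k] for k in shared)
    den = (math.sqrt(sum(x * x for x in u.values()))
           * math.sqrt(sum(x * x for x in v.values())))
    return num / den if den else 0.0

# Toy vectors keyed by made-up concept identifiers (illustration only).
vec_car_en = {"bn:engine": 0.8, "bn:wheel": 0.6, "bn:road": 0.3}
vec_auto_de = {"bn:engine": 0.7, "bn:wheel": 0.5, "bn:driver": 0.2}
print(round(cosine(vec_car_en, vec_auto_de), 3))
```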

    Sentiment Lexicon Adaptation with Context and Semantics for the Social Web

    Sentiment analysis over social streams offers governments and organisations a fast and effective way to monitor the public's feelings towards policies, brands, business, etc. General-purpose sentiment lexicons have been used to compute sentiment from social streams, since they are simple and effective. They calculate the overall sentiment of texts by using a general collection of words with predetermined sentiment orientations and strengths. However, a word's sentiment often varies with the context in which it appears, and new words may be encountered that are not covered by the lexicon, particularly in social media environments where content emerges and changes rapidly and constantly. In this paper, we propose a lexicon adaptation approach that uses contextual as well as semantic information extracted from DBpedia to update words' weighted sentiment orientations and to add new words to the lexicon. We evaluate our approach on three different Twitter datasets, and show that enriching the lexicon with contextual and semantic information improves sentiment computation by 3.4% in average accuracy and by 2.8% in average F1 measure.
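    A minimal sketch of the adaptation idea follows, assuming a simple update rule (the paper's actual approach also draws on semantic information from DBpedia): a word's prior score is nudged toward the average sentiment of its observed contexts, and unseen words enter the lexicon with that contextual average.

```python
# Simplified lexicon adaptation: context sentiment pulls word priors, and
# out-of-lexicon words are added. The update rule is illustrative only.
def adapt_lexicon(lexicon, corpus_contexts, rate=0.1):
    """lexicon: word -> score in [-1, 1].
    corpus_contexts: word -> list of sentiment scores of the contexts
    in which the word was observed in the stream."""
    for word, ctx_scores in corpus_contexts.items():
        ctx_mean = sum(ctx_scores) / len(ctx_scores)
        if word in lexicon:
            lexicon[word] += rate * (ctx_mean - lexicon[word])
        else:
            lexicon[word] = ctx_mean  # new word enters with contextual sentiment
    return lexicon

lex = {"sick": -0.7}
print(adapt_lexicon(lex, {"sick": [0.8, 0.9], "yolo": [0.5]}))
# "sick" drifts positive in slang-heavy streams; "yolo" is newly added.
```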

    Discovering multiword expressions

    In this paper, we provide an overview of research on multiword expressions (MWEs) from a natural language processing perspective. We examine methods developed for modelling MWEs that capture some of their linguistic properties, discussing their use for MWE discovery and for idiomaticity detection. We concentrate on their collocational and contextual preferences, along with their fixedness in terms of canonical forms and their lack of word-for-word translatability. We also discuss a sample of the MWE resources that have been used in intrinsic evaluation setups for these methods.
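    One classic collocational-preference measure used in MWE discovery is pointwise mutual information (PMI); the sketch below scores candidate bigrams this way. It is a generic illustration of the family of methods surveyed, not a method from the paper itself.

```python
# Score candidate two-word MWEs by PMI: log2( p(w1,w2) / (p(w1) * p(w2)) ).
# High-PMI bigrams co-occur far more often than chance, a collocational cue.
import math
from collections import Counter

def bigram_pmi(tokens):
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n = len(tokens)
    scores = {}
    for (w1, w2), c in bigrams.items():
        p_xy = c / (n - 1)
        p_x, p_y = unigrams[w1] / n, unigrams[w2] / n
        scores[(w1, w2)] = math.log2(p_xy / (p_x * p_y))
    return scores

toks = "he kicked the bucket and then kicked the ball".split()
for pair, s in sorted(bigram_pmi(toks).items(), key=lambda kv: -kv[1]):
    print(pair, round(s, 2))
```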

    TimeBench: A Comprehensive Evaluation of Temporal Reasoning Abilities in Large Language Models

    Understanding time is a pivotal aspect of human cognition, crucial in the broader framework of grasping the intricacies of the world. Previous studies typically focus on specific aspects of time and lack a comprehensive temporal reasoning benchmark. To address this issue, we propose TimeBench, a comprehensive hierarchical temporal reasoning benchmark that covers a broad spectrum of temporal reasoning phenomena and provides a thorough evaluation for investigating the temporal reasoning capabilities of large language models. We conduct extensive experiments on popular LLMs, such as GPT-4, LLaMA2, and Mistral, incorporating chain-of-thought prompting. Our experimental results indicate a significant performance gap between state-of-the-art LLMs and humans, highlighting that there is still a considerable distance to cover in temporal reasoning. We aspire for TimeBench to serve as a comprehensive benchmark, fostering research in temporal reasoning for LLMs. Our resources are available at https://github.com/zchuz/TimeBench.
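    The evaluation pattern such a benchmark implies can be sketched as follows. The prompt wording is illustrative rather than TimeBench's actual template, and `query_llm` is a placeholder for whatever model backend is under test.

```python
# Generic benchmark harness: wrap each temporal question in a chain-of-thought
# prompt, extract the final answer line, compare against gold labels.
COT_TEMPLATE = (
    "Question: {question}\n"
    "Let's think step by step, then give a single final line "
    "starting with 'Answer:'."
)

def evaluate(dataset, query_llm):
    correct = 0
    for item in dataset:
        reply = query_llm(COT_TEMPLATE.format(question=item["question"]))
        answer = reply.rsplit("Answer:", 1)[-1].strip().lower()
        correct += answer == item["gold"].lower()
    return correct / len(dataset)

# Usage with a stubbed model standing in for a real API client:
data = [{"question": "Which comes first, dawn or noon?", "gold": "dawn"}]
print(evaluate(data, lambda p: "Dawn precedes noon.\nAnswer: dawn"))  # 1.0
```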

    Exploiting Large Language Models to Train Automatic Detectors of Sensitive Data

    This thesis proposes an automated system designed to identify sensitive data within text documents, aligning with the definitions and regulations outlined in the General Data Protection Regulation (GDPR). It reviews the current state of the art in Personally Identifiable Information (PII) and sensitive data detection, and how machine learning models for Natural Language Processing (NLP) are tailored to perform these tasks. A critical challenge addressed in this work is the acquisition of suitable datasets for training and evaluating the proposed system. To overcome this obstacle, we explore the use of Large Language Models (LLMs) to generate synthetic datasets that serve as a valuable resource for training classification models. Both proprietary and open-source LLMs are leveraged to investigate the capabilities of local models in document generation. The thesis then presents a comprehensive framework for sensitive data detection, covering six key domains and proposing specific criteria to identify the disclosure of sensitive data which take into account context and domain relevance. To detect sensitive data, a variety of models are explored, mainly based on the Transformer architecture (Bidirectional Encoder Representations from Transformers, BERT), adapted to the tasks of text classification and Named Entity Recognition (NER). The models are evaluated with fine-grained metrics, and the NER model achieves the best results (a 90% score) when trained interchangeably on both datasets, also confirming the quality of the dataset generated with the open-source LLM.
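    A sketch of the NER-based detection step using the Hugging Face `transformers` pipeline follows; `dslim/bert-base-NER` is a generic public checkpoint standing in for the thesis's GDPR-specific model trained on the synthetic datasets.

```python
# Token-classification pipeline as the entity-spotting stage of a sensitive-
# data detector. The checkpoint here is a generic NER model, not the thesis's.
from transformers import pipeline

ner = pipeline("token-classification",
               model="dslim/bert-base-NER",
               aggregation_strategy="simple")

text = "Maria Rossi was treated for diabetes at a clinic in Rome."
for ent in ner(text):
    # Each hit carries a label, confidence score, and character offsets,
    # which downstream GDPR rules could map to sensitive-data categories.
    print(ent["entity_group"], ent["word"], round(float(ent["score"]), 2))
```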

    Enriching Affect Analysis Through Emotion and Sarcasm Detection

    Affect detection from text is the task of detecting affective states such as sentiment, mood and emotions from natural language text, including news comments, product reviews, discussion posts, tweets and so on. Broadly speaking, affect detection includes the related tasks of sentiment analysis, emotion detection and sarcasm detection, amongst others. In this dissertation, we seek to enrich textual affect analysis from two perspectives: emotion and sarcasm. Emotion detection entails classifying text into fine-grained categories of emotions such as happiness, sadness, surprise, and so on, whereas sarcasm detection seeks to identify the presence or absence of sarcasm in text. The task of emotion detection is particularly challenging due to the limited number of resources and because it involves a greater number of categories in which to undertake classification, with no fixed number or types of emotions. Similarly, the recently proposed task of sarcasm detection is complicated by the inherently sophisticated nature of sarcasm, where one typically says or writes the opposite of what one means. This dissertation consists of five contributions. First, we address word-emotion association, a fundamental building block of most, if not all, emotion detection systems. Current approaches to emotion detection rely on a handful of manually annotated resources, such as lexicons and datasets, for deriving word-emotion association. Instead, we propose novel models for augmenting word-emotion association to support unsupervised learning, which does not require labeled training data and can be extended to flexible taxonomies of emotions. Second, we study the problem of affective word representations, where affectively similar words are projected into neighboring regions of an n-dimensional embedding space. While existing techniques usually consider the lexical semantics and syntax of co-occurring words, thus rating emotionally dissimilar words occurring in similar contexts as highly similar, we integrate a rich spectrum of emotions into representation learning in order to cluster emotionally similar words closer together and emotionally dissimilar words farther from each other. The generated emotion-enriched word representations are found to be better at capturing relevant features for sentence-level emotion classification and emotion similarity tasks. Third, we investigate the problem of computational sarcasm detection. Generally, sarcasm detection is treated as a linguistic and lexical phenomenon, with limited emphasis on the emotional aspects of sarcasm. To address this gap, we propose novel models that enrich sarcasm detection by incorporating affective knowledge. In particular, document-level features obtained from affective word representations are utilized in designing classification systems. Through extensive evaluation on six datasets from three diverse domains of text, we demonstrate the potential of exploiting automatically induced features without the need for considerable manual feature engineering. Motivated by the importance of affective knowledge in detecting sarcasm, the fourth contribution of this thesis digs deeper and studies the role of transitions and relationships between different emotions in order to discover which emotions serve as more informative and discriminative features for distinguishing sarcastic utterances in text. Lastly, we show the usefulness of our proposed affective models by applying them in a non-affective framework: predicting the helpfulness of online reviews.
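    In the spirit of the emotion-enriched representations described above (not the dissertation's exact model), a retrofitting-style update can pull words that share an emotion label toward each other while staying anchored to their original vectors, as sketched below.

```python
# Simplified Faruqui-style retrofitting with emotion-sharing neighbors:
# each word vector is interpolated between its original embedding and the
# mean of its neighbors' current vectors. Illustrative, not the thesis model.
import numpy as np

def emotion_retrofit(vecs, emotion_neighbors, alpha=1.0, beta=1.0, iters=10):
    # alpha anchors each word to its original vector; beta pulls it toward
    # the mean of the words sharing its emotion label.
    new = {w: v.copy() for w, v in vecs.items()}
    for _ in range(iters):
        for w, nbrs in emotion_neighbors.items():
            if not nbrs:
                continue
            nbr_mean = np.mean([new[n] for n in nbrs], axis=0)
            new[w] = (alpha * vecs[w] + beta * nbr_mean) / (alpha + beta)
    return new

vecs = {"happy": np.array([1.0, 0.0]),
        "joyful": np.array([0.8, 0.3]),
        "gloomy": np.array([-0.9, 0.1])}
neighbors = {"happy": ["joyful"], "joyful": ["happy"], "gloomy": []}
print(emotion_retrofit(vecs, neighbors)["happy"])  # drifts toward "joyful"
```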

    Robust input representations for low-resource information extraction

    Recent advances in the field of natural language processing were achieved with deep learning models. This has led to a wide range of new research questions concerning the stability of such large-scale systems and their applicability beyond well-studied tasks and datasets, such as information extraction in non-standard domains and languages, in particular in low-resource environments. In this work, we address these challenges and make important contributions across fields such as representation learning and transfer learning by proposing novel model architectures and training strategies to overcome existing limitations, including a lack of training resources, domain mismatches and language barriers. In particular, we propose solutions to close the domain gap between representation models by, e.g., domain-adaptive pre-training or our novel meta-embedding architecture for creating a joint representation of multiple embedding methods. Our broad set of experiments demonstrates state-of-the-art performance of our methods on various sequence tagging and classification tasks and highlights their robustness in challenging low-resource settings across languages and domains.
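    One common meta-embedding recipe (project each source space into a shared space, then average) can be sketched as follows. The dimensions and the random stand-in "pretrained" matrices are placeholders, and the thesis's actual architecture is more elaborate than this.

```python
# Minimal meta-embedding sketch: two embedding spaces of different widths are
# mapped into one shared space and averaged, so the joint representation can
# draw on whichever source covers a given word best.
import numpy as np

rng = np.random.default_rng(0)
emb_a = rng.normal(size=(5000, 300))   # stand-in for fastText-style vectors
emb_b = rng.normal(size=(5000, 768))   # stand-in for BERT-style vectors

d = 256  # shared meta-embedding dimension (placeholder)
proj_a = rng.normal(size=(300, d)) / np.sqrt(300)
proj_b = rng.normal(size=(768, d)) / np.sqrt(768)

def meta_embed(word_id):
    # Project each source space into the shared space, then average.
    return 0.5 * (emb_a[word_id] @ proj_a + emb_b[word_id] @ proj_b)

print(meta_embed(42).shape)  # (256,)
```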