1,539 research outputs found
Crowdsourced real-world sensing: sentiment analysis and the real-time web
The advent of the real-time web is proving both challenging and disruptive for a number of areas of research, notably information retrieval and web data mining. As an area of research reaching maturity, sentiment analysis offers a promising direction for modelling the text content available in real-time streams. This paper reviews the real-time web as a new area of focus for sentiment analysis and discusses the motivations and challenges behind such a direction.
Social Analytics in an Enterprise Context: From Manufacturing to Software Development
Although customers are increasingly vocal in expressing their experiences, demands and needs on various social networks, companies of all sizes typically fail to gain insights from such social data effectively and, ultimately, to capture the market. This paper introduces the Anlzer analytics engine, which aims to leverage the "social" data deluge to help companies gain a deeper understanding of how their products are perceived and of emerging trends, so that they can embed these trends early in their product design phase. The proposed approach brings together polarity detection and trend analysis techniques, as presented in the architecture and demonstrated through a simple walkthrough of the Anlzer solution. The Anlzer implementation is domain-independent by design and is currently being tested in the furniture domain, yet it also brings significant added value to software design and development through its experimentation playground, which may provide indirect feedback on future software features while monitoring reactions to existing releases.
A Large-Scale Comparative Study of Accurate COVID-19 Information versus Misinformation
The COVID-19 pandemic led to an infodemic where an overwhelming amount of
COVID-19 related content was being disseminated at high velocity through social
media. This made it challenging for citizens to differentiate between accurate
and inaccurate information about COVID-19. This motivated us to carry out a
comparative study of the characteristics of COVID-19 misinformation versus
those of accurate COVID-19 information through a large-scale computational
analysis of over 242 million tweets. The study makes comparisons along four
key aspects: 1) the distribution of topics, 2) the live status of tweets, 3)
language analysis and 4) the spreading power over time. An added contribution
of this study is the creation of a COVID-19 misinformation classification
dataset. Finally, we demonstrate that this new dataset helps improve
misinformation classification by more than 9% based on average F1 measure
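The "average F1 measure" reported above can be illustrated with a minimal sketch. The abstract does not say which averaging scheme was used, so this example assumes macro-averaging (the unweighted mean of per-class F1 scores). The function names and the toy labels are hypothetical and are not taken from the study.

```python
def per_class_f1(y_true, y_pred, label):
    """F1 for one class: harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    if tp == 0:
        # No true positives: precision and recall are both 0 (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 (an assumed averaging scheme)."""
    labels = sorted(set(y_true) | set(y_pred))
    return sum(per_class_f1(y_true, y_pred, lab) for lab in labels) / len(labels)


# Toy labels invented for illustration; not data from the study.
y_true = ["misinfo", "misinfo", "accurate", "accurate"]
y_pred = ["misinfo", "accurate", "accurate", "accurate"]
score = macro_f1(y_true, y_pred)
print(f"macro F1 = {score:.3f}")
```

Macro-averaging weights every class equally, which matters when misinformation is the minority class. A micro-averaged or weighted score would generally give a different number.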
Multimodal Language Analysis with Recurrent Multistage Fusion
Computational modeling of human multimodal language is an emerging research
area in natural language processing spanning the language, visual and acoustic
modalities. Comprehending multimodal language requires modeling not only the
interactions within each modality (intra-modal interactions) but more
importantly the interactions between modalities (cross-modal interactions). In
this paper, we propose the Recurrent Multistage Fusion Network (RMFN) which
decomposes the fusion problem into multiple stages, each of them focused on a
subset of multimodal signals for specialized, effective fusion. Cross-modal
interactions are modeled using this multistage fusion approach which builds
upon intermediate representations of previous stages. Temporal and intra-modal
interactions are modeled by integrating our proposed fusion approach with a
system of recurrent neural networks. The RMFN displays state-of-the-art
performance in modeling human multimodal language across three public datasets
relating to multimodal sentiment analysis, emotion recognition, and speaker
traits recognition. We provide visualizations to show that each stage of fusion
focuses on a different subset of multimodal signals, learning increasingly
discriminative multimodal representations.
Comment: EMNLP 201
Language representations for computational argumentation
Argumentation is an essential feature and, arguably, one of the most exciting phenomena of natural language use. Accordingly, it has long fascinated scholars and researchers in various fields, such as linguistics and philosophy. Its computational analysis, falling under the notion of computational argumentation, is useful in a variety of domains of text for a range of applications. For instance, it can help to understand users’ stances in online discussion forums towards certain controversies, to provide targeted feedback to users for argumentative writing support, and to automatically summarize scientific publications. As in all natural language processing pipelines, the text we would like to analyze has to be introduced to computational argumentation models in the form of numeric features. Choosing suitable semantic representations is considered a core challenge in natural language processing. In this context, research employing static and contextualized pretrained text embedding models has recently been shown to reach state-of-the-art performance for a range of natural language processing tasks. However, previous work has noted the specific difficulty of computational argumentation scenarios, with language representations as one of the main bottlenecks, and has called for targeted research on the intersection of the two fields. Still, the efforts focusing on the interplay between computational argumentation and representation learning have been few and far between.
This is despite (a) the fast-growing body of work in both computational argumentation and representation learning in general and (b) the fact that some of the open challenges are well known in the natural language processing community.
In this thesis, we address this research gap and acknowledge the specific importance of research on the intersection of representation learning and computational argumentation.
To this end, we (1) identify a series of challenges driven by inherent characteristics of argumentation in natural language and (2) present new analyses, corpora, and methods to address and mitigate each of the identified issues. Concretely, we focus on five main challenges pertaining to the current state-of-the-art in computational argumentation:
(C1) External knowledge: static and contextualized language representations encode distributional knowledge only. We propose two approaches to complement this knowledge with knowledge from external resources. First, we inject lexico-semantic knowledge through an additional prediction objective in the pretraining stage. In a second study, we demonstrate how to inject conceptual knowledge post hoc employing the adapter framework. We show the effectiveness of these approaches on general natural language understanding and argumentative reasoning tasks.
(C2) Domain knowledge: pretrained language representations are typically trained on big and general-domain corpora. We study the trade-off between employing such large and general-domain corpora versus smaller and domain-specific corpora for training static word embeddings which we evaluate in the analysis of scientific arguments.
(C3) Complementarity of knowledge across tasks: many computational argumentation tasks are interrelated but are typically studied in isolation. In two case studies, we show the effectiveness of sharing knowledge across tasks. First, based on a corpus of scientific texts, which we extend with a new annotation layer reflecting fine-grained argumentative structures, we show that coupling the argumentative analysis with other rhetorical analysis tasks leads to performance improvements for the higher-level tasks.
In the second case study, we focus on assessing the argumentative quality of texts. To this end, we present a new multi-domain corpus annotated with ratings reflecting different dimensions of argument quality. We then demonstrate the effectiveness of sharing knowledge across the different quality dimensions in multi-task learning setups.
(C4) Multilinguality: argumentation arguably exists in all cultures and languages around the globe. To foster inclusive computational argumentation technologies, we dissect the current state-of-the-art in zero-shot cross-lingual transfer. We show big drops in performance when it comes to resource-lean and typologically distant target languages. Based on this finding, we analyze the reasons for these losses and propose to move to inexpensive few-shot target-language transfer, leading to consistent performance improvements in higher-level semantic tasks, e.g., argumentative reasoning.
(C5) Ethical considerations: envisioned computational argumentation applications, e.g., systems for self-determined opinion formation, are highly sensitive. We first discuss which ethical aspects should be considered when representing natural language for computational argumentation tasks. Focusing on the issue of unfair stereotypical bias, we then conduct a multi-dimensional analysis of the amount of bias in monolingual and cross-lingual embedding spaces. In the next step, we devise a general framework for implicit and explicit bias evaluation and debiasing. Employing intrinsic bias measures and benchmarks reflecting the semantic quality of the embeddings, we demonstrate the effectiveness of new debiasing methods, which we propose. Finally, we complement this analysis by testing the original as well as the debiased language representations for stereotypically unfair bias in argumentative inferences.
We hope that our contributions in language representations for computational argumentation fuel more research on the intersection of the two fields and contribute to fair, efficient, and effective natural language processing technologies
GPT-4V(ision) as A Social Media Analysis Engine
Recent research has offered insights into the extraordinary capabilities of
Large Multimodal Models (LMMs) in various general vision and language tasks.
There is growing interest in how LMMs perform in more specialized domains.
Social media content, inherently multimodal, blends text, images, videos, and
sometimes audio. Understanding social multimedia content remains a challenging
problem for contemporary machine learning frameworks. In this paper, we explore
GPT-4V(ision)'s capabilities for social multimedia analysis. We select five
representative tasks, including sentiment analysis, hate speech detection, fake
news identification, demographic inference, and political ideology detection,
to evaluate GPT-4V. Our investigation begins with a preliminary quantitative
analysis for each task using existing benchmark datasets, followed by a careful
review of the results and a selection of qualitative samples that illustrate
GPT-4V's potential in understanding multimodal social media content. GPT-4V
demonstrates remarkable efficacy in these tasks, showcasing strengths such as
joint understanding of image-text pairs, contextual and cultural awareness, and
extensive commonsense knowledge. Despite the overall impressive capacity of
GPT-4V in the social media domain, there remain notable challenges. GPT-4V
struggles with tasks involving multilingual social multimedia comprehension and
has difficulties in generalizing to the latest trends in social media.
Additionally, it exhibits a tendency to generate erroneous information in the
context of evolving celebrity and politician knowledge, reflecting the known
hallucination problem. The insights gleaned from our findings underscore a
promising future for LMMs in enhancing our comprehension of social media
content and its users through the analysis of multimodal information
Towards Knowledge-Grounded Counter Narrative Generation for Hate Speech
Tackling online hatred using informed textual responses - called counter
narratives - has been brought under the spotlight recently. Accordingly, a
research line has emerged to automatically generate counter narratives in order
to facilitate the direct intervention in the hate discussion and to prevent
hate content from further spreading. Still, current neural approaches tend to
produce generic/repetitive responses and lack grounded and up-to-date evidence
such as facts, statistics, or examples. Moreover, these models can create
plausible but not necessarily true arguments. In this paper we present the
first complete knowledge-bound counter narrative generation pipeline, grounded
in an external knowledge repository that can provide more informative content
to fight online hatred. Together with our approach, we present a series of
experiments that show its feasibility to produce suitable and informative
counter narratives in in-domain and cross-domain settings.
Comment: To appear in "Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics (ACL): Findings
Exploiting word embeddings for modeling bilexical relations
There has been an exponential surge of text data in recent years. As a consequence, unsupervised methods that make use of this data have been steadily growing in the field of natural language processing (NLP). Word embeddings are low-dimensional vectors obtained using unsupervised techniques on large unlabelled corpora, where words from the vocabulary are mapped to vectors of real numbers. Word embeddings aim to capture syntactic and semantic properties of words.
In NLP, many tasks involve computing the compatibility between lexical items under some linguistic relation. We call this type of relation a bilexical relation. Our thesis defines statistical models for bilexical relations that centrally make use of word embeddings. Our principal aim is that the word embeddings will favor generalization to words not seen during the training of the model.
The thesis is structured in four parts. In the first part of this thesis, we present a bilinear model over word embeddings that leverages a small supervised dataset for a binary linguistic relation. Our learning algorithm exploits low-rank bilinear forms and induces a low-dimensional embedding tailored for a target linguistic relation. This results in compressed task-specific embeddings.
In the second part of our thesis, we extend our bilinear model to a ternary setting and propose a framework for resolving prepositional phrase attachment ambiguity using word embeddings. Our models perform competitively with state-of-the-art models. In addition, our method obtains significant improvements on out-of-domain tests by simply using word embeddings induced from source and target domains.
In the third part of this thesis, we further extend the bilinear models to expand the vocabulary in the context of statistical phrase-based machine translation. Given a word in the source language, our model obtains a probabilistic list of possible translations in the target language. We do this by projecting pre-trained embeddings into a common subspace using a log-bilinear model. We empirically observe a significant improvement on an out-of-domain test set.
In the final part of our thesis, we propose a non-linear model that maps initial word embeddings to task-tuned word embeddings, in the context of a neural network dependency parser. We demonstrate its use for improved dependency parsing, especially for sentences with unseen words. We also show downstream improvements on a sentiment analysis task.
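The low-rank bilinear form described in the first part of the thesis abstract above can be sketched as follows. This is only an illustration of the general idea, assuming the common factorization W = UᵀV with rank k smaller than the embedding dimension d. The thesis's actual parameterization and learning algorithm are not given in the abstract, and the embeddings, factor matrices, and function names below are hypothetical.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def project(matrix, vec):
    """Multiply a (k x d) matrix, stored as a list of rows, by a d-dim vector."""
    return [dot(row, vec) for row in matrix]


def bilinear_score(x, y, U, V):
    """Compatibility score x^T W y with W = U^T V, computed without forming W.

    project(U, x) and project(V, y) are the k-dimensional, relation-specific
    compressed embeddings; the score is their inner product.
    """
    return dot(project(U, x), project(V, y))


# Hypothetical 3-dimensional embeddings and rank-2 factors (d = 3, k = 2).
emb = {
    "eat": [1.0, 0.0, 2.0],
    "apple": [0.5, 1.0, 0.0],
    "idea": [0.0, -1.0, 1.0],
}
U = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
V = [[1.0, 1.0, 0.0], [0.0, 0.0, -1.0]]

s_apple = bilinear_score(emb["eat"], emb["apple"], U, V)
s_idea = bilinear_score(emb["eat"], emb["idea"], U, V)
print(s_apple, s_idea)
```

Because the score depends only on the embeddings and the two small factor matrices, a word unseen in the supervised data can still be scored as long as it has an embedding, which is the kind of generalization the abstract aims for.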
- …