335 research outputs found
Detecting Policy Preferences and Dynamics in the UN General Debate with Neural Word Embeddings
Foreign policy analysis has been struggling to find ways to measure policy
preferences and paradigm shifts in international political systems. This paper
presents a novel, potential solution to this challenge, through the application
of a neural word embedding (Word2vec) model on a dataset featuring speeches by
heads of state or government in the United Nations General Debate. The paper
provides three key contributions based on the output of the Word2vec model.
First, it presents a set of policy attention indices, synthesizing the semantic
proximity of political speeches to specific policy themes. Second, it
introduces country-specific semantic centrality indices, based on topological
analyses of countries' semantic positions with respect to each other. Third, it
tests the hypothesis that there exists a statistical relation between the
semantic content of political speeches and UN voting behavior, falsifying it
and suggesting that political speeches contain information of different nature
then the one behind voting outcomes. The paper concludes with a discussion of
the practical use of its results and consequences for foreign policy analysis,
public accountability, and transparency
Near Real-Time Sentiment and Topic Analysis of Sport Events
Sport events’ media consumption patterns have started transitioning to a multi-screen paradigm, where, through multitasking, viewers are able to search for additional information about the event they are watching live, as well as contribute with their perspective of the event to other viewers. The audiovisual and multimedia industries, however, are failing to capitalize on this by not providing the sports’ teams and those in charge of the audiovisual production with insights on the final consumers perspective of sport events. As a result of this opportunity, this document focuses on presenting the development of a near real-time sentiment analysis tool and a near real-time topic analysis tool for the analysis of sports events’ related social media content that was published during the transmission of the respective events, thus enabling, in near real-time, the understanding of the sentiment of the viewers and the topics being discussed through each event.Os padrões de consumo de media, têm vindo a mudar para um paradigma de ecrãs múltiplos, onde, através de multitasking, os telespetadores podem pesquisar informações adicionais sobre o evento que estão a assistir, bem como partilhar a sua perspetiva do evento. As indústrias do setor audiovisual e multimédia, no entanto, não estão a aproveitar esta oportunidade, falhando em fornecer às equipas desportivas e aos responsáveis pela produção audiovisual uma visão sobre a perspetiva dos consumidores finais dos eventos desportivos. Como resultado desta oportunidade, este documento foca-se em apresentar o desenvolvimento de uma ferramenta de análise de sentimento e uma ferramenta de análise de tópicos para a análise, em perto de tempo real, de conteúdo das redes sociais relacionado com eventos esportivos e publicado durante a transmissão dos respetivos eventos, permitindo assim, em perto de tempo real, perceber o sentimento dos espectadores e os tópicos mais falados durante cada evento
Linear mappings: semantic transfer from transformer models for cognate detection and coreference resolution
Includes bibliographical references.2022 Fall.Embeddings or vector representations of language and their properties are useful for understanding how Natural Language Processing technology works. The usefulness of embeddings, however, depends on how contextualized or information-rich such embeddings are. In this work, I apply a novel affine (linear) mapping technique first established in the field of computer vision to embeddings generated from large Transformer-based language models. In particular, I study its use in two challenging linguistic tasks: cross-lingual cognate detection and cross-document coreference resolution. Cognate detection for two Low-Resource Languages (LRL), Assamese and Bengali, is framed as a binary classification problem using semantic (embedding-based), articulatory, and phonetic features. Linear maps for this task are extrinsically evaluated on the extent of transfer of semantic information between monolingual as well as multi-lingual models including those specialized for low-resourced Indian languages. For cross-document coreference resolution, whole-document contextual representations are generated for event and entity mentions from cross- document language models like CDLM and other BERT-variants and then linearly mapped to form coreferring clusters based on their cosine similarities. I evaluate my results on gold output based on established coreference metrics like BCUB and MUC. My findings reveal that linearly transforming vectors from one model's embedding space to another carries certain semantic information with high fidelity thereby revealing the existence of a canonical embedding space and its geometric properties for language models. Interestingly, even for a much more challenging task like coreference resolution, linear maps are able to transfer semantic information between "lighter" models or less contextual models and "larger" models with near-equivalent performance or even improved results in some cases
Unlocking the insights: data-analytic examinations of social signals in organizational practices
Using analytics to understand social data is an emerging research field across different academic disciplines including Information Systems. Information Systems (IS) discipline places the social media-related research at the intersection of people, organizations, and technology. While the recent focus of design science utilizes advanced data mining methods to develop decision-aiding frameworks and extend organizational boundaries by designing novel artifacts; the behavioral focus develops and tests theories that elucidate behaviors at the individual or organizational levels. This thesis comprises three essays aiming to contribute to IS research by both leveraging from the two foundational principles of IS -behavioral science and design science- and applying advanced analytics to further understand the executive, stakeholder, and brand relations. These implementations allow us to elegantly capture several important features of the social signals that have been employed in online networking platforms. In the first two chapters, social signals are examined as internal and external associations of the executives’ career outcomes. The third chapter investigates the social signals from brands’ perspectives. While the first two chapters highlight and examine the relationships of several constructs, the main emphasis of the third chapter lies on designing a valuable information systems tool to complement the theoretical studies and enhance the practical implementations
Exploiting word embeddings for modeling bilexical relations
There has been an exponential surge of text data in the recent years. As a consequence, unsupervised methods that make use of this data have been steadily growing in the field of natural language processing (NLP). Word embeddings are low-dimensional vectors obtained using unsupervised techniques on the large unlabelled corpora, where words from the vocabulary are mapped to vectors of real numbers. Word embeddings aim to capture syntactic and semantic properties of words.
In NLP, many tasks involve computing the compatibility between lexical items under some linguistic relation. We call this type of relation a bilexical relation. Our thesis defines statistical models for bilexical relations
that centrally make use of word embeddings. Our principle aim is that the word embeddings will favor generalization to words not seen during the training of the model.
The thesis is structured in four parts. In the first part of this thesis, we present a bilinear model over word embeddings that leverages a small supervised dataset for a binary linguistic relation. Our learning algorithm exploits low-rank bilinear forms and induces a low-dimensional embedding tailored for a target linguistic relation. This results in compressed task-specific embeddings.
In the second part of our thesis, we extend our bilinear model to a ternary
setting and propose a framework for resolving prepositional phrase attachment ambiguity using word embeddings. Our models perform competitively with state-of-the-art models. In addition, our method obtains significant improvements on out-of-domain tests by simply using word-embeddings induced from source and target domains.
In the third part of this thesis, we further extend the bilinear models for expanding vocabulary in the context of statistical phrase-based machine translation. Our model obtains a probabilistic list of possible translations of target language words, given a word in the source language. We do this by projecting pre-trained embeddings into a common subspace using a log-bilinear model. We empirically notice a significant improvement on an out-of-domain test set.
In the final part of our thesis, we propose a non-linear model that maps initial word embeddings to task-tuned word embeddings, in the context of a neural network dependency parser. We demonstrate its use for improved dependency parsing, especially for sentences with unseen words. We also show downstream improvements on a sentiment analysis task.En els darrers anys hi ha hagut un sorgiment notable de dades en format textual. Conseqüentment, en el camp del Processament del Llenguatge Natural (NLP, de l'anglès "Natural Language Processing") s'han desenvolupat mètodes no supervistats que fan ús d'aquestes dades. Els anomenats "word embeddings", o embeddings de paraules, són vectors de dimensionalitat baixa que s'obtenen mitjançant tècniques no supervisades aplicades a corpus textuals de grans volums. Com a resultat, cada paraula del diccionari es correspon amb un vector de nombres reals, el propòsit del qual és capturar propietats sintà ctiques i semà ntiques de la paraula corresponent. Moltes tasques de NLP involucren calcular la compatibilitat entre elements lèxics en l'à mbit d'una relació lingüÃstica. D'aquest tipus de relació en diem relació bilèxica. Aquesta tesi proposa models estadÃstics per a relacions bilèxiques que fan ús central d'embeddings de paraules, amb l'objectiu de millorar la generalització del model lingüÃstic a paraules no vistes durant l'entrenament. La tesi s'estructura en quatre parts. A la primera part presentem un model bilineal sobre embeddings de paraules que explota un conjunt petit de dades anotades sobre una relaxió bilèxica. L'algorisme d'aprenentatge treballa amb formes bilineals de poc rang, i indueix embeddings de poca dimensionalitat que estan especialitzats per la relació bilèxica per la qual s'han entrenat. Com a resultat, obtenim embeddings de paraules que corresponen a compressions d'embeddings per a una relació determinada. A la segona part de la tesi proposem una extensió del model bilineal a trilineal, i amb això proposem un nou model per a resoldre ambigüitats de sintagmes preposicionals que usa només embeddings de paraules. En una sèrie d'avaluacións, els nostres models funcionen de manera similar a l'estat de l'art. A més, el nostre mètode obté millores significatives en avaluacions en textos de dominis diferents al d'entrenament, simplement usant embeddings induïts amb textos dels dominis d'entrenament i d'avaluació. A la tercera part d'aquesta tesi proposem una altra extensió dels models bilineals per ampliar la cobertura lèxica en el context de models estadÃstics de traducció automà tica. El nostre model probabilÃstic obté, donada una paraula en la llengua d'origen, una llista de possibles traduccions en la llengua de destÃ. Fem això mitjançant una projecció d'embeddings pre-entrenats a un sub-espai comú, usant un model log-bilineal. EmpÃricament, observem una millora significativa en avaluacions en dominis diferents al d'entrenament. Finalment, a la quarta part de la tesi proposem un model no lineal que indueix una correspondència entre embeddings inicials i embeddings especialitzats, en el context de tasques d'anà lisi sintà ctica de dependències amb models neuronals. Mostrem que aquest mètode millora l'analisi de dependències, especialment en oracions amb paraules no vistes durant l'entrenament. També mostrem millores en un tasca d'anà lisi de sentiment
EVALITA Evaluation of NLP and Speech Tools for Italian - December 17th, 2020
Welcome to EVALITA 2020! EVALITA is the evaluation campaign of Natural Language Processing and Speech Tools for Italian. EVALITA is an initiative of the Italian Association for Computational Linguistics (AILC, http://www.ai-lc.it) and it is endorsed by the Italian Association for Artificial Intelligence (AIxIA, http://www.aixia.it) and the Italian Association for Speech Sciences (AISV, http://www.aisv.it)
Politische Maschinen: Maschinelles Lernen für das Verständnis von sozialen Maschinen
This thesis investigates human-algorithm interactions in sociotechnological ecosystems. Specifically, it applies machine learning and statistical methods to uncover political dimensions of algorithmic influence in social media platforms and automated decision making systems. Based on the results, the study discusses the legal, political and ethical consequences of algorithmic implementations.Diese Arbeit untersucht Mensch-Algorithmen-Interaktionen in sozio-technologischen Ökosystemen. Sie wendet maschinelles Lernen und statistische Methoden an, um politische Dimensionen des algorithmischen Einflusses auf Socialen Medien und automatisierten Entscheidungssystemen aufzudecken. Aufgrund der Ergebnisse diskutiert die Studie die rechtlichen, politischen und ethischen Konsequenzen von algorithmischen Anwendungen
Studying user income through language, behaviour and affect in social media
Automatically inferring user demographics from social media posts is useful for both social science research and a range of downstream applications in marketing and politics. We present the first extensive study where user behaviour on Twitter is used to build a predictive model of income. We apply non-linear methods for regression, i.e. Gaussian Processes, achieving strong correlation between predicted and actual user income. This allows us to shed light on the factors that characterise income on Twitter and analyse their interplay with user emotions and sentiment, perceived psycho-demographics and language use expressed through the topics of their posts. Our analysis uncovers correlations between different feature categories and income, some of which reflect common belief e.g. higher perceived education and intelligence indicates higher earnings, known differences e.g. gender and age differences, however, others show novel findings e.g. higher income users express more fear and anger, whereas lower income users express more of the time emotion and opinions
Recommended from our members
Social Measurement and Causal Inference with Text
The digital age has dramatically increased access to large-scale collections of digitized text documents. These corpora include, for example, digital traces from social media, decades of archived news reports, and transcripts of spoken interactions in political, legal, and economic spheres. For social scientists, this new widespread data availability has potential for improved quantitative analysis of relationships between language use and human thought, actions, and societal structure. However, the large-scale nature of these collections means that traditional manual approaches to analyzing content are extremely costly and do not scale. Furthermore, incorporating unstructured text data into quantitative analysis is difficult due to texts’ high-dimensional nature and linguistic complexity.
This thesis blends (a) the computational strengths of natural language processing (NLP) and machine learning to automate and scale-up quantitative text analysis with (b) two themes central to social scientific studies but often under-addressed in NLP: measurement—creating quantifiable summaries of empirical phenomena—and causal inference—estimating the effects of interventions. First, we address measuring class prevalence in document collections; we contribute a generative probabilistic modeling approach to prevalence estimation and show empirically that our model is more robust to shifts in class priors between training and inference. Second, we examine cross- document entity-event measurement; we contribute an empirical pipeline and a novel latent disjunction model to identify the names of civilians killed by police from our corpus of web-scraped news reports. Third, we gather and categorize applications that use text to reduce confounding from causal estimates and contribute a list of open problems as well as guidance about data processing and evaluation decisions in this area. Finally, we contribute a new causal research design to estimate the natural indirect and direct effects of social group signals (e.g. race or gender) on conversational outcomes with separate aspects of language as causal mediators; this chapter is motivated by a theoretical case study of U.S. Supreme Court oral arguments and the effect of an advocate’s gender on interruptions from justices. We conclude by discussing the relationship between measurement and causal inference with text and future work at this intersection
- …