11 research outputs found
Transfer Topic Labeling with Domain-Specific Knowledge Base: An Analysis of UK House of Commons Speeches 1935-2014
Topic models are widely used in natural language processing, allowing
researchers to estimate the underlying themes in a collection of documents.
Most topic models use unsupervised methods and hence require the additional
step of attaching meaningful labels to estimated topics. This process of manual
labeling is not scalable and suffers from human bias. We present a
semi-automatic transfer topic labeling method that seeks to remedy these
problems. Domain-specific codebooks form the knowledge-base for automated topic
labeling. We demonstrate our approach with a dynamic topic model analysis of
the complete corpus of UK House of Commons speeches 1935-2014, using the coding
instructions of the Comparative Agendas Project to label topics. We show that
our method works well for a majority of the topics we estimate; but we also
find that institution-specific topics, in particular on subnational governance,
require manual input. We validate our results using human expert coding.
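As a rough illustration of how a codebook can act as the knowledge base for labeling, the sketch below scores an estimated topic's top words against short category descriptions and falls back to manual labeling when nothing matches well. It is not the authors' method: the three codebook entries, the `label_topic` helper, and the 0.2 threshold are hypothetical stand-ins chosen for the example.

```python
import math
from collections import Counter

# Hypothetical codebook: category name -> short description (illustrative only,
# not taken from the Comparative Agendas Project coding instructions).
CODEBOOK = {
    "Health": "health care hospitals doctors nurses medical insurance",
    "Defence": "defence armed forces military army navy weapons",
    "Education": "schools teachers education universities students",
}


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def label_topic(top_words, codebook=CODEBOOK, threshold=0.2):
    """Return (label, score); label is None when no category is similar enough."""
    topic_vec = Counter(top_words)
    scores = {
        name: cosine(topic_vec, Counter(text.split()))
        for name, text in codebook.items()
    }
    best = max(scores, key=scores.get)
    if scores[best] < threshold:
        return None, scores[best]  # assumed fallback: leave for manual labeling
    return best, scores[best]


# A topic whose top words overlap with the "Health" description.
health_label, health_score = label_topic(["hospitals", "nurses", "doctors", "patients"])

# An institution-specific topic with no matching codebook entry.
local_label, local_score = label_topic(["scotland", "wales", "devolution", "council"])
```

In this toy setup the second topic receives no label, which mirrors the abstract's observation that institution-specific topics may still need manual input.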
Two Computational Models for Analyzing Political Attention in Social Media
Understanding how political attention is divided and over what subjects is crucial for research on areas such as agenda setting, framing, and political rhetoric. However, existing methods for measuring attention, such as manual labeling according to established codebooks, are expensive and restrictive. We describe two computational models that automatically distinguish topics in politicians’ social media content. Our models, one supervised classifier and one unsupervised topic model, provide different benefits. The supervised classifier reduces the labor required to classify content according to pre-determined topic lists. However, tweets do more than communicate policy positions. Our unsupervised model uncovers both political topics and other Twitter uses (e.g., constituent service). Together, these models are effective, inexpensive computational tools for political communication and social media research. We demonstrate their utility and discuss the different analyses they afford by applying both models to the tweets posted by members of the 115th U.S. Congress.
This material is based upon work supported by the National Science Foundation under Grant No. 1822228.
https://deepblue.lib.umich.edu/bitstream/2027.42/147460/6/Hemphill and Schopke - Two Compuational Models.pdf (revised article)
https://deepblue.lib.umich.edu/bitstream/2027.42/147460/1/Hemphill and Schopke - Two Computational Models.pdf (main article)
https://deepblue.lib.umich.edu/bitstream/2027.42/147460/8/ICWSM 2020 Two Computational Models.pptx (presentation with script)
Multi-Dimensional Explanation of Target Variables from Documents
Automated predictions require explanations to be interpretable by humans.
Past work used attention and rationale mechanisms to find words that predict
the target variable of a document. Often, though, they force a tradeoff
between noisy explanations and a drop in accuracy. Furthermore, rationale
methods cannot capture the multi-faceted nature of justifications for multiple
targets, because of the non-probabilistic nature of the mask. In this paper, we
propose the Multi-Target Masker (MTM) to address these shortcomings. The
novelty lies in the soft multi-dimensional mask that models a relevance
probability distribution over the set of target variables to handle
ambiguities. Additionally, two regularizers guide MTM to induce long,
meaningful explanations. We evaluate MTM on two datasets and show, using
standard metrics and human annotations, that the resulting masks are more
accurate and coherent than those generated by the state-of-the-art methods.
Moreover, MTM is the first to also achieve the highest F1 scores for all the
target variables simultaneously.
Comment: Accepted at AAAI 2021. 18 pages, 14 figures, 9 tables.
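To make the idea of a soft multi-dimensional mask concrete, here is a minimal sketch in which each word receives a probability distribution over the target variables via a softmax. This is only an illustration of the concept, not the MTM architecture: the words, targets, scores, and the `soft_mask` and `top_words` helpers are made up, and in a real model the scores would be learned rather than hard-coded.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def soft_mask(word_scores):
    """Turn per-word target scores into one probability distribution per word."""
    return [softmax(row) for row in word_scores]


def top_words(mask, words, target_index, min_prob=0.5):
    """Words whose relevance probability for the given target exceeds min_prob."""
    return [w for w, row in zip(words, mask) if row[target_index] > min_prob]


# Hypothetical review fragment with two hypothetical target variables.
words = ["crisp", "taste", "dark", "colour"]
targets = ["taste", "appearance"]

# Made-up relevance scores: one row per word, one column per target.
scores = [
    [2.0, 0.0],
    [3.0, -1.0],
    [0.0, 2.0],
    [-1.0, 3.0],
]

mask = soft_mask(scores)
taste_words = top_words(mask, words, targets.index("taste"))
appearance_words = top_words(mask, words, targets.index("appearance"))
```

Because every row of the mask is a distribution, an ambiguous word can spread its relevance across several targets instead of being forced into a single hard selection.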
Proposal for ontology learning from textual data that contributes to building the adaptive character of a unifying and formal conceptualization of the leadership domain
Leadership is an important area of research that has been approached from different
perspectives. This has generated multiple explanations or interpretations, leading to overlap
and ambiguity in the constructed information. Consequently, it has been difficult to arrive at a
unifying, adequately constructed, concise, and unambiguous conceptualization that provides a
global understanding of this area. This work presents an ontology learning proposal based on
textual data to establish the vocabulary and concepts of the leadership domain. In this way, it
contributes to building the adaptive character of a unifying and formal conceptualization of
this domain, one that allows identifying and expressing the changes the domain undergoes.
Degree: Master's, Magíster en Investigación Operativa y Estadística (Master in Operations Research and Statistics)
Contents
Chapter 1. Introduction
Chapter 2. Project description
  2.1. Statement of the research problem and justification
Chapter 3. Frame of reference
  3.1. Background and state of the art
    3.1.1. Leadership
    3.1.2. Schematization of leadership
    3.1.3. Ontology learning
    3.1.4. Topic models
Chapter 4. Objectives
  General objective
  Specific objectives
Chapter 5. Methodology
  5.1. Phase I: Preparation of textual data
    5.1.1. Document preprocessing
  5.2. Phase II: Evaluation of the corpus labeling
  5.3. Phase III: Construction and evaluation of a vocabulary of relevant terms
    5.3.1. Vocabulary construction and evaluation
    5.3.2. Sentence classification
  5.4. Phase IV: Concepts
    5.4.1. Topic models
    5.4.2. Gold-standard-based evaluation
    5.4.3. Topic coherence (TC)
    5.4.4. Document classification
Chapter 6. Results and discussion
  6.1. Phase I: Preprocessing
  6.2. Phase II: Evaluation of the labels
  6.3. Phase III: Terms: vocabulary construction and evaluation
    6.3.1. Gold-standard-based evaluation
    6.3.2. Sentence classification
  6.4. Phase IV: Concepts
    6.4.1. Evaluation of ontological structures
Chapter 7. Conclusions
References
Macro-micro approach for mining public sociopolitical opinion from social media
During the past decade, we have witnessed the emergence of social media, which has gained prominence as a means for the general public to exchange opinions on a broad range of topics. Furthermore, its social and temporal dimensions make it a rich resource for policy makers and organisations to understand public opinion. In this thesis, we present our research in understanding public opinion on Twitter along three dimensions: sentiment, topics and summary.
In the first line of our work, we study how to classify public sentiment on Twitter. We focus on the task of multi-target-specific sentiment recognition on Twitter, and propose an approach which utilises the syntactic information from the parse tree in conjunction with the left-right context of the target. We show state-of-the-art performance on two datasets, including a multi-target Twitter corpus on UK elections which we make publicly available for the research community. Additionally, we conduct two preliminary studies: cross-domain emotion classification on discourse around arts and cultural experiences, and social spam detection to improve the signal-to-noise ratio of our sentiment corpus.
Our second line of work focuses on automatic topical clustering of tweets. Our aim is to group tweets into a number of clusters, with each cluster representing a meaningful topic, story, event or a reason behind a particular choice of sentiment. We explore various ways of tackling this challenge and propose a two-stage hierarchical topic modelling system that is efficient and effective in achieving our goal.
Lastly, for our third line of work, we study the task of summarising tweets on common topics, with the goal of providing informative summaries of real-world events or stories, or explanations underlying the sentiment expressed towards an issue or entity. As most existing tweet summarisation approaches rely on extractive methods, we propose to apply a state-of-the-art neural abstractive summarisation model to tweets. We also tackle the challenge of cross-medium supervised summarisation with no target-medium training resources. To the best of our knowledge, there is no existing work on neural abstractive summarisation of tweets. In addition, we present a system that provides interactive visualisation of topic-entity sentiments and the corresponding summaries in chronological order.
Throughout the work presented in this thesis, we conduct experiments to evaluate and verify the effectiveness of our proposed models, comparing them to relevant baseline methods. Most of our evaluations are quantitative; however, we perform qualitative analyses where appropriate. This thesis provides insights and findings that can be used to better understand public opinion in social media.
A graph theoretical perspective for the unsupervised clustering of free text corpora
This thesis introduces a robust end-to-end topic discovery framework that extracts a set of coherent topics stemming intrinsically from document similarities. Some topic clustering methods support embedded vectors instead of the traditional Bag-of-Words (BoW) representation. Some do not require the number of topics as a hyperparameter, and others can extract multi-scale relations between topics. However, no topic clustering method supports all these properties together. This thesis addresses this gap in the literature by designing a framework that supports any type of document-level features, especially embedded vectors. The framework does not require uninformed decisions about the underlying data, such as the number of topics; instead, it extracts topics at multiple resolutions. To achieve this goal, we combine existing methods from natural language processing (NLP) for feature generation with methods from graph theory, first for graph construction based on semantic document similarities, then for graph partitioning to extract the corresponding topics at multiple resolutions. Finally, we use specific methods from statistical machine learning to obtain highly generalisable supervised models that serve as topic classifiers for real-time topic extraction. Our applications to both a noisy, specialised corpus of medical records (i.e., descriptions of patient incidents within the NHS) and public news articles in everyday language show that our framework extracts coherent topics that have better quantitative benchmark scores than other methods in most cases. The resulting multi-scale topics in both applications enable us to capture specific details more easily and to choose the relevant resolutions for the specific objective. This study contributes to the topic clustering literature by introducing a novel graph theoretical perspective that provides a combination of new properties. These properties are multiple resolutions, independence from uninformed decisions about the corpus, and the use of recent NLP features, such as vector embeddings.
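The pipeline described above (document vectors, a similarity graph, then partitions at several resolutions) can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the thesis's framework: the four toy vectors are made up, and thresholding followed by connected components is a deliberately simple stand-in for the graph construction and graph partitioning methods the thesis actually uses.

```python
import math


def cosine(u, v):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def similarity_graph(vectors, threshold):
    """Adjacency sets linking documents whose similarity is at least threshold."""
    adj = {i: set() for i in range(len(vectors))}
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            if cosine(vectors[i], vectors[j]) >= threshold:
                adj[i].add(j)
                adj[j].add(i)
    return adj


def connected_components(adj):
    """Connected components of the graph, each as a sorted list of node ids."""
    seen, components = set(), []
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], []
        while stack:
            node = stack.pop()
            component.append(node)
            for neighbour in adj[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        components.append(sorted(component))
    return components


def multi_resolution_topics(vectors, thresholds):
    """One partition per threshold; higher thresholds give finer clusters."""
    return {
        t: connected_components(similarity_graph(vectors, t)) for t in thresholds
    }


# Made-up document embeddings (in practice these would come from an NLP model).
docs = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.6, 0.8, 0.0],
    [0.0, 0.0, 1.0],
]

topics = multi_resolution_topics(docs, thresholds=[0.5, 0.9])
coarse, fine = topics[0.5], topics[0.9]
```

Here the lower threshold groups the first three documents into one coarse topic, while the higher threshold splits off the third document, which gives a small-scale picture of what extracting topics at multiple resolutions means.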