
    When topic models disagree: keyphrase extraction with multiple topic models

    We explore how the unsupervised extraction of topic-related keywords benefits from combining multiple topic models. We show that averaging multiple topic models, inferred from different corpora, leads to more accurate keyphrases than using a single topic model or other state-of-the-art techniques. The experiments confirm the intuitive idea that combining multiple models yields a significant benefit only if the models are sufficiently different, i.e., they should provide distinct contexts in terms of topical word importance.
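
    The averaging idea can be illustrated with a minimal Python sketch. The abstract does not state which quantity is averaged, so this assumes each topic model yields a per-word importance score for a document; the function name, the dictionary format, and the toy scores are all hypothetical.

```python
def average_word_importance(model_scores):
    """Average per-word importance scores across several topic models.

    model_scores: list of dicts mapping word -> importance (assumed format).
    A word missing from a model is treated as having importance 0.0.
    """
    vocabulary = set().union(*model_scores)
    n_models = len(model_scores)
    return {
        word: sum(scores.get(word, 0.0) for scores in model_scores) / n_models
        for word in vocabulary
    }


# Toy scores from two hypothetical topic models inferred from different corpora.
news_model = {"neural": 0.8, "network": 0.6}
wiki_model = {"neural": 0.2, "graph": 0.4}
combined = average_word_importance([news_model, wiki_model])
```

    When the models disagree, as they do here on "neural", the average gives a more moderate score than either model alone. Identical models would return their own scores unchanged, in line with the abstract's point that the models must differ for the combination to help.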

    Knowledge Base Population using Semantic Label Propagation

    A crucial aspect of a knowledge base population system that extracts new facts from text corpora is the generation of training data for its relation extractors. In this paper, we present a method that maximizes the effectiveness of newly trained relation extractors at a minimal annotation cost. Manual labeling can be significantly reduced by Distant Supervision, which is a method to construct training data automatically by aligning a large text corpus with an existing knowledge base of known facts. For example, all sentences mentioning both 'Barack Obama' and 'US' may serve as positive training instances for the relation born_in(subject,object). However, distant supervision typically results in a highly noisy training set: many training sentences do not really express the intended relation. We propose to combine distant supervision with minimal manual supervision in a technique called feature labeling, to eliminate noise from the large and noisy initial training set, resulting in a significant increase in precision. We further improve on this approach by introducing the Semantic Label Propagation method, which uses the similarity between low-dimensional representations of candidate training instances to extend the training set in order to increase recall while maintaining high precision. Our proposed strategy for generating training data is studied and evaluated on an established test collection designed for knowledge base population tasks. The experimental results show that the Semantic Label Propagation strategy leads to substantial performance gains when compared to existing approaches, while requiring an almost negligible manual annotation effort. Comment: Submitted to Knowledge Based Systems, special issue on Knowledge Bases for Natural Language Processing
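
    The distant supervision step described above can be sketched in a few lines of Python. This is an illustration only, not the authors' system: the function name, the crude case-sensitive substring matching, and the toy sentences are assumptions, and real pipelines would use entity linking rather than string search.

```python
def distant_supervision(sentences, knowledge_base):
    """Label every sentence that mentions both entities of a known fact.

    knowledge_base: list of (subject, relation, object) triples.
    Returns (sentence, subject, object, relation) training instances.
    """
    instances = []
    for sentence in sentences:
        for subject, relation, obj in knowledge_base:
            # Crude substring match; assumed here for brevity.
            if subject in sentence and obj in sentence:
                instances.append((sentence, subject, obj, relation))
    return instances


knowledge_base = [("Barack Obama", "born_in", "US")]
sentences = [
    "Barack Obama was born in Hawaii, US.",                # expresses born_in
    "Barack Obama addressed troops stationed in the US.",  # noisy positive
    "Angela Merkel visited the US.",                       # subject not mentioned
]
training_set = distant_supervision(sentences, knowledge_base)
```

    The second sentence is labeled born_in even though it does not express that relation. This is the kind of noise that the feature labeling step is meant to remove.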

    Topical word importance for fast keyphrase extraction

    We propose an improvement on a state-of-the-art keyphrase extraction algorithm, Topical PageRank (TPR), incorporating topical information from topic models. While the original algorithm requires a random walk for each topic in the topic model being used, ours is independent of the topic model, computing only a single PageRank for each text regardless of the number of topics in the model. This increases the speed drastically and makes the algorithm usable on large collections of text with vast topic models, without altering the performance of the original algorithm.
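
    A single biased PageRank over a word co-occurrence graph might look like the Python sketch below. It assumes the topical information has already been condensed into one importance score per word, which then biases the random jump; how that score is derived from the topic model is not given in the abstract. The function names, window size, and damping factor are illustrative choices.

```python
from collections import defaultdict


def cooccurrence_graph(tokens, window=2):
    """Undirected graph weighted by co-occurrence counts within `window` words."""
    graph = defaultdict(lambda: defaultdict(float))
    for i, word in enumerate(tokens):
        for other in tokens[i + 1 : i + 1 + window]:
            if other != word:
                graph[word][other] += 1.0
                graph[other][word] += 1.0
    return graph


def biased_pagerank(graph, importance, damping=0.85, iterations=50):
    """One PageRank run whose random jump follows per-word importance."""
    nodes = list(graph)
    total = sum(importance.get(n, 0.0) for n in nodes)
    if total > 0:
        jump = {n: importance.get(n, 0.0) / total for n in nodes}
    else:
        # Fall back to a uniform jump when no importance scores are given.
        jump = {n: 1.0 / len(nodes) for n in nodes}
    out_weight = {n: sum(graph[n].values()) for n in nodes}
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        rank = {
            n: (1 - damping) * jump[n]
            + damping * sum(rank[m] * graph[m][n] / out_weight[m] for m in graph[n])
            for n in nodes
        }
    return rank


tokens = "topic model keyphrase extraction topic model ranking".split()
importance = {"topic": 0.9, "model": 0.8, "keyphrase": 0.7}  # hypothetical scores
scores = biased_pagerank(cooccurrence_graph(tokens), importance)
```

    The cost is one power iteration per text, however many topics the model has. The original TPR formulation runs one walk per topic.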

    Break it Down for Me: A Study in Automated Lyric Annotation

    Comprehending lyrics, as found in songs and poems, can pose a challenge to human and machine readers alike. This motivates the need for systems that can understand the ambiguity and jargon found in such creative texts, and provide commentary to aid readers in reaching the correct interpretation. We introduce the task of automated lyric annotation (ALA). Like text simplification, a goal of ALA is to rephrase the original text in a more easily understandable manner. However, in ALA the system must often include additional information to clarify niche terminology and abstract concepts. To stimulate research on this task, we release a large collection of crowdsourced annotations for song lyrics. We analyze the performance of translation and retrieval models on this task, measuring performance with both automated and human evaluation. We find that each model captures a unique type of information important to the task. Comment: To appear in Proceedings of EMNLP 201

    Conjunctures of democracy erosion: Is Brazil a global paradigm of resilience?

    The paper aims to examine and understand the recent developments in Brazilian democracy from a sociological perspective. It offers an analysis of the conjunctural preconditions for the recent rise of authoritarian populism, presenting ways in which Brazil can be viewed as a paradigm of democratic erosion and/or resilience. The article describes the foundational premises that made the development of contemporary democracies possible, and it proceeds from this description to explain how features common to authoritarian populist movements in Brazil and elsewhere are detrimental to these premises. It is argued that democracies are likely to thrive when welfare provisions and access to human rights are open to increasing sectors of the population, generating an inclusionary citizenship effect. The political polarization regarding the Brazilian welfare system and the discourse against international human rights culminated in the weakening of the Brazilian welfare net and in setbacks in the recognition of human rights by courts. These processes preceded, and were aggravated by, the rise of authoritarian populism in Brazil, generating an exclusionary view of citizenship that tended to intensify social conflict, with increasing militarization at both governmental and social levels. Arguably, the absence of warfare or an imminent warfare threat in the most recent democratic transition in Brazil reduced the capacity of welfare and constitutional human rights provisions to limit the influence of the military on democracy. While the efforts to build up the welfare system and protect human rights are still ongoing, the militarization element remains latent, posing a constant threat to democratic consolidation in Brazil.

    Predicting suicide risk from online postings in Reddit: the UGent-IDLab submission to the CLPsych 2019 Shared Task A

    This paper describes IDLab’s text classification systems submitted to Task A as part of the CLPsych 2019 shared task. The aim of this shared task was to develop automated systems that predict the degree of suicide risk of people based on their posts on Reddit. Bag-of-words features, emotion features and post-level predictions are used to derive user-level predictions. Linear models and ensembles of these models are used to predict final scores. We find that predicting fine-grained risk levels is much more difficult than flagging potentially at-risk users. Furthermore, we do not find clear added value from building richer ensembles compared to simple baselines, given the available training data and the nature of the prediction task.