    Topic Similarity Networks: Visual Analytics for Large Document Sets

    We investigate ways in which to improve the interpretability of LDA topic models by better analyzing and visualizing their outputs. We focus on examining what we refer to as topic similarity networks: graphs in which nodes represent latent topics in text collections and links represent similarity among topics. We describe efficient and effective approaches to both building and labeling such networks. Visualizations of topic models based on these networks are shown to be a powerful means of exploring, characterizing, and summarizing large collections of unstructured text documents. They help to "tease out" non-obvious connections among different sets of documents and provide insights into how topics form larger themes. We demonstrate the efficacy and practicality of these approaches through two case studies: 1) NSF grants for basic research spanning a 14-year period and 2) the entire English portion of Wikipedia.
    Comment: 9 pages; 2014 IEEE International Conference on Big Data (IEEE BigData 2014)
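
    The abstract does not spell out its similarity measure, so the following is a minimal Python sketch of the core idea, assuming topics come as word distributions (e.g., rows of an LDA topic-word matrix) and using Jensen-Shannon distance with networkx; the paper's own building and labeling procedures may differ.

    # Hypothetical sketch: build a topic similarity network from LDA output.
    # Assumes `topics` is a (K, V) array of topic-word distributions; the
    # Jensen-Shannon measure and the threshold are illustrative choices,
    # not necessarily the ones used in the paper.
    import numpy as np
    import networkx as nx
    from scipy.spatial.distance import jensenshannon

    def topic_similarity_network(topics: np.ndarray, threshold: float = 0.6) -> nx.Graph:
        """Link topics whose distributions are closer than `threshold` (JS distance, base 2)."""
        k = topics.shape[0]
        g = nx.Graph()
        g.add_nodes_from(range(k))
        for i in range(k):
            for j in range(i + 1, k):
                dist = jensenshannon(topics[i], topics[j], base=2)  # in [0, 1]
                if dist < threshold:
                    g.add_edge(i, j, weight=1.0 - dist)  # higher weight = more similar
        return g

    # Toy example: three topics over a five-word vocabulary.
    rng = np.random.default_rng(0)
    toy = rng.dirichlet(np.ones(5), size=3)
    net = topic_similarity_network(toy)
    print(net.edges(data=True))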

    The 'who' and 'what' of #diabetes on Twitter

    Social media are being increasingly used for health promotion, yet the landscape of users, messages and interactions in such fora is poorly understood. Studies of social media and diabetes have focused mostly on patients, or public agencies addressing it, but have not looked broadly at all the participants or the diversity of content they contribute. We study Twitter conversations about diabetes through the systematic analysis of 2.5 million tweets collected over 8 months and the interactions between their authors. We address three questions: (1) what themes arise in these tweets? (2) who are the most influential users? (3) which types of users contribute to which themes? We answer these questions using a mixed-methods approach, integrating techniques from anthropology, network science and information retrieval such as thematic coding, temporal network analysis, and community and topic detection. Diabetes-related tweets fall within broad thematic groups: health information, news, social interaction, and commercial. At the same time, humorous messages and references to popular culture appear consistently, more than any other type of tweet. We classify authors according to their temporal 'hub' and 'authority' scores. Whereas the hub landscape is diffuse and fluid over time, top authorities are highly persistent across time and comprise bloggers, advocacy groups and NGOs related to diabetes, as well as for-profit entities without specific diabetes expertise. Top authorities fall into seven interest communities as derived from their Twitter follower network. Our findings have implications for public health professionals and policy makers who seek to use social media as an engagement tool and to inform policy design.
    Comment: 25 pages, 11 figures, 7 tables. Supplemental spreadsheet available from http://journals.sagepub.com/doi/suppl/10.1177/2055207616688841; Digital Health, Vol 3, 2017
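
    As one illustration of 'hub' and 'authority' scoring, here is a hedged Python sketch using the classic HITS algorithm via networkx on a toy directed interaction graph (an assumed edge u -> v when user u retweets or mentions user v); the paper's temporal windowing and data pipeline are not reproduced here.

    # Hypothetical sketch: hub/authority scores on a Twitter interaction graph.
    # Edge direction and all user names are invented toy data.
    import networkx as nx

    edges = [  # (source, target): source retweets or mentions target
        ("alice", "diabetes_ngo"), ("bob", "diabetes_ngo"),
        ("alice", "blogger1"), ("carol", "blogger1"), ("bob", "carol"),
    ]
    g = nx.DiGraph(edges)
    hubs, authorities = nx.hits(g, max_iter=1000, normalized=True)

    # Top authorities: accounts that many active users point to.
    for user, score in sorted(authorities.items(), key=lambda kv: -kv[1])[:3]:
        print(f"{user}: {score:.3f}")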

    Inferring Strategies for Sentence Ordering in Multidocument News Summarization

    The problem of organizing information for multidocument summarization so that the generated summary is coherent has received relatively little attention. While sentence ordering for single document summarization can be determined from the ordering of sentences in the input article, this is not the case for multidocument summarization, where summary sentences may be drawn from different input articles. In this paper, we propose a methodology for studying the properties of ordering information in the news genre and describe experiments done on a corpus of multiple acceptable orderings we developed for the task. Based on these experiments, we implemented a strategy for ordering information that combines constraints from chronological order of events and topical relatedness. Evaluation of our augmented algorithm shows a significant improvement in the ordering over two baseline strategies.
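
    A minimal Python sketch of the combined strategy as described: group summary sentences into topical blocks, order blocks by their earliest date, and keep chronological order within each block. The Sentence fields and the clustering input are assumptions for illustration, not the authors' implementation.

    # Hypothetical sketch: chronology + topical relatedness for ordering.
    from dataclasses import dataclass
    from datetime import date
    from collections import defaultdict

    @dataclass
    class Sentence:
        text: str
        topic: str      # assumed cluster label from a topical grouping step
        pub_date: date  # date of the source article

    def order_sentences(sentences: list[Sentence]) -> list[Sentence]:
        blocks: dict[str, list[Sentence]] = defaultdict(list)
        for s in sentences:
            blocks[s.topic].append(s)
        # Order blocks by earliest date; within a block, order by date.
        ordered = sorted(blocks.values(), key=lambda b: min(s.pub_date for s in b))
        return [s for block in ordered
                for s in sorted(block, key=lambda s: s.pub_date)]

    demo = [
        Sentence("Rescue teams arrived on Tuesday.", "rescue", date(2003, 5, 2)),
        Sentence("The quake struck Monday morning.", "event", date(2003, 5, 1)),
        Sentence("Aid flights resumed Wednesday.", "rescue", date(2003, 5, 3)),
    ]
    for s in order_sentences(demo):
        print(s.text)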

    Abstractive Multi-Document Summarization via Phrase Selection and Merging

    We propose an abstraction-based multi-document summarization framework that can construct new sentences by exploring more fine-grained syntactic units than sentences, namely, noun/verb phrases. Different from existing abstraction-based approaches, our method first constructs a pool of concepts and facts represented by phrases from the input documents. Then new sentences are generated by selecting and merging informative phrases to maximize the salience of phrases while satisfying the sentence construction constraints. We employ integer linear optimization to conduct phrase selection and merging simultaneously in order to achieve the globally optimal solution for a summary. Experimental results on the benchmark data set TAC 2011 show that our framework outperforms the state-of-the-art models under the automated pyramid evaluation metric, and achieves reasonably good results on manual linguistic quality evaluation.
    Comment: 11 pages, 1 figure, accepted as a full paper at ACL 2015
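
    To make the integer-linear-programming idea concrete, here is a hedged sketch of the phrase-selection side only, using PuLP; the phrases, salience scores, lengths, and the MAX_WORDS budget are invented toy values, and the paper's full model additionally encodes phrase merging and sentence-construction constraints in the same program.

    # Hypothetical sketch: phrase selection as a tiny ILP with PuLP.
    import pulp

    phrases = {  # phrase -> (salience score, length in words); toy values
        "rising sea levels": (3.0, 3),
        "threaten coastal cities": (2.5, 3),
        "new climate report": (2.0, 3),
        "was published yesterday": (0.5, 3),
    }
    MAX_WORDS = 8  # assumed summary length budget

    prob = pulp.LpProblem("phrase_selection", pulp.LpMaximize)
    x = {p: pulp.LpVariable(f"x_{i}", cat="Binary") for i, p in enumerate(phrases)}

    prob += pulp.lpSum(phrases[p][0] * x[p] for p in phrases)               # total salience
    prob += pulp.lpSum(phrases[p][1] * x[p] for p in phrases) <= MAX_WORDS  # length budget

    prob.solve(pulp.PULP_CBC_CMD(msg=False))
    print([p for p in phrases if x[p].value() == 1])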

    Determining citizens’ opinions about stories in the news media: analysing Google, Facebook and Twitter

    We describe a method whereby a governmental policy maker can discover citizens’ reactions to news stories. This is particularly relevant in the political world, where governments’ policy statements are reported by the news media and discussed by citizens. The work here addresses two main questions: where are citizens discussing a news story, and what are they saying? Our strategy to answer the first question is to find news articles pertaining to the policy statements, then perform internet searches for references to the news articles’ headlines and URLs. We have created a software tool that schedules repeating Google searches for the news articles and collects the results in a database, enabling the user to aggregate and analyse them to produce ranked tables of sites that reference the news articles. Using data mining techniques, we can analyse the data so that the resultant ranking reflects an overall aggregate score, taking into account multiple datasets; this shows the most relevant places on the internet where the story is discussed. To answer the second question, we introduce the WeGov toolbox as a tool for analysing citizens’ comments and behaviour pertaining to news stories. We first use the tool to identify social network discussions, using different strategies for Facebook and Twitter. We then apply different analysis components to the data to distil the essence of the social network users’ comments, to determine influential users, and to identify important comments.
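
    As a hedged illustration of the aggregation step only, the sketch below ranks the sites that reference a news article across repeated search runs; the rows, run IDs, and URLs are invented, and the collection itself (scheduled searches feeding a database) is out of scope here.

    # Hypothetical sketch: rank referencing sites across stored search runs.
    from collections import Counter
    from urllib.parse import urlparse

    # Toy rows as they might sit in the results database: (run_id, result_url).
    rows = [
        (1, "https://forum.example.org/thread/42"),
        (1, "https://news.example.com/story"),
        (2, "https://forum.example.org/thread/57"),
        (2, "https://blog.example.net/post"),
        (3, "https://forum.example.org/thread/42"),
    ]

    counts = Counter(urlparse(url).netloc for _, url in rows)
    for site, hits in counts.most_common():
        print(f"{site}: referenced in {hits} result(s)")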

    Event-based media monitoring methodology for Human Rights Watch

    Executive Summary
    This report, prepared by a team of researchers from the University of Minnesota for Human Rights Watch (HRW), investigates the use of event-based media monitoring (EMM) to review its application, identify its strengths and weaknesses, and offer suggestions on how HRW can better utilize EMM in its own work. Media monitoring systems include both human-operated (manual) and automated systems, both of which we review throughout the report. The process begins with the selection of news sources, proceeds to the development of a coding manual (for manual searches) or “dictionary” (for automated searches), continues with gathering data, and concludes with the coding of news stories.
    EMM enables the near real-time tracking of events reported by the media, allowing researchers to get a sense of the scope of and trends in an event, but there are limits to what EMM can accomplish on its own. The media will only cover a portion of a given event, so information will always be missing from EMM data. EMM also introduces research biases of various kinds; mitigating these biases requires careful selection of media sources and clearly defined coding manuals or dictionaries.
    In manual EMM, coding the gathered data requires human researchers to apply codebook rules in order to collect consistent data from each story they read. In automated EMM, computers apply the dictionary directly to the news stories, automatically picking up the desired information. There are trade-offs in each system: automated EMM can code stories far more quickly, but the software may incorrectly code stories, requiring manual corrections; conversely, manual EMM allows for a more nuanced analysis, but the investment of time and effort may diminish the tool’s utility. We believe that both manual and automated EMM, when deployed correctly, can effectively support human rights research and advocacy.
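
    To illustrate the automated “dictionary” approach in miniature, here is a hypothetical Python sketch that codes stories by keyword patterns; the categories and patterns are invented examples, and real EMM systems add negation handling, source metadata, deduplication, and manual review of miscoded stories.

    # Hypothetical sketch: automated EMM coding via a keyword dictionary.
    import re

    DICTIONARY = {  # category -> trigger patterns (toy entries)
        "arbitrary_detention": [r"\bdetain(ed|s)?\b", r"\barrest(ed|s)?\b"],
        "protest": [r"\bprotest(s|ers)?\b", r"\bdemonstration(s)?\b"],
    }

    def code_story(text: str) -> list[str]:
        """Return every category whose trigger terms appear in the story."""
        text = text.lower()
        return [cat for cat, patterns in DICTIONARY.items()
                if any(re.search(p, text) for p in patterns)]

    story = "Police detained dozens of protesters outside the ministry."
    print(code_story(story))  # -> ['arbitrary_detention', 'protest']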