18,013 research outputs found
Topic Similarity Networks: Visual Analytics for Large Document Sets
We investigate ways in which to improve the interpretability of LDA topic
models by better analyzing and visualizing their outputs. We focus on examining
what we refer to as topic similarity networks: graphs in which nodes represent
latent topics in text collections and links represent similarity among topics.
We describe efficient and effective approaches to both building and labeling
such networks. Visualizations of topic models based on these networks are shown
to be a powerful means of exploring, characterizing, and summarizing large
collections of unstructured text documents. They help to "tease out"
non-obvious connections among different sets of documents and provide insights
into how topics form larger themes. We demonstrate the efficacy and
practicality of these approaches through two case studies: 1) NSF grants for
basic research spanning a 14-year period and 2) the entire English portion of
Wikipedia.
Comment: 9 pages; 2014 IEEE International Conference on Big Data (IEEE BigData 2014)
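The graph-building step described above can be sketched in a few lines. This is a minimal illustration, assuming cosine similarity between LDA topic-word distributions as the link criterion (the paper's actual similarity measure may differ); the function names and toy data are invented.

```python
import numpy as np

def topic_similarity_network(topic_word, threshold=0.5):
    """Build a topic similarity network from a topic-word matrix.

    topic_word: (K, V) array; row k is topic k's distribution over the vocabulary.
    Returns edges (i, j, similarity) for topic pairs whose cosine
    similarity exceeds `threshold`.
    """
    unit = topic_word / np.linalg.norm(topic_word, axis=1, keepdims=True)
    sim = unit @ unit.T  # pairwise cosine similarities
    edges = []
    for i in range(len(topic_word)):
        for j in range(i + 1, len(topic_word)):
            if sim[i, j] > threshold:
                edges.append((i, j, float(sim[i, j])))
    return edges

# Three toy topics: the first two share most vocabulary mass, the third does not.
topics = np.array([
    [0.6, 0.3, 0.1, 0.0],
    [0.5, 0.4, 0.1, 0.0],
    [0.0, 0.0, 0.1, 0.9],
])
edges = topic_similarity_network(topics)
# Only topics 0 and 1 end up linked.
```

In a full pipeline the edge list would feed a graph library for layout and community detection, with nodes labeled by their top topic terms.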
The 'who' and 'what' of #diabetes on Twitter
Social media are being increasingly used for health promotion, yet the
landscape of users, messages and interactions in such fora is poorly
understood. Studies of social media and diabetes have focused mostly on
patients, or public agencies addressing it, but have not looked broadly at all
the participants or the diversity of content they contribute. We study Twitter
conversations about diabetes through the systematic analysis of 2.5 million
tweets collected over 8 months and the interactions between their authors. We
address three questions: (1) what themes arise in these tweets? (2) who are
the most influential users? and (3) which types of users contribute to which
themes? We answer these questions using a mixed-methods approach, integrating
techniques from anthropology, network science and information retrieval such as
thematic coding, temporal network analysis, and community and topic detection.
Diabetes-related tweets fall within broad thematic groups: health information,
news, social interaction, and commercial. At the same time, humorous messages
and references to popular culture appear consistently, more than any other type
of tweet. We classify authors according to their temporal 'hub' and 'authority'
scores. Whereas the hub landscape is diffuse and fluid over time, top
authorities are highly persistent across time and comprise bloggers, advocacy
groups and NGOs related to diabetes, as well as for-profit entities without
specific diabetes expertise. Top authorities fall into seven interest
communities as derived from their Twitter follower network. Our findings have
implications for public health professionals and policy makers who seek to use
social media as an engagement tool and to inform policy design.
Comment: 25 pages, 11 figures, 7 tables. Supplemental spreadsheet available
from http://journals.sagepub.com/doi/suppl/10.1177/2055207616688841, Digital
Health, Vol 3, 201
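The hub/authority classification mentioned above follows Kleinberg's HITS scheme. As a rough sketch of the standard update (the paper computes these scores per time window to obtain the temporal variant; the toy adjacency matrix below is invented):

```python
import numpy as np

def hits(adj, iters=50):
    """Compute HITS hub and authority scores by power iteration.

    adj[i, j] = 1 if user i links to (e.g. retweets or mentions) user j.
    """
    hubs = np.ones(adj.shape[0])
    auths = np.ones(adj.shape[0])
    for _ in range(iters):
        auths = adj.T @ hubs    # good authorities are pointed at by good hubs
        auths /= np.linalg.norm(auths)
        hubs = adj @ auths      # good hubs point at good authorities
        hubs /= np.linalg.norm(hubs)
    return hubs, auths

# Toy network: users 0 and 1 both retweet user 2.
adj = np.array([[0, 0, 1],
                [0, 0, 1],
                [0, 0, 0]], dtype=float)
hubs, auths = hits(adj)
# User 2 emerges as the authority; users 0 and 1 are the hubs.
```

Recomputing the scores over sliding time windows is what lets the study contrast a fluid hub landscape with persistent top authorities.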
Inferring Strategies for Sentence Ordering in Multidocument News Summarization
The problem of organizing information for multidocument summarization so that
the generated summary is coherent has received relatively little attention.
While sentence ordering for single document summarization can be determined
from the ordering of sentences in the input article, this is not the case for
multidocument summarization where summary sentences may be drawn from different
input articles. In this paper, we propose a methodology for studying the
properties of ordering information in the news genre and describe experiments
done on a corpus of multiple acceptable orderings we developed for the task.
Based on these experiments, we implemented a strategy for ordering information
that combines constraints from chronological order of events and topical
relatedness. Evaluation of our augmented algorithm shows a significant
improvement of the ordering over two baseline strategies.
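One plausible reading of the combined strategy, not necessarily the paper's exact algorithm, is to group topically related sentences into blocks and then order both the blocks and the sentences within each block chronologically. A sketch, with invented topic ids standing in for the topical-relatedness clustering:

```python
from collections import defaultdict

def order_sentences(sentences):
    """Order summary sentences by grouping topically related ones into
    blocks, then arranging blocks (and sentences within each block)
    chronologically.

    sentences: list of (text, topic_id, timestamp) triples.
    """
    blocks = defaultdict(list)
    for sent in sentences:
        blocks[sent[1]].append(sent)
    # A block's date is the earliest event it mentions.
    ordered_blocks = sorted(blocks.values(), key=lambda b: min(s[2] for s in b))
    ordered = []
    for block in ordered_blocks:
        ordered.extend(sorted(block, key=lambda s: s[2]))
    return [s[0] for s in ordered]

summary = order_sentences([
    ("It announced record results.", 1, 3),
    ("Markets opened cautiously.", 0, 1),
    ("Stocks later rose sharply.", 0, 2),
    ("The firm filed its report.", 1, 0),
])
```

Grouping before sorting keeps topically related sentences adjacent even when strict chronology would interleave them.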
Abstractive Multi-Document Summarization via Phrase Selection and Merging
We propose an abstraction-based multi-document summarization framework that
can construct new sentences by exploring more fine-grained syntactic units than
sentences, namely, noun/verb phrases. Different from existing abstraction-based
approaches, our method first constructs a pool of concepts and facts
represented by phrases from the input documents. Then new sentences are
generated by selecting and merging informative phrases to maximize the salience
of phrases and meanwhile satisfy the sentence construction constraints. We
employ integer linear optimization for conducting phrase selection and merging
simultaneously in order to achieve the global optimal solution for a summary.
Experimental results on the benchmark data set TAC 2011 show that our framework
outperforms the state-of-the-art models under automated pyramid evaluation
metric, and achieves reasonably good results on manual linguistic quality
evaluation.
Comment: 11 pages, 1 figure, accepted as a full paper at ACL 201
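The selection step can be illustrated on a toy instance. The paper solves an integer linear program jointly over phrase selection and merging; the brute-force search below stands in for an ILP solver and keeps only the salience objective and a word-budget constraint (a simplification: the actual model also enforces sentence-construction constraints). All data are invented.

```python
from itertools import combinations

def select_phrases(phrases, max_words):
    """Choose a subset of phrases maximizing total salience subject to a
    word budget, by exhaustive search over subsets (exact on tiny inputs).

    phrases: list of (phrase_text, salience) pairs.
    """
    best, best_score = [], 0.0
    for r in range(len(phrases) + 1):
        for subset in combinations(phrases, r):
            words = sum(len(p.split()) for p, _ in subset)
            score = sum(s for _, s in subset)
            if words <= max_words and score > best_score:
                best, best_score = [p for p, _ in subset], score
    return best, best_score

chosen, score = select_phrases(
    [("the new policy", 3.0), ("was announced friday", 2.5), ("by officials", 1.0)],
    max_words=6,
)
# The two most salient phrases fit the budget; adding the third would not.
```

An ILP formulation expresses the same search with binary selection variables, which is what makes a globally optimal summary tractable at realistic scale.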
Determining citizens’ opinions about stories in the news media: analysing Google, Facebook and Twitter
We describe a method whereby a governmental policy maker can discover citizens’ reactions to news stories. This is particularly relevant in the political world, where governments’ policy statements are reported by the news media and discussed by citizens. The work here addresses two main questions: where are citizens discussing a news story, and what are they saying? Our strategy for answering the first question is to find news articles pertaining to the policy statements, then perform internet searches for references to the news articles’ headlines and URLs. We have created a software tool that schedules repeating Google searches for the news articles and collects the results in a database, enabling the user to aggregate and analyse them to produce ranked tables of sites that reference the news articles. Using data mining techniques, we analyse the data so that the resultant ranking reflects an overall aggregate score across multiple datasets, showing the most relevant places on the internet where the story is discussed. To answer the second question, we introduce the WeGov toolbox as a tool for analysing citizens’ comments and behaviour pertaining to news stories. We first use the tool to identify social network discussions, using different strategies for Facebook and Twitter. We then apply different analysis components to distil the essence of the social network users’ comments, to determine influential users, and to identify important comments.
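The aggregation step, turning repeated search runs into a ranked table of referencing sites, might look like the following sketch. The tool's actual scoring is not specified in the abstract; counting each site at most once per search run is an assumption made here so that a single verbose run cannot dominate the ranking. The URLs are invented.

```python
from collections import Counter
from urllib.parse import urlparse

def rank_referencing_sites(result_sets):
    """Aggregate repeated search-result snapshots into a ranked table of
    sites referencing a news article.

    result_sets: one list of result URLs per scheduled search run.
    """
    counts = Counter()
    for results in result_sets:
        # Deduplicate within a run so each site scores at most 1 per run.
        for site in {urlparse(url).netloc for url in results}:
            counts[site] += 1
    return counts.most_common()

ranked = rank_referencing_sites([
    ["http://blog.example.com/post1", "http://news.example.org/a",
     "http://blog.example.com/post2"],
    ["http://blog.example.com/post3"],
])
# blog.example.com appears in both runs and ranks first.
```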
Event-based media monitoring methodology for Human Rights Watch
Executive Summary
This report, prepared by a team of researchers from the University of Minnesota for Human Rights Watch (HRW), investigates the use of event-based media monitoring (EMM) to review its application, identify its strengths and weaknesses, and offer suggestions on how HRW can better utilize EMM in its own work.
Media monitoring systems include both human-operated (manual) and automated systems, both of which we review throughout the report. The process begins with the selection of news sources, proceeds to the development of a coding manual (for manual searches) or “dictionary” (for automated searches), continues with gathering data, and concludes with the coding of news stories.
EMM enables the near real-time tracking of events reported by the media, allowing researchers to get a sense of the scope of and trends in an event, but there are limits to what EMM can accomplish on its own. The media will only cover a portion of a given event, so information will always be missing from EMM data. EMM also introduces research biases of various kinds; mitigating these biases requires careful selection of media sources and clearly defined coding manuals or dictionaries.
In manual EMM, coding the gathered data requires human researchers to apply codebook rules in order to collect consistent data from each story they read. In automated EMM, computers apply the dictionary directly to the news stories, automatically picking up the desired information. There are trade-offs in each system. Automated EMM can code stories far more quickly, but the software may incorrectly code stories, requiring manual corrections. Conversely, manual EMM allows for a more nuanced analysis, but the investment of time and effort may diminish the tool’s utility. We believe that both manual and automated EMM, when deployed correctly, can effectively support human rights research and advocacy.
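The automated coding step described above amounts to applying the dictionary to each story. A minimal sketch, with an invented two-category dictionary (a real EMM dictionary would be far larger and tuned to the chosen sources):

```python
import re

# Illustrative dictionary: event categories mapped to trigger keywords.
DICTIONARY = {
    "protest": ["protest", "demonstration", "march"],
    "arrest": ["arrest", "detain", "custody"],
}

def code_story(text, dictionary=DICTIONARY):
    """Tag a news story with every event category whose trigger keywords
    appear in the text (prefix match, so 'detained' matches 'detain')."""
    lowered = text.lower()
    return [
        category
        for category, keywords in dictionary.items()
        if any(re.search(r"\b" + re.escape(kw), lowered) for kw in keywords)
    ]

codes = code_story("Police detained dozens after a march through downtown.")
```

Spurious matches of this kind (e.g. "march" matching the month rather than the event) are exactly the automated-coding errors the report says require manual correction.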