37 research outputs found
A Framework for Comparing Groups of Documents
We present a general framework for comparing multiple groups of documents. A
bipartite graph model is proposed where document groups are represented as one
node set and the comparison criteria are represented as the other node set.
Using this model, we present basic algorithms to extract insights into
similarities and differences among the document groups. Finally, we demonstrate
the versatility of our framework through an analysis of NSF funding programs
for basic research.
Comment: 6 pages; 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP '15)
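As a rough illustration of the bipartite model described in this abstract, the sketch below stores document groups as one node set and comparison criteria as the other, with weighted edges between them. The group names, criterion names, edge weights, and the choice of cosine similarity are all assumptions made for illustration; the abstract does not specify the paper's actual algorithms.

```python
from math import sqrt

# Hypothetical bipartite graph: each document group maps to the comparison
# criteria it is linked to, with an illustrative edge weight.
edges = {
    "group_a": {"topic_1": 0.6, "topic_2": 0.4},
    "group_b": {"topic_1": 0.5, "topic_2": 0.5},
    "group_c": {"topic_3": 1.0},
}


def cosine(u, v):
    """Cosine similarity between two sparse weight vectors stored as dicts."""
    dot = sum(w * v.get(c, 0.0) for c, w in u.items())
    norm_u = sqrt(sum(w * w for w in u.values()))
    norm_v = sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def group_similarity(graph, g1, g2):
    """Similarity of two groups, measured over the criteria they share."""
    return cosine(graph[g1], graph[g2])


def distinctive_criteria(graph, group):
    """Criteria linked to `group` and to no other group (a simple 'difference')."""
    others = set()
    for name, neighbours in graph.items():
        if name != group:
            others.update(neighbours)
    return sorted(c for c in graph[group] if c not in others)
```

Under these assumptions, groups that share heavily weighted criteria score close to 1, groups with no shared criteria score 0, and `distinctive_criteria` lists what sets one group apart.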
Topic Similarity Networks: Visual Analytics for Large Document Sets
We investigate ways in which to improve the interpretability of LDA topic
models by better analyzing and visualizing their outputs. We focus on examining
what we refer to as topic similarity networks: graphs in which nodes represent
latent topics in text collections and links represent similarity among topics.
We describe efficient and effective approaches to both building and labeling
such networks. Visualizations of topic models based on these networks are shown
to be a powerful means of exploring, characterizing, and summarizing large
collections of unstructured text documents. They help to "tease out"
non-obvious connections among different sets of documents and provide insights
into how topics form larger themes. We demonstrate the efficacy and
practicality of these approaches through two case studies: 1) NSF grants for
basic research spanning a 14 year period and 2) the entire English portion of
Wikipedia.
Comment: 9 pages; 2014 IEEE International Conference on Big Data (IEEE BigData 2014)
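A topic similarity network of the kind described in this abstract can be sketched as follows: each node is a topic (a distribution over words), and two topics are linked when their similarity passes a threshold. The toy topics, the cosine measure, and the threshold value are assumptions for illustration only; the abstract does not say which similarity measure or linking rule the paper uses.

```python
from itertools import combinations
from math import sqrt

# Hypothetical topic-word distributions, e.g. as produced by an LDA model.
topics = {
    "t0": {"galaxy": 0.5, "telescope": 0.3, "star": 0.2},
    "t1": {"star": 0.6, "galaxy": 0.3, "planet": 0.1},
    "t2": {"protein": 0.7, "cell": 0.3},
}


def cosine(p, q):
    """Cosine similarity between two sparse word distributions."""
    dot = sum(w * q.get(word, 0.0) for word, w in p.items())
    norm_p = sqrt(sum(w * w for w in p.values()))
    norm_q = sqrt(sum(w * w for w in q.values()))
    if norm_p == 0.0 or norm_q == 0.0:
        return 0.0
    return dot / (norm_p * norm_q)


def build_network(topic_words, threshold=0.3):
    """Return weighted links (a, b, similarity) for topic pairs above threshold."""
    links = []
    for a, b in combinations(sorted(topic_words), 2):
        sim = cosine(topic_words[a], topic_words[b])
        if sim >= threshold:
            links.append((a, b, sim))
    return links


links = build_network(topics)
```

With these toy topics, only the two astronomy-flavoured topics share vocabulary, so they form the single link in the network.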
Dirichlet Methods for Bayesian Source Detection in Radio Astronomy Images
The sheer volume of data to be produced by the next generation of radio telescopes—exabytes of data on hundreds of millions of objects—makes automated methods for the detection of astronomical objects ("sources") essential. Of particular importance are low surface brightness objects, which are not well found by current automated methods.
This thesis explores Bayesian methods for source detection that use Dirichlet or multinomial models for pixel intensity distributions in discretised radio astronomy images. A novel image discretisation method that incorporates uncertainty about how the image should be discretised is developed. Latent Dirichlet allocation — a method originally developed for inferring latent topics in document collections — is used to estimate source and background distributions in radio astronomy images. A new Dirichlet-multinomial ratio, indicating how well a region conforms to a well-specified model of background versus a loosely-specified model of foreground, is derived. Finally, latent Dirichlet allocation and the Dirichlet-multinomial ratio are combined for source detection in astronomical images.
The methods developed in this thesis perform source detection well in comparison to two widely used source detection packages and, importantly, find dim sources not well found by other algorithms.
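One plausible reading of a Dirichlet-multinomial ratio is a log ratio of marginal likelihoods for a region's pixel-intensity counts under two Dirichlet priors: a tightly concentrated background prior and a diffuse foreground prior. The sketch below uses the standard Dirichlet-multinomial marginal likelihood; the prior values and function names are assumptions, and this is not claimed to be the derivation given in the thesis.

```python
from math import lgamma


def log_dm_likelihood(counts, alpha):
    """Log Dirichlet-multinomial marginal likelihood of `counts` under prior `alpha`.

    The multinomial coefficient is omitted because it cancels in the ratio.
    """
    total_alpha = sum(alpha)
    total_count = sum(counts)
    result = lgamma(total_alpha) - lgamma(total_alpha + total_count)
    for x, a in zip(counts, alpha):
        result += lgamma(a + x) - lgamma(a)
    return result


def log_dm_ratio(counts, alpha_background, alpha_foreground):
    """Positive values favour the background model, negative the foreground."""
    return (log_dm_likelihood(counts, alpha_background)
            - log_dm_likelihood(counts, alpha_foreground))


# Hypothetical priors over three discretised intensity bins (low, mid, high).
alpha_bg = [90.0, 9.0, 1.0]  # well-specified: mostly low-intensity pixels
alpha_fg = [1.0, 1.0, 1.0]   # loosely specified: any intensity profile

dim_region = [9, 1, 0]       # counts that look like background
bright_region = [0, 1, 9]    # counts dominated by high intensities
```

Under these assumed priors, `log_dm_ratio(dim_region, alpha_bg, alpha_fg)` is positive and `log_dm_ratio(bright_region, alpha_bg, alpha_fg)` is negative, so thresholding the ratio separates background-like regions from candidate sources.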
Proceedings of the Workshop Semantic Content Acquisition and Representation (SCAR) 2007
This is the proceedings of the Workshop on Semantic Content Acquisition and Representation, held in conjunction with NODALIDA 2007, on May 24, 2007 in Tartu, Estonia.