
    Concise comparative summaries (CCS) of large text corpora with a human experiment

    In this paper we propose a general framework for topic-specific summarization of large text corpora and illustrate how it can be used for the analysis of news databases. Our framework, concise comparative summarization (CCS), is built on sparse classification methods. CCS is a lightweight and flexible tool that offers a compromise between the simple word-frequency methods currently in wide use and heavier, model-intensive methods such as latent Dirichlet allocation (LDA). We argue that sparse methods have much to offer for text analysis, and we hope CCS opens the door to a new branch of research in this important field. For a particular topic of interest (e.g., China or energy), CCS automatically labels documents as either on- or off-topic (usually via keyword search), and then uses sparse classification methods to predict these labels from the high-dimensional counts of all the other words and phrases in the documents. The resulting small set of predictive phrases is harvested as the summary. To validate our tool, we designed and conducted a human survey, using news articles from the New York Times international section, to compare the different summarizers with human understanding. We demonstrate our approach with two case studies: a media analysis of the framing of “Egypt” in the New York Times throughout the Arab Spring, and an informal comparison of the New York Times’ and Wall Street Journal’s coverage of “energy.” Overall, we find that the Lasso with L2 normalization can be used effectively and usefully to summarize large corpora, regardless of document size.
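    The pipeline the abstract describes (keyword-search labeling, high-dimensional phrase counts, L2 normalization, a sparse fit, then harvesting the nonzero phrases) is straightforward to sketch. Below is a minimal, hypothetical implementation assuming scikit-learn; the function name, n-gram range, and regularization strength are illustrative choices, not the paper's exact configuration.

```python
# A minimal sketch of the CCS recipe described above, assuming scikit-learn.
# `ccs_summary`, the n-gram range, and `alpha` are illustrative assumptions.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import Lasso
from sklearn.preprocessing import normalize

def ccs_summary(docs, query, alpha=0.01):
    # Step 1: automatically label documents on-/off-topic via keyword search.
    y = np.array([1.0 if query.lower() in d.lower() else 0.0 for d in docs])

    # Step 2: high-dimensional counts of all other words and phrases,
    # dropping any feature that contains the query term itself.
    vec = CountVectorizer(ngram_range=(1, 2), stop_words="english")
    X = vec.fit_transform(docs)
    terms = np.array(vec.get_feature_names_out())
    keep = np.flatnonzero([query.lower() not in t for t in terms])
    X, terms = X[:, keep], terms[keep]

    # Step 3: L2-normalize each document's count vector, then fit the Lasso
    # (the "Lasso with L2 normalization" the abstract refers to).
    X = normalize(X, norm="l2")
    model = Lasso(alpha=alpha)
    model.fit(X, y)

    # Step 4: harvest the small set of phrases with nonzero weights.
    w = model.coef_
    return sorted(((terms[i], w[i]) for i in np.flatnonzero(w)),
                  key=lambda tw: -abs(tw[1]))
```

    With `alpha` set large enough, only a handful of coefficients survive the L1 penalty, and that short phrase list plays the role of the comparative summary.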

    Convex Approaches to Text Summarization

    This dissertation presents techniques for the summarization and exploration of text documents. Many approaches to the analysis of news media can be analogized to well-defined, well-studied problems from statistical machine learning. The problem of feature selection, for classification and dimensionality-reduction tasks, is formulated to assist with these media-analysis tasks. Taking advantage of L1 regularization, convex programs can be used to solve these feature selection problems efficiently. These methods show the potential to conduct media analysis at a scale commensurate with the growing volume of data available to news consumers.

    The dissertation first presents an example of text mining over a vector space model. Given two news articles on a related theme, a series of additional articles is pulled from a large pool of candidates to help link the two input items. The novel algorithm is based on finding the documents whose vector representations are nearest the convex combinations of the inputs (a sketch follows this abstract). Comparisons to competing algorithms show performance matching a state-of-the-art method at lower computational complexity.

    The design of a relational database for typical text mining tasks is then discussed. The architecture trades off the organizational and data-quality advantages of normalization against the performance boost from replicating entity attributes across tables. The vector space model of text is implemented explicitly as a three-column table (see the schema sketch below).

    The predictive framework, connecting news-analysis tasks to feature selection and classification problems, is then explored explicitly. The validity of this analogy is tested with a particular task: given a query term and a corpus of news articles, provide a short list of word tokens that distinguish how this word appears within the corpus. Example summary lists were produced by five algorithms and presented to volunteer readers. Evidence suggests that an L1-regularized logistic regression model, trained over the documents with labels indicating the presence or absence of the query word, selected the word features that best summarized the query.

    To contend with tasks that do not lend themselves to a predictive framework, a sparse variant of latent semantic indexing is investigated. [cont.]
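    The linking method described above, retrieving candidates nearest the convex combinations of two input documents, can be sketched as follows. This is a hypothetical rendering assuming tf-idf vectors and cosine similarity; the function name, vectorizer settings, and lambda grid are assumptions, not the dissertation's exact code.

```python
# Sketch of convex-combination linking: for interior points along the
# segment between two document vectors, retrieve the nearest candidate.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def link_articles(doc_a, doc_b, candidates, n_steps=5):
    vec = TfidfVectorizer(stop_words="english")
    X = vec.fit_transform([doc_a, doc_b] + list(candidates))
    a, b, pool = X[0], X[1], X[2:]

    chain = []
    # Interior convex combinations lam*a + (1-lam)*b, endpoints excluded.
    for lam in np.linspace(0.0, 1.0, n_steps + 2)[1:-1].tolist():
        target = lam * a + (1.0 - lam) * b
        sims = cosine_similarity(pool, target).ravel()
        chain.append(candidates[int(np.argmax(sims))])
    return chain  # candidate articles bridging doc_a toward doc_b
```

    A nearby candidate may win at several values of lambda; a fuller implementation would deduplicate the chain or exclude already-chosen documents at each step.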
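    The three-column table mentioned above is a coordinate-format encoding of the sparse document-term matrix: one row per nonzero (document, term, count) entry. The schema below is a minimal illustration using SQLite; the table and column names are assumptions, and the dissertation's actual schema and database engine may differ.

```python
# Hypothetical three-column vector-space-model table, shown with SQLite.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE doc_term (
        doc_id  INTEGER NOT NULL,   -- which document
        term_id INTEGER NOT NULL,   -- which word or phrase
        count   INTEGER NOT NULL,   -- occurrences of the term in the document
        PRIMARY KEY (doc_id, term_id)
    )
""")
# Each row is one nonzero entry of the sparse document-term matrix.
conn.execute("INSERT INTO doc_term VALUES (?, ?, ?)", (1, 42, 3))
# Reconstruct one document's vector (its nonzero coordinates):
rows = conn.execute(
    "SELECT term_id, count FROM doc_term WHERE doc_id = ?", (1,)
).fetchall()
```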

    Demo hour


    Fiduciary Law and Economic Development: Attorneys As Trusted Agents in Nineteenth Century American Commerce
