Multi-Document Summarization via Discriminative Summary Reranking
Existing multi-document summarization systems usually rely on a specific
summarization model (i.e., a summarization method with a specific parameter
setting) to extract summaries for different document sets with different
topics. However, according to our quantitative analysis, none of the existing
summarization models can always produce high-quality summaries for different
document sets, and even a summarization model with good overall performance may
produce low-quality summaries for some document sets. On the contrary, a
baseline summarization model may produce high-quality summaries for some
document sets. Based on the above observations, we treat the summaries produced
by different summarization models as candidate summaries, and then explore
discriminative reranking techniques to identify high-quality summaries from the
candidates for different document sets. We propose to extract a set of
candidate summaries for each document set based on an ILP framework, and then
leverage Ranking SVM for summary reranking. Various useful features have been
developed for the reranking process, including word-level features,
sentence-level features and summary-level features. Evaluation results on the
benchmark DUC datasets validate the efficacy and robustness of our proposed
approach.
Enumeration of Extractive Oracle Summaries
To analyze the limitations and the future directions of the extractive
summarization paradigm, this paper proposes an Integer Linear Programming (ILP)
formulation to obtain extractive oracle summaries in terms of ROUGE-N. We also
propose an algorithm that enumerates all of the oracle summaries for a set of
reference summaries to exploit F-measures that evaluate which system summaries
contain how many sentences that are extracted as an oracle summary. Our
experimental results obtained from Document Understanding Conference (DUC)
corpora demonstrated the following: (1) room still exists to improve the
performance of extractive summarization; (2) the F-measures derived from the
enumerated oracle summaries have significantly stronger correlations with human
judgment than those derived from single oracle summaries.
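To illustrate what an extractive oracle summary is, the following minimal Python sketch enumerates every sentence subset within a word budget and keeps all subsets that tie for the best ROUGE-N recall. It is not the paper's method: it uses brute-force search in place of the ILP formulation and enumeration algorithm, assumes a single reference summary, and counts n-grams within sentences only. The function names (ngrams, rouge_n_recall, enumerate_oracles) and the toy sentences are hypothetical.

```python
from collections import Counter
from itertools import combinations


def ngrams(tokens, n):
    """Count the n-grams in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_recall(summary_sents, reference_tokens, n=1):
    """Clipped n-gram overlap with the reference, divided by reference n-gram count."""
    ref = ngrams(reference_tokens, n)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    cand = Counter()
    for sent in summary_sents:
        cand.update(ngrams(sent, n))
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / total


def enumerate_oracles(sentences, reference_tokens, max_words, n=1):
    """Return (best score, all index tuples that reach it) by exhaustive search."""
    best, oracles = -1.0, []
    for k in range(1, len(sentences) + 1):
        for combo in combinations(range(len(sentences)), k):
            if sum(len(sentences[i]) for i in combo) > max_words:
                continue  # subset exceeds the length budget
            score = rouge_n_recall([sentences[i] for i in combo], reference_tokens, n)
            if score > best + 1e-12:
                best, oracles = score, [combo]
            elif abs(score - best) <= 1e-12:
                oracles.append(combo)
    return best, oracles


# Toy data (made up for illustration).
sentences = [
    "the cat sat".split(),
    "the cat sat there".split(),
    "on the mat".split(),
    "dogs bark".split(),
]
reference = "the cat sat on the mat".split()

best, oracles = enumerate_oracles(sentences, reference, max_words=7)
print(best, oracles)
```

In this toy case two different sentence subsets reach the same best score, which is the situation that motivates enumerating all oracle summaries instead of reporting a single one. Exhaustive search is exponential in the number of sentences, so it is only practical for very small inputs.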
Large-Margin Determinantal Point Processes
Determinantal point processes (DPPs) offer a powerful approach to modeling
diversity in many applications where the goal is to select a diverse subset. We
study the problem of learning the parameters (the kernel matrix) of a DPP from
labeled training data. We make two contributions. First, we show how to
reparameterize a DPP's kernel matrix with multiple kernel functions, thus
enhancing modeling flexibility. Second, we propose a novel parameter estimation
technique based on the principle of large margin separation. In contrast to the
state-of-the-art method of maximum likelihood estimation, our large-margin loss
function explicitly models errors in selecting the target subsets, and it can
be customized to trade off different types of errors (precision vs. recall).
Extensive empirical studies validate our contributions, including applications
on challenging document and video summarization, where flexibility in modeling
the kernel matrix and balancing different errors is indispensable.
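For context, a DPP in its standard L-ensemble form assigns a subset Y a probability det(L_Y) / det(L + I), where L_Y is the kernel matrix restricted to the rows and columns in Y. The minimal Python sketch below evaluates that formula on a small hand-written kernel to show why similar items are unlikely to be selected together. It does not implement the multiple-kernel reparameterization or the large-margin training described in the abstract; the kernel values and the helper names (det, dpp_prob) are assumptions made for illustration.

```python
from itertools import combinations


def det(matrix):
    """Determinant via Gaussian elimination with partial pivoting."""
    a = [row[:] for row in matrix]
    n = len(a)
    result = 1.0  # the determinant of an empty matrix is 1
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(a[r][c]))
        if abs(a[pivot][c]) < 1e-12:
            return 0.0
        if pivot != c:
            a[c], a[pivot] = a[pivot], a[c]
            result = -result  # a row swap flips the sign
        result *= a[c][c]
        for r in range(c + 1, n):
            factor = a[r][c] / a[c][c]
            for k in range(c, n):
                a[r][k] -= factor * a[c][k]
    return result


def dpp_prob(L, subset):
    """P(Y = subset) = det(L_subset) / det(L + I) for an L-ensemble DPP."""
    n = len(L)
    sub = [[L[i][j] for j in subset] for i in subset]
    l_plus_i = [[L[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
                for i in range(n)]
    return det(sub) / det(l_plus_i)


# Hand-written positive definite kernel (made-up values):
# items 0 and 1 are near-duplicates, item 2 is unrelated to item 0.
L = [
    [1.0, 0.9, 0.0],
    [0.9, 1.0, 0.1],
    [0.0, 0.1, 1.0],
]

p_similar = dpp_prob(L, [0, 1])  # redundant pair
p_diverse = dpp_prob(L, [0, 2])  # diverse pair
# Probabilities over all subsets (including the empty set) sum to 1.
total = sum(dpp_prob(L, list(s)) for k in range(4) for s in combinations(range(3), k))
print(p_similar, p_diverse, total)
```

The diverse pair receives a higher probability than the near-duplicate pair because the determinant shrinks as the selected rows of the kernel become more similar.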
Information overload in structured data
Information overload refers to the difficulty of making decisions caused by too much information. In this dissertation, we address the information overload problem in two separate structured domains, namely graphs and text.
Graph kernels have been proposed as an efficient and theoretically sound approach to compute graph similarity. They decompose graphs into certain sub-structures, such as subtrees or subgraphs. However, existing graph kernels suffer from a few drawbacks. First, the dimension of the feature space associated with the kernel often grows exponentially as the complexity of the sub-structures increases. One immediate consequence of this behavior is that small, non-informative sub-structures occur more frequently and cause information overload. Second, as the number of features increases, we encounter sparsity: only a few informative sub-structures will co-occur in multiple graphs. In the first part of this dissertation, we propose to tackle the above problems by exploiting the dependency relationship among sub-structures. First, we propose a novel framework that learns the latent representations of sub-structures by leveraging recent advancements in deep learning. Second, we propose a general smoothing framework that takes structural similarity into account, inspired by state-of-the-art smoothing techniques used in natural language processing. Both proposed frameworks are applicable to popular graph kernel families, and achieve significant performance improvements over state-of-the-art graph kernels.
In the second part of this dissertation, we tackle information overload in text. We first focus on a popular social news aggregation website, Reddit, and design a submodular recommender system that tailors a personalized front page for individual users. Second, we propose a novel submodular framework to summarize videos for which both transcripts and comments are available. Third, we demonstrate how to apply filtering techniques to select a small subset of informative features from virtual machine logs in order to predict resource usage.
Abstractive Multi-Document Summarization based on Semantic Link Network
The key to realizing advanced document summarization is the semantic representation of documents. This paper investigates the role of the Semantic Link Network in representing and understanding documents for multi-document summarization. It proposes a novel abstractive multi-document summarization framework that first transforms documents into a Semantic Link Network of concepts and events, and then transforms the Semantic Link Network into a summary of the documents by selecting important concepts and events while maintaining semantic coherence. Experiments on benchmark datasets show that the proposed summarization approach significantly outperforms relevant state-of-the-art baselines and that the Semantic Link Network plays an important role in representing and understanding documents.