
    Utterance Topic Model for Generating Coherent Summaries

    Generating short multi-document summaries has received considerable attention recently and is useful in many settings, including summarizing answers to a question in an online scenario such as Yahoo! Answers. This paper attempts to define a new probabilistic topic model that includes the semantic roles of words in the document generation process. Words always carry syntactic and semantic information, and often such information, e.g., the grammatical and semantic role (henceforth GSR) of a word, such as Subject, Verb, Object, Adjective qualifiers, and WordNet and VerbNet role assignments, is carried across adjacent sentences to enhance local coherence in different parts of a document. A statistical topic model like LDA [5] usually models topics as distributions over the word count vocabulary only. We posit that a document could first be topic modeled over a vocabulary of GSR transitions, and then, corresponding to each transition, words and hence sentences can be sampled to best describe the transition. Thus the topics in the proposed model also lend themselves to being distributions over the GSR transitions implicitly. We later show how this basic model can be extended to a model for query-focused summarization where, for a particular query, sentences can be ranked by a product of thematic salience and coherence through GSR transitions. We empirically show that the new topic model has lower test-set perplexity than LDA, and we analyze the performance of our summarization model using ROUGE [13] on the DUC2005 dataset and PYRAMID [17] on the TAC2008 and TAC2009 datasets.
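
The ranking step described in the abstract, ordering sentences by a product of thematic salience and coherence, can be illustrated with a minimal sketch. The abstract does not say how either score is computed, so the `ScoredSentence` type, the `rank_sentences` helper, and all numeric values below are hypothetical placeholders rather than the paper's method; only the "rank by product" idea comes from the text.

```python
from typing import List, NamedTuple


class ScoredSentence(NamedTuple):
    """A candidate sentence with two precomputed scores (both hypothetical)."""
    text: str
    salience: float   # assumed: thematic salience with respect to the query
    coherence: float  # assumed: coherence score derived from GSR transitions


def rank_sentences(sentences: List[ScoredSentence]) -> List[ScoredSentence]:
    """Return sentences sorted by salience * coherence, highest product first."""
    return sorted(sentences, key=lambda s: s.salience * s.coherence, reverse=True)


# Made-up scores, for illustration only.
candidates = [
    ScoredSentence("A", salience=0.9, coherence=0.2),  # product 0.18
    ScoredSentence("B", salience=0.6, coherence=0.7),  # product 0.42
    ScoredSentence("C", salience=0.3, coherence=0.9),  # product 0.27
]

ranked = rank_sentences(candidates)
for s in ranked:
    print(s.text, round(s.salience * s.coherence, 2))
```

Using a product rather than a sum means a sentence must score reasonably on both criteria to rank highly: sentence A above is the most salient but ranks last because its coherence score is low.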