391 research outputs found

    Induction of Word and Phrase Alignments for Automatic Document Summarization

    Full text link
    Current research in automatic single document summarization is dominated by two effective, yet naive approaches: summarization by sentence extraction, and headline generation via bag-of-words models. While successful in some tasks, neither of these models is able to adequately capture the large set of linguistic devices utilized by humans when they produce summaries. One possible explanation for the widespread use of these models is that good techniques have been developed to extract appropriate training data for them from existing document/abstract and document/headline corpora. We believe that future progress in automatic summarization will be driven both by the development of more sophisticated, linguistically informed models, as well as a more effective leveraging of document/abstract corpora. In order to open the doors to simultaneously achieving both of these goals, we have developed techniques for automatically producing word-to-word and phrase-to-phrase alignments between documents and their human-written abstracts. These alignments make explicit the correspondences that exist in such document/abstract pairs, and create a potentially rich data source from which complex summarization algorithms may learn. This paper describes experiments we have carried out to analyze the ability of humans to perform such alignments, and based on these analyses, we describe experiments for creating them automatically. Our model for the alignment task is based on an extension of the standard hidden Markov model, and learns to create alignments in a completely unsupervised fashion. We describe our model in detail and present experimental results that show that our model is able to learn to reliably identify word- and phrase-level alignments in a corpus of pairs

    New Alignment Methods for Discriminative Book Summarization

    Full text link
    We consider the unsupervised alignment of the full text of a book with a human-written summary. This presents challenges not seen in other text alignment problems, including a disparity in length and, consequent to this, a violation of the expectation that individual words and phrases should align, since large passages and chapters can be distilled into a single summary phrase. We present two new methods, based on hidden Markov models, specifically targeted to this problem, and demonstrate gains on an extractive book summarization task. While there is still much room for improvement, unsupervised alignment holds intrinsic value in offering insight into what features of a book are deemed worthy of summarization.Comment: This paper reflects work in progres

    Translation Alignment Applied to Historical Languages: methods, evaluation, applications, and visualization

    Get PDF
    Translation alignment is an essential task in Digital Humanities and Natural Language Processing, and it aims to link words/phrases in the source text with their translation equivalents in the translation. In addition to its importance in teaching and learning historical languages, translation alignment builds bridges between ancient and modern languages through which various linguistics annotations can be transferred. This thesis focuses on word-level translation alignment applied to historical languages in general and Ancient Greek and Latin in particular. As the title indicates, the thesis addresses four interdisciplinary aspects of translation alignment. The starting point was developing Ugarit, an interactive annotation tool to perform manual alignment aiming to gather training data to train an automatic alignment model. This effort resulted in more than 190k accurate translation pairs that I used for supervised training later. Ugarit has been used by many researchers and scholars also in the classroom at several institutions for teaching and learning ancient languages, which resulted in a large, diverse crowd-sourced aligned parallel corpus allowing us to conduct experiments and qualitative analysis to detect recurring patterns in annotators’ alignment practice and the generated translation pairs. Further, I employed the recent advances in NLP and language modeling to develop an automatic alignment model for historical low-resourced languages, experimenting with various training objectives and proposing a training strategy for historical languages that combines supervised and unsupervised training with mono- and multilingual texts. Then, I integrated this alignment model into other development workflows to project cross-lingual annotations and induce bilingual dictionaries from parallel corpora. Evaluation is essential to assess the quality of any model. To ensure employing the best practice, I reviewed the current evaluation procedure, defined its limitations, and proposed two new evaluation metrics. Moreover, I introduced a visual analytics framework to explore and inspect alignment gold standard datasets and support quantitative and qualitative evaluation of translation alignment models. Besides, I designed and implemented visual analytics tools and reading environments for parallel texts and proposed various visualization approaches to support different alignment-related tasks employing the latest advances in information visualization and best practice. Overall, this thesis presents a comprehensive study that includes manual and automatic alignment techniques, evaluation methods and visual analytics tools that aim to advance the field of translation alignment for historical languages

    Automatic analysis of semantic similarity in comparable text through syntactic tree matching

    Get PDF

    Explorations in Sentence Fusion

    Get PDF

    Analysis of translational correspondence in view of sub-sentential alignment

    Get PDF
    This paper reports on the first results of an empirical study of translational corre-spondence in different text types for the English-Dutch language pair. A Gold Standard was created, which can be used as a standard data set for evaluating sub-sentential alignment. The manually indi-cated translational correspondences were analyzed in view of different heuristics used in existing sub-sentential alignment modules.
    • …
    corecore