173 research outputs found
Supervised Topical Key Phrase Extraction of News Stories using Crowdsourcing, Light Filtering and Co-reference Normalization
Fast and effective automated indexing is critical for search and personalized
services. Key phrases that consist of one or more words and represent the main
concepts of the document are often used for the purpose of indexing. In this
paper, we investigate the use of additional semantic features and
pre-processing steps to improve automatic key phrase extraction. These features
include the use of signal words and Freebase categories. Some of these features
lead to significant improvements in the accuracy of the results. We also
experimented with two forms of document pre-processing that we call light
filtering and co-reference normalization. Light filtering removes sentences
that are judged peripheral to the document's main content.
Co-reference normalization unifies several written forms of the same named
entity into a unique form. We also needed a "Gold Standard" - a set of labeled
documents for training and evaluation. While the subjective nature of key
phrase selection precludes a true "Gold Standard", we used Amazon's Mechanical
Turk service to obtain a useful approximation. Our data indicates that the
biggest improvements in performance were due to shallow semantic features, news
categories, and rhetorical signals (nDCG 78.47% vs. 68.93%). The inclusion of
deeper semantic features such as Freebase sub-categories was not beneficial by
itself, but in combination with pre-processing, did cause slight improvements
in the nDCG scores.
Comment: In 8th International Conference on Language Resources and Evaluation (LREC 2012).
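The abstract reports its gains as nDCG scores (78.47% vs. 68.93%). As a point of reference, the sketch below shows one common way to compute nDCG for a ranked list of extracted key phrases against graded relevance judgments; the candidate phrases and relevance grades are made-up illustrations, not data from the paper.

```python
import math

def dcg(relevances):
    """Discounted cumulative gain of a ranked list of relevance grades."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))

def ndcg(ranked_phrases, relevance, k=10):
    """nDCG@k: DCG of the system ranking divided by the DCG of the ideal ranking."""
    gains = [relevance.get(p, 0.0) for p in ranked_phrases[:k]]
    ideal = sorted(relevance.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0

# Hypothetical example: crowd-sourced relevance grades for candidate key phrases.
relevance = {"key phrase extraction": 3, "crowdsourcing": 2, "news stories": 2, "Mechanical Turk": 1}
system_ranking = ["key phrase extraction", "Mechanical Turk", "news stories", "light filtering"]
print(f"nDCG@10 = {ndcg(system_ranking, relevance):.4f}")
```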
Supervised keyphrase extraction as positive unlabeled learning
This paper shows that the performance of trained keyphrase extractors approximates that of a classifier trained on articles labeled by multiple annotators, leading to higher average F1 scores and better rankings of keyphrases.
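The abstract frames supervised keyphrase extraction as positive-unlabeled (PU) learning: annotated keyphrases are reliable positives, while unassigned candidate phrases are unlabeled rather than true negatives. Below is a minimal sketch of one common PU strategy (train with unlabeled candidates as noisy negatives, then re-weight them by the classifier's own confidence); the features and data are hypothetical placeholders, not the paper's actual model.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical candidate-phrase features (e.g. tf-idf, position, phrase length)
# and PU labels: 1 = annotated keyphrase (positive), 0 = unlabeled candidate.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + 0.5 * X[:, 1] > 0.8).astype(int)   # stand-in for "true" keyphrases
observed = y * (rng.random(200) < 0.4)            # only some positives are labeled

# Step 1: treat unlabeled candidates as negatives and fit an initial classifier.
clf = LogisticRegression().fit(X, observed)

# Step 2: down-weight unlabeled examples the classifier already scores as
# keyphrase-like, so likely positives stop being penalized as negatives.
p = clf.predict_proba(X)[:, 1]
weights = np.where(observed == 1, 1.0, 1.0 - p)
clf_pu = LogisticRegression().fit(X, observed, sample_weight=weights)

# Rank all candidates by the re-weighted model's keyphrase probability.
ranking = np.argsort(-clf_pu.predict_proba(X)[:, 1])
print("top candidate indices:", ranking[:10])
```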
Summarization of Films and Documentaries Based on Subtitles and Scripts
We assess the performance of generic text summarization algorithms applied to
films and documentaries, using the well-known behavior of summarization of news
articles as reference. We use three datasets: (i) news articles, (ii) film
scripts and subtitles, and (iii) documentary subtitles. Standard ROUGE metrics
are used for comparing generated summaries against news abstracts, plot
summaries, and synopses. We show that the best performing algorithms are LSA,
for news articles and documentaries, and LexRank and Support Sets, for films.
Despite the different nature of films and documentaries, their relative
behavior is in accordance with that obtained for news articles.
Comment: 7 pages, 9 tables, 4 figures, submitted to Pattern Recognition Letters (Elsevier).
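The comparison above relies on ROUGE scores between generated summaries and references (news abstracts, plot summaries, synopses). As a rough illustration of what those metrics measure, here is a minimal ROUGE-1 sketch (unigram recall, precision, and F1) assuming simple whitespace tokenization; the actual evaluation would typically use a full ROUGE toolkit with stemming and multiple references.

```python
from collections import Counter

def rouge_1(candidate: str, reference: str):
    """ROUGE-1: unigram overlap between a candidate summary and one reference."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())          # clipped unigram matches
    recall = overlap / max(sum(ref.values()), 1)
    precision = overlap / max(sum(cand.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return {"recall": recall, "precision": precision, "f1": f1}

# Hypothetical example: an extracted film-script summary vs. a short synopsis.
print(rouge_1("the detective follows the suspect through the city",
              "a detective chases a suspect across the city at night"))
```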
- …