
8 research outputs found

Supervised Topical Key Phrase Extraction of News Stories using Crowdsourcing, Light Filtering and Co-reference Normalization

Author: Carbonell Jaime
Frederking Robert
Gershman Anatole
Marujo Luis
Neto João P.
Publication date: 20/06/2013

Fast and effective automated indexing is critical for search and personalized services. Key phrases that consist of one or more words and represent the main concepts of the document are often used for indexing. In this paper, we investigate the use of additional semantic features and pre-processing steps to improve automatic key phrase extraction. These features include signal words and Freebase categories. Some of these features lead to significant improvements in the accuracy of the results. We also experimented with two forms of document pre-processing that we call light filtering and co-reference normalization. Light filtering removes sentences that are judged peripheral to the document's main content. Co-reference normalization unifies several written forms of the same named entity into a unique form. We also needed a "Gold Standard", a set of labeled documents for training and evaluation. While the subjective nature of key phrase selection precludes a true "Gold Standard", we used Amazon's Mechanical Turk service to obtain a useful approximation. Our data indicates that the biggest improvements in performance were due to shallow semantic features, news categories, and rhetorical signals (nDCG 78.47% vs. 68.93%). The inclusion of deeper semantic features such as Freebase sub-categories was not beneficial by itself, but in combination with pre-processing it did cause slight improvements in the nDCG scores. Comment: In 8th International Conference on Language Resources and Evaluation (LREC 2012)
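The nDCG figures quoted in this abstract compare a ranked list of extracted key phrases against graded relevance labels. The abstract does not state which gain or discount function the authors used, so the following is only a minimal sketch of the common textbook form (linear gain, log2 rank discount). The function names `dcg` and `ndcg` and the example relevance grades are assumptions for illustration, not taken from the paper.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain with linear gain and a log2 rank discount.

    Assumption: this is the common textbook variant; the paper's exact
    formulation is not given in the abstract.
    """
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances, start=1))


def ndcg(ranked_relevances):
    """Normalize DCG by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(ranked_relevances, reverse=True))
    return dcg(ranked_relevances) / ideal if ideal > 0 else 0.0


# Hypothetical relevance grades for five extracted key phrases, in the
# order the extractor ranked them (e.g. how many annotators chose each).
print(ndcg([3, 0, 2, 1, 0]))
```

A score of 1.0 means the extractor ranked the phrases in the same order as the ideal ordering of the labels; lower scores indicate that highly rated phrases were placed further down the list.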

arXiv.org e-Print Archive

CiteSeerX

Key Phrase Extraction of Lightly Filtered Broadcast News

Author: Carbonell Jaime
de Matos David Martins
Gershman Anatole
Marujo Luis
Neto João P.
Ribeiro Ricardo
Publication date: 01/01/2012

This paper explores the impact of light filtering on automatic key phrase extraction (AKE) applied to Broadcast News (BN). Key phrases are words and expressions that best characterize the content of a document. Key phrases are often used to index the document or as features in further processing. This makes improvements in AKE accuracy particularly important. We hypothesized that filtering out marginally relevant sentences from a document would improve AKE accuracy. Our experiments confirmed this hypothesis. Elimination of as little as 10% of the document sentences led to a 2% improvement in AKE precision and recall. AKE is built on the MAUI toolkit, which follows a supervised learning approach. We trained and tested our AKE method on a gold standard made of 8 BN programs containing 110 manually annotated news stories. The experiments were conducted within a Multimedia Monitoring Solution (MMS) system for TV and radio news/programs, running daily, and monitoring 12 TV and 4 radio channels. Comment: In 15th International Conference on Text, Speech and Dialogue (TSD 2012)

arXiv.org e-Print Archive

Crossref

Repositório Institucional do ISCTE-IUL

Summarization of Films and Documentaries Based on Subtitles and Scripts

Author: Aparício Marta
de Matos David Martins
Figueiredo Paulo
Marujo Luís
Raposo Francisco
Ribeiro Ricardo
Publication venue: Elsevier BV
Publication date: 01/01/2016

We assess the performance of generic text summarization algorithms applied to films and documentaries, using the well-known behavior of summarization of news articles as reference. We use three datasets: (i) news articles, (ii) film scripts and subtitles, and (iii) documentary subtitles. Standard ROUGE metrics are used for comparing generated summaries against news abstracts, plot summaries, and synopses. We show that the best performing algorithms are LSA, for news articles and documentaries, and LexRank and Support Sets, for films. Despite the different nature of films and documentaries, their relative behavior is in accordance with that obtained for news articles. Comment: 7 pages, 9 tables, 4 figures, submitted to Pattern Recognition Letters (Elsevier)
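ROUGE metrics, as mentioned in this abstract, score a generated summary by its n-gram overlap with a reference text. The abstract does not say which ROUGE variants or preprocessing were used, so the following is only a minimal sketch of ROUGE-1 (unigram overlap) with naive lowercase whitespace tokenization. The function name `rouge_1` and the example strings are assumptions for illustration.

```python
from collections import Counter


def rouge_1(candidate, reference):
    """Return (precision, recall, f1) for clipped unigram overlap.

    Assumption: tokens are lowercase whitespace-separated words; real
    ROUGE toolkits typically add stemming and other normalization.
    """
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((cand & ref).values())
    precision = overlap / sum(cand.values()) if cand else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical generated summary scored against a hypothetical synopsis.
print(rouge_1("the cat sat", "the cat sat on the mat"))
```

Recall measures how much of the reference the summary covers, and precision measures how much of the summary appears in the reference.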

arXiv.org e-Print Archive

Repositório Institucional do ISCTE-IUL

A New Unsupervised Technique to Analyze the Centroid and Frequency of Keyphrases from Academic Articles

Author: A. S. M. Sanwar Hosen
Miah Mohammad Badrul Alam
Ra In-Ho
Rahman Md Mustafizur
Suryanti Awang
Publication venue: MDPI AG
Publication date: 01/01/2022

Automated keyphrase extraction is crucial for extracting and summarizing relevant information from a variety of publications in multiple domains. However, extracting good-quality keyphrases and summarizing information to a good standard have become extremely challenging in recent research because of the advancement of technology and the exponential growth of digital sources and textual information. Because of this, the use of keyphrase features in keyphrase extraction techniques has recently gained tremendous popularity. This paper proposes a new unsupervised region-based keyphrase centroid and frequency analysis technique, named the KCFA technique, for keyphrase extraction as a feature. Data/dataset collection, data pre-processing, statistical methodologies, curve plotting analysis, and curve fitting are the five main processes in the proposed technique. To begin, the technique collects multiple datasets from diverse sources, which are then passed to the data pre-processing step, where text pre-processing is applied. Afterward, the region-based statistical methodologies receive the pre-processed data, followed by the curve plotting analysis and, lastly, the curve fitting technique. The proposed technique is then tested and evaluated using ten (10) best-accessible benchmark datasets from various disciplines. The proposed approach is then compared to our available methods to demonstrate its efficacy, advantages, and importance. Lastly, the results of the experiment show that the proposed method works well to analyze the centroid and frequency of keyphrases from academic articles. It provides a centroid of 706.66 and a frequency of 38.95% in the first region, 2454.21 and 7.98% in the second region, for a total frequency of 68.11

Multidisciplinary Digital Publishing Institute

Directory of Open Access Journals

UMP Institutional Repository

Proceedings of the EACL Hackashop on News Media Content Analysis and Automated Report Generation

Publication venue: The Association for Computational Linguistics
Publication date: 19/04/2021

Peer reviewed

Helsingin yliopiston digitaalinen arkisto

Recent Advances in Social Data and Artificial Intelligence 2019

Publication venue: MDPI AG
Publication date: 12/08/2022

The importance and usefulness of subjects and topics involving social data and artificial intelligence are becoming widely recognized. This book contains invited review, expository, and original research articles dealing with, and presenting state-of-the-art accounts of, the recent advances in the subjects of social data and artificial intelligence, and potentially their links to Cyberspace

Directory of Open Access Books (DOAB)