Supervised Topical Key Phrase Extraction of News Stories using Crowdsourcing, Light Filtering and Co-reference Normalization
Fast and effective automated indexing is critical for search and personalized
services. Key phrases that consist of one or more words and represent the main
concepts of the document are often used for the purpose of indexing. In this
paper, we investigate the use of additional semantic features and
pre-processing steps to improve automatic key phrase extraction. These features
include the use of signal words and Freebase categories. Some of these
features lead to significant improvements in the accuracy of the results. We
also experimented with two forms of document pre-processing that we call light
filtering and co-reference normalization. Light filtering removes sentences
that are judged peripheral to the document's main content.
Co-reference normalization unifies several written forms of the same named
entity into a unique form. We also needed a "Gold Standard" - a set of labeled
documents for training and evaluation. While the subjective nature of key
phrase selection precludes a true "Gold Standard", we used Amazon's Mechanical
Turk service to obtain a useful approximation. Our data indicates that the
biggest improvements in performance were due to shallow semantic features, news
categories, and rhetorical signals (nDCG 78.47% vs. 68.93%). The inclusion of
deeper semantic features such as Freebase sub-categories was not beneficial by
itself, but in combination with pre-processing, did cause slight improvements
in the nDCG scores.
Comment: In 8th International Conference on Language Resources and Evaluation (LREC 2012).
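The nDCG figures quoted above (78.47% vs. 68.93%) can be reproduced with the standard definition of the metric. The sketch below is a generic implementation of nDCG, not the authors' evaluation code, and the relevance grades in the example are hypothetical.

```python
import math

def dcg(relevances):
    """Discounted cumulative gain: graded relevance discounted by
    log2 of the 1-indexed rank (enumerate is 0-indexed, hence +2)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))

def ndcg(ranked_relevances):
    """DCG of the system ranking, normalized by the ideal (sorted) ranking."""
    ideal = dcg(sorted(ranked_relevances, reverse=True))
    return dcg(ranked_relevances) / ideal if ideal > 0 else 0.0

# Hypothetical graded judgments for five extracted key phrases, in ranked order.
score = ndcg([3, 2, 3, 0, 1])
```

A perfect ranking yields nDCG = 1.0; any misordering of graded items lowers the score, which is why the metric suits the graded, subjective judgments collected via Mechanical Turk.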
A Model for Personalized Keyword Extraction from Web Pages using Segmentation
The World Wide Web caters to the needs of billions of users in heterogeneous
groups. Each user accessing the web has his or her own specific interests and
expects it to respond to those requirements. Making the web react in a
customized manner is achieved through personalization. This paper proposes a
novel model for
extracting keywords from a web page with personalization incorporated into it.
The keyword extraction problem is approached through web page segmentation,
which simplifies the problem and allows it to be solved effectively. The
proposed model is implemented as a prototype, and experiments conducted on it
empirically validate the model's efficiency.
Comment: 6 pages, 2 figures
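The idea of combining segmentation with a user profile can be illustrated with a minimal sketch. This is not the paper's model: the segmentation (splitting on blank lines), the frequency scoring, and the interest-boost factor are all simplifying assumptions made for illustration.

```python
from collections import Counter
import re

def extract_keywords(page_text, user_interests, top_k=5):
    """Illustrative sketch: segment the page, score terms per segment by
    relative frequency, and boost terms matching the user's interests."""
    segments = [s for s in page_text.split("\n\n") if s.strip()]
    scores = Counter()
    for segment in segments:
        terms = re.findall(r"[a-z]+", segment.lower())
        for term, count in Counter(terms).items():
            # Hypothetical personalization: interests double a term's weight.
            weight = 2.0 if term in user_interests else 1.0
            scores[term] += weight * count / len(terms)
    return [t for t, _ in scores.most_common(top_k)]
```

Scoring per segment rather than over the whole page is what segmentation buys: a term dominating one coherent block is not diluted by the rest of the page.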
Key Phrase Extraction of Lightly Filtered Broadcast News
This paper explores the impact of light filtering on automatic key phrase
extraction (AKE) applied to Broadcast News (BN). Key phrases are words and
expressions that best characterize the content of a document. Key phrases are
often used to index the document or as features in further processing. This
makes improvements in AKE accuracy particularly important. We hypothesized that
filtering out marginally relevant sentences from a document would improve AKE
accuracy. Our experiments confirmed this hypothesis: eliminating as little as
10% of the document sentences led to a 2% improvement in AKE precision and
recall. Our AKE system is built on the MAUI toolkit, which follows a supervised
learning approach.
approach. We trained and tested our AKE method on a gold standard made of 8 BN
programs containing 110 manually annotated news stories. The experiments were
conducted within a Multimedia Monitoring Solution (MMS) system for TV and radio
news/programs, running daily and monitoring 12 TV and 4 radio channels.
Comment: In 15th International Conference on Text, Speech and Dialogue (TSD 2012).
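One simple way to realize "filtering out marginally relevant sentences" is to rank each sentence by its similarity to the document as a whole and drop the least central fraction. The sketch below uses bag-of-words cosine similarity; the paper does not specify this exact criterion, so treat it as an assumption.

```python
from collections import Counter
import math

def light_filter(sentences, drop_fraction=0.10):
    """Drop the fraction of sentences least similar to the whole-document
    term distribution (a sketch of light filtering, not the paper's method)."""
    def vec(text):
        return Counter(text.lower().split())

    def cosine(a, b):
        dot = sum(a[t] * b[t] for t in a)
        na = math.sqrt(sum(v * v for v in a.values()))
        nb = math.sqrt(sum(v * v for v in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    doc_vec = vec(" ".join(sentences))
    ranked = sorted(sentences, key=lambda s: cosine(vec(s), doc_vec))
    dropped = set(ranked[:int(len(sentences) * drop_fraction)])
    # Preserve the original sentence order of the survivors.
    return [s for s in sentences if s not in dropped]
```

With `drop_fraction=0.10` this mirrors the 10% elimination rate reported above; the surviving sentences are then passed to the key phrase extractor unchanged.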
Optimal Information Retrieval with Complex Utility Functions
Existing retrieval models all attempt to optimize a single utility function, often based on the topical relevance of a document with respect to a query. In real applications, retrieval involves more complex utility functions that may encode preferences along several different dimensions. In this paper, we present a general optimization framework for retrieval with complex utility functions. A query language is designed according to this framework to enable users to submit complex queries. We propose an efficient retrieval algorithm for complex utility functions based on the Apriori algorithm. As a case study, we apply our algorithm to a complex utility retrieval problem in distributed IR. Experimental results show that our algorithm allows for a flexible tradeoff between multiple retrieval criteria. Finally, we study the efficiency of our algorithm on simulated data.
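The simplest instance of a multi-dimensional utility is a weighted linear combination of per-dimension scores. The sketch below is only a stand-in for the paper's more general framework; the dimension names (`relevance`, `freshness`) and the linear form are illustrative assumptions.

```python
def combined_utility(doc, weights):
    """Linear combination of per-dimension utilities for one document.
    `doc` maps dimension names to scores; `weights` maps them to weights."""
    return sum(weights[dim] * doc[dim] for dim in weights)

def rank(docs, weights, k=10):
    """Return the top-k documents under the combined utility."""
    return sorted(docs, key=lambda d: combined_utility(d, weights), reverse=True)[:k]
```

Varying the weights trades one criterion off against another, which is the flexible tradeoff the abstract refers to: a recency-heavy weighting surfaces fresh documents, a relevance-heavy one reverts to classical topical ranking.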
The Most Influential Paper Gerard Salton Never Wrote
Gerard Salton is often credited with developing the vector space model
(VSM) for information retrieval (IR). Citations to Salton give the impression
that the VSM must have been articulated as an IR model sometime between
1970 and 1975. However, the VSM as it is understood today evolved over a
longer time period than is usually acknowledged, and an articulation of the
model and its assumptions did not appear in print until several years after
those assumptions had been criticized and alternative models proposed. An
often-cited overview paper titled "A Vector Space Model for Information
Retrieval" (alleged to have been published in 1975) does not exist, and
citations to it represent a confusion of two 1975 articles, neither of which
was an overview of the VSM as a model of information retrieval. Until the
late 1970s, Salton did not present vector spaces as models of IR generally
but rather as models of specific computations. Citations to the phantom
paper reflect an apparently widely held misconception that the operational
features and explanatory devices now associated with the VSM must have
been introduced at the same time it was first proposed as an IR model.
A Formal Description of the Structural-Parametric Characteristics of Technical Text
Structural-parametric characteristics of text documents are proposed, together with their justification and their correspondence to the tasks of formally evaluating the blocks of a technical document according to their level in the hierarchy. It is shown that, provided the requirements of a narrowly specialized application are met, these characteristics make it possible to obtain estimates of semantic proximity and comparability, and a measure of semantic correspondence, for structural units of technical natural language.
Diversification Based Static Index Pruning - Application to Temporal Collections
Nowadays, web archives preserve the history of large portions of the web. As
media shift from printed to digital editions, accessing these huge information
sources is drawing increasingly more attention from national and international
institutions, as well as from the research community. These collections are
intrinsically big, leading to index files that do not fit into memory and to
increased query response times. Decreasing the index size is a direct way to
decrease this query response time.
Static index pruning methods reduce the size of indexes by removing a part of
the postings. In the context of web archives, it is necessary to remove
postings while preserving the temporal diversity of the archive. None of the
existing pruning approaches takes (temporal) diversification into account.
In this paper, we propose a diversification-based static index pruning
method. It differs from the existing pruning approaches by integrating
diversification within the pruning process. We aim to prune the index while
preserving retrieval effectiveness and diversity, by maximizing a given IR
evaluation metric such as DCG. We show how to apply this approach in the
context of web archives. Finally, we show on two collections that search
effectiveness in temporal collections after pruning can be improved using our
approach rather than diversity-oblivious approaches.
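The core tension in the abstract (prune by score, but keep temporal diversity) can be sketched as a two-pass posting-list pruner: first retain the best posting from each time bucket, then fill remaining slots with the highest-scoring leftovers. This is an illustrative heuristic, not the paper's DCG-based optimization; the bucket granularity (e.g. year) is an assumption.

```python
def prune_postings(postings, keep, bucket_of):
    """Prune one term's posting list to `keep` entries while retaining at
    least one posting per time bucket when possible.

    postings:  list of (doc_id, score) pairs.
    bucket_of: maps a doc_id to a temporal bucket (e.g. a year).
    """
    by_score = sorted(postings, key=lambda p: p[1], reverse=True)
    kept, seen_buckets = [], set()
    # Pass 1: the best posting from each bucket, for temporal diversity.
    for doc_id, score in by_score:
        b = bucket_of(doc_id)
        if b not in seen_buckets:
            kept.append((doc_id, score))
            seen_buckets.add(b)
    # Pass 2: fill remaining slots with the highest-scoring leftovers.
    for p in by_score:
        if len(kept) >= keep:
            break
        if p not in kept:
            kept.append(p)
    return kept[:keep]
```

A purely score-based pruner would keep only the top-scoring postings, potentially erasing an entire archive epoch from a term's posting list; the bucket pass prevents exactly that.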
A Flexible Solution for the Pre-processing Step in Text Mining
Text mining is an R&D area whose goal is the search for patterns, trends, and regularities in textual documents. The first and most important step of the text mining process comprises a set of procedures for reading the document collection and for identifying and selecting the statistically most significant attributes to represent the collection as an attribute-value matrix. The goal of this work is to present a flexible and extensible solution for the pre-processing task in text mining, capable of meeting the needs of different research projects on this topic.
Comment: CIIC 2012. No 12611
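The attribute-value matrix that this pre-processing step produces can be sketched in a few lines: rows are documents, columns are the terms kept by a statistical threshold, and cells hold term counts. This is a minimal illustration, assuming a document-frequency cutoff as the selection criterion; the actual solution's selection statistics are not specified here.

```python
from collections import Counter

def attribute_value_matrix(documents, min_df=1):
    """Build a term-document attribute-value matrix: one row per document,
    one column per term whose document frequency is at least `min_df`."""
    tokenized = [doc.lower().split() for doc in documents]
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vocab = sorted(t for t, n in df.items() if n >= min_df)
    matrix = [[Counter(tokens)[t] for t in vocab] for tokens in tokenized]
    return vocab, matrix
```

Raising `min_df` is the "statistical significance" filter in its simplest form: rare terms are dropped before the matrix is handed to the mining algorithms.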