Supervised Topical Key Phrase Extraction of News Stories using Crowdsourcing, Light Filtering and Co-reference Normalization
Fast and effective automated indexing is critical for search and personalized
services. Key phrases that consist of one or more words and represent the main
concepts of the document are often used for the purpose of indexing. In this
paper, we investigate the use of additional semantic features and
pre-processing steps to improve automatic key phrase extraction. These features
include the use of signal words and Freebase categories. Some of these
features lead to significant improvements in the accuracy of the results. We
also experimented with two forms of document pre-processing that we call light
filtering and co-reference normalization. Light filtering removes sentences
that are judged peripheral to the document's main content.
Co-reference normalization unifies several written forms of the same named
entity into a unique form. We also needed a "Gold Standard" - a set of labeled
documents for training and evaluation. While the subjective nature of key
phrase selection precludes a true "Gold Standard", we used Amazon's Mechanical
Turk service to obtain a useful approximation. Our data indicates that the
biggest improvements in performance were due to shallow semantic features, news
categories, and rhetorical signals (nDCG 78.47% vs. 68.93%). The inclusion of
deeper semantic features such as Freebase sub-categories was not beneficial by
itself, but in combination with pre-processing, did cause slight improvements
in the nDCG scores.
Comment: In 8th International Conference on Language Resources and Evaluation (LREC 2012).
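The nDCG figures quoted above (78.47% vs. 68.93%) can be reproduced with the standard definition of the metric. The sketch below is a generic implementation of nDCG, not the authors' evaluation code, and the relevance grades in the example are hypothetical.

```python
import math

def dcg(relevances):
    """Discounted cumulative gain: graded relevance discounted by
    log2 of the 1-indexed rank (enumerate is 0-indexed, hence +2)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))

def ndcg(ranked_relevances):
    """DCG of the system ranking, normalized by the ideal (sorted) ranking."""
    ideal = dcg(sorted(ranked_relevances, reverse=True))
    return dcg(ranked_relevances) / ideal if ideal > 0 else 0.0

# Hypothetical graded judgments for five extracted key phrases, in ranked order.
score = ndcg([3, 2, 3, 0, 1])
```

A perfect ranking yields nDCG = 1.0; any misordering of graded items lowers the score, which is why the metric suits the graded, subjective judgments collected via Mechanical Turk.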
A Model for Personalized Keyword Extraction from Web Pages using Segmentation
The World Wide Web caters to the needs of billions of users in heterogeneous
groups. Each user accessing the web has his or her own specific interests and
expects it to respond to those requirements. Making the web react in a
customized manner is achieved through personalization. This paper proposes a
novel model for
extracting keywords from a web page with personalization incorporated into it.
The keyword extraction problem is approached through web page segmentation,
which simplifies the problem and allows it to be solved effectively. The
proposed model is implemented as a prototype, and experiments conducted on it
empirically validate the model's efficiency.
Comment: 6 pages, 2 figures
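The idea of combining segmentation with a user profile can be illustrated with a minimal sketch. This is not the paper's model: the segmentation (splitting on blank lines), the frequency scoring, and the interest-boost factor are all simplifying assumptions made for illustration.

```python
from collections import Counter
import re

def extract_keywords(page_text, user_interests, top_k=5):
    """Illustrative sketch: segment the page, score terms per segment by
    relative frequency, and boost terms matching the user's interests."""
    segments = [s for s in page_text.split("\n\n") if s.strip()]
    scores = Counter()
    for segment in segments:
        terms = re.findall(r"[a-z]+", segment.lower())
        for term, count in Counter(terms).items():
            # Hypothetical personalization: interests double a term's weight.
            weight = 2.0 if term in user_interests else 1.0
            scores[term] += weight * count / len(terms)
    return [t for t, _ in scores.most_common(top_k)]
```

Scoring per segment rather than over the whole page is what segmentation buys: a term dominating one coherent block is not diluted by the rest of the page.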
Key Phrase Extraction of Lightly Filtered Broadcast News
This paper explores the impact of light filtering on automatic key phrase
extraction (AKE) applied to Broadcast News (BN). Key phrases are words and
expressions that best characterize the content of a document. Key phrases are
often used to index the document or as features in further processing. This
makes improvements in AKE accuracy particularly important. We hypothesized that
filtering out marginally relevant sentences from a document would improve AKE
accuracy. Our experiments confirmed this hypothesis: eliminating as little as
10% of the document sentences led to a 2% improvement in AKE precision and
recall. Our AKE system is built on the MAUI toolkit, which follows a supervised
learning approach.
approach. We trained and tested our AKE method on a gold standard made of 8 BN
programs containing 110 manually annotated news stories. The experiments were
conducted within a Multimedia Monitoring Solution (MMS) system for TV and radio
news/programs, running daily and monitoring 12 TV and 4 radio channels.
Comment: In 15th International Conference on Text, Speech and Dialogue (TSD 2012).
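One simple way to realize "filtering out marginally relevant sentences" is to rank each sentence by its similarity to the document as a whole and drop the least central fraction. The sketch below uses bag-of-words cosine similarity; the paper does not specify this exact criterion, so treat it as an assumption.

```python
from collections import Counter
import math

def light_filter(sentences, drop_fraction=0.10):
    """Drop the fraction of sentences least similar to the whole-document
    term distribution (a sketch of light filtering, not the paper's method)."""
    def vec(text):
        return Counter(text.lower().split())

    def cosine(a, b):
        dot = sum(a[t] * b[t] for t in a)
        na = math.sqrt(sum(v * v for v in a.values()))
        nb = math.sqrt(sum(v * v for v in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    doc_vec = vec(" ".join(sentences))
    ranked = sorted(sentences, key=lambda s: cosine(vec(s), doc_vec))
    dropped = set(ranked[:int(len(sentences) * drop_fraction)])
    # Preserve the original sentence order of the survivors.
    return [s for s in sentences if s not in dropped]
```

With `drop_fraction=0.10` this mirrors the 10% elimination rate reported above; the surviving sentences are then passed to the key phrase extractor unchanged.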
Optimal Information Retrieval with Complex Utility Functions
Existing retrieval models all attempt to optimize a single utility function, often based on the topical relevance of a document with respect to a query. In real applications, retrieval involves more complex utility functions that may encode preferences along several different dimensions. In this paper, we present a general optimization framework for retrieval with complex utility functions. A query language is designed according to this framework to enable users to submit complex queries. We propose an efficient retrieval algorithm for complex utility functions based on the Apriori algorithm. As a case study, we apply our algorithm to a complex utility retrieval problem in distributed IR. Experimental results show that our algorithm allows for a flexible tradeoff between multiple retrieval criteria. Finally, we study the efficiency of our algorithm on simulated data.
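The simplest instance of a multi-dimensional utility is a weighted linear combination of per-dimension scores. The sketch below is only a stand-in for the paper's more general framework; the dimension names (`relevance`, `freshness`) and the linear form are illustrative assumptions.

```python
def combined_utility(doc, weights):
    """Linear combination of per-dimension utilities for one document.
    `doc` maps dimension names to scores; `weights` maps them to weights."""
    return sum(weights[dim] * doc[dim] for dim in weights)

def rank(docs, weights, k=10):
    """Return the top-k documents under the combined utility."""
    return sorted(docs, key=lambda d: combined_utility(d, weights), reverse=True)[:k]
```

Varying the weights trades one criterion off against another, which is the flexible tradeoff the abstract refers to: a recency-heavy weighting surfaces fresh documents, a relevance-heavy one reverts to classical topical ranking.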
The Most Influential Paper Gerard Salton Never Wrote
Gerard Salton is often credited with developing the vector space model
(VSM) for information retrieval (IR). Citations to Salton give the impression
that the VSM must have been articulated as an IR model sometime between
1970 and 1975. However, the VSM as it is understood today evolved over a
longer time period than is usually acknowledged, and an articulation of the
model and its assumptions did not appear in print until several years after
those assumptions had been criticized and alternative models proposed. An
often-cited overview paper titled "A Vector Space Model for Information
Retrieval" (alleged to have been published in 1975) does not exist, and
citations to it represent a confusion of two 1975 articles, neither of which
was an overview of the VSM as a model of information retrieval. Until the
late 1970s, Salton did not present vector spaces as models of IR generally
but rather as models of specific computations. Citations to the phantom
paper reflect an apparently widely held misconception that the operational
features and explanatory devices now associated with the VSM must have
been introduced at the same time it was first proposed as an IR model.
A Formal Description of the Structural-Parametric Characteristics of Technical Text
Structural-parametric characteristics of text documents are proposed, together with their justification and their correspondence to the tasks of formally evaluating the blocks of a technical document according to their level in the hierarchy. It is shown that, provided the requirements of a narrowly specialized application are met, these characteristics make it possible to obtain estimates of semantic proximity and comparability, and a measure of semantic correspondence, for structural units of technical natural language.
Diversification Based Static Index Pruning - Application to Temporal Collections
Nowadays, web archives preserve the history of large portions of the web. As
media shift from printed to digital editions, accessing these huge information
sources is drawing increasingly more attention from national and international
institutions, as well as from the research community. These collections are
intrinsically big, leading to index files that do not fit into memory and to
increased query response times. Decreasing the index size is a direct way to
decrease this query response time.
Static index pruning methods reduce the size of indexes by removing a part of
the postings. In the context of web archives, it is necessary to remove
postings while preserving the temporal diversity of the archive. None of the
existing pruning approaches takes (temporal) diversification into account.
In this paper, we propose a diversification-based static index pruning
method. It differs from the existing pruning approaches by integrating
diversification within the pruning process. We aim to prune the index while
preserving retrieval effectiveness and diversity, by maximizing a given IR
evaluation metric such as DCG. We show how to apply this approach in the
context of web archives. Finally, we show on two collections that search
effectiveness in temporal collections after pruning can be improved using our
approach rather than diversity-oblivious approaches.
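The core tension in the abstract (prune by score, but keep temporal diversity) can be sketched as a two-pass posting-list pruner: first retain the best posting from each time bucket, then fill remaining slots with the highest-scoring leftovers. This is an illustrative heuristic, not the paper's DCG-based optimization; the bucket granularity (e.g. year) is an assumption.

```python
def prune_postings(postings, keep, bucket_of):
    """Prune one term's posting list to `keep` entries while retaining at
    least one posting per time bucket when possible.

    postings:  list of (doc_id, score) pairs.
    bucket_of: maps a doc_id to a temporal bucket (e.g. a year).
    """
    by_score = sorted(postings, key=lambda p: p[1], reverse=True)
    kept, seen_buckets = [], set()
    # Pass 1: the best posting from each bucket, for temporal diversity.
    for doc_id, score in by_score:
        b = bucket_of(doc_id)
        if b not in seen_buckets:
            kept.append((doc_id, score))
            seen_buckets.add(b)
    # Pass 2: fill remaining slots with the highest-scoring leftovers.
    for p in by_score:
        if len(kept) >= keep:
            break
        if p not in kept:
            kept.append(p)
    return kept[:keep]
```

A purely score-based pruner would keep only the top-scoring postings, potentially erasing an entire archive epoch from a term's posting list; the bucket pass prevents exactly that.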
A Flexible Solution for the Pre-processing Step in Text Mining
Text mining is an R&D area whose goal is the search for patterns, trends, and regularities in textual documents. The first and most important step of the text mining process comprises a set of procedures for reading the document collection and for identifying and selecting the statistically most significant attributes to represent the collection as an attribute-value matrix. The goal of this work is to present a flexible and extensible solution for the pre-processing task in text mining, capable of meeting the needs of different research projects on this topic.
Comment: CIIC 2012. No 12611
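The attribute-value matrix that this pre-processing step produces can be sketched in a few lines: rows are documents, columns are the terms kept by a statistical threshold, and cells hold term counts. This is a minimal illustration, assuming a document-frequency cutoff as the selection criterion; the actual solution's selection statistics are not specified here.

```python
from collections import Counter

def attribute_value_matrix(documents, min_df=1):
    """Build a term-document attribute-value matrix: one row per document,
    one column per term whose document frequency is at least `min_df`."""
    tokenized = [doc.lower().split() for doc in documents]
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vocab = sorted(t for t, n in df.items() if n >= min_df)
    matrix = [[Counter(tokens)[t] for t in vocab] for tokens in tokenized]
    return vocab, matrix
```

Raising `min_df` is the "statistical significance" filter in its simplest form: rare terms are dropped before the matrix is handed to the mining algorithms.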