Non-Compositional Term Dependence for Information Retrieval
Modelling term dependence in IR aims to identify co-occurring terms that are
too heavily dependent on each other to be treated as a bag of words, and to
adapt the indexing and ranking accordingly. Dependent terms are predominantly
identified using lexical frequency statistics, assuming that (a) if terms
co-occur often enough in some corpus, they are semantically dependent; (b) the
more often they co-occur, the more semantically dependent they are. This
assumption is not always correct: the frequency of co-occurring terms can be
separate from the strength of their semantic dependence. E.g. "red tape" might
be overall less frequent than "tape measure" in some corpus, but this does not
mean that "red"+"tape" are less dependent than "tape"+"measure". This is
especially the case for non-compositional phrases, i.e. phrases whose meaning
cannot be composed from the individual meanings of their terms (such as the
phrase "red tape" meaning bureaucracy). Motivated by this lack of distinction
between the frequency and strength of term dependence in IR, we present a
principled approach for handling term dependence in queries, using both lexical
frequency and semantic evidence. We focus on non-compositional phrases,
extending a recent unsupervised model for their detection [21] to IR. Our
approach, integrated into ranking using Markov Random Fields [31], yields
effectiveness gains over competitive TREC baselines, showing that there is
still room for improvement in the very well-studied area of term dependence in
IR.
New information retrieval systems
This article offers an overview of the research conducted on this new generation of information retrieval systems, describing their most important components and illustrating them with examples of systems already in use that are based on these new principles.
Automatic Semantic Header Generator
As the mounds of information and the number of Internet users grow, the problem of indexing
and retrieving of electronic information resources becomes more critical. The existing search
systems tend to generate misses and false hits due to the fact that they attempt to match
the specified search terms without proper context in the target information resource. In
environments that contain many different types of data, content indexing requires
type-specific processing to extract indexing information effectively. The COncordia INdexing and
DIscovery (Cindi) system is a system devised to support the registration of indexing
metadata for information resources and provide a convenient system for search and discovery.
The Semantic Header, containing the semantic contents of information resources stored in
the Cindi system, provides a useful tool to facilitate the searching for documents based on a
number of commonly used criteria. This paper presents an automatic tool for the extraction
and storage of some of the meta-information in a Semantic Header and the classification
scheme used for generating the subject headings.
Extracting Semantics of Documents Using Semantic Header Generator
Accurate representation of electronic information on the Internet is a solid foundation for precise information retrieval. However, existing search systems tend to generate misses and false hits because they attempt to match the specified search terms without context in the target information resource. Traditional keyword-based methods for representing the semantics of information items have clearly become a major obstacle to high precision. In this paper, we propose the notion of a Semantic Header to replace keyword indexing in extracting the meanings of information resources
that marks explicitly the logical structure of a document. The information from the Semantic Header could be used by the search system to help locate appropriate documents with minimum effort. We also introduce an automatic tool, called the Automatic Semantic Header Generator (ASHG), used for generating the meta-information for some significant fields of the Semantic Header.
Representação de conteúdo via indexação automática em textos integrais em língua portuguesa
The possibility of automatic derived indexing of Portuguese-language texts, working from their full text, is examined. Goffman's Transition Formula is applied to 10 articles in the field of Bibliometrics, and a probabilistic indexing algorithm is formulated. Goffman's Transition Formula is fully applicable to the Portuguese language, pointing to a region of word frequencies in which the words indicative of the content of the analysed articles are concentrated.
Keywords
Information retrieval. Automatic derived indexing. Goffman's Transition Formula.
Representation of contents by the automatic indexing process of full texts in Portuguese language
Abstract
The possibility of automatic derived indexing of full texts in Portuguese is verified. Ten papers in Bibliometrics were indexed and their different parts considered for quantitative and qualitative analysis. Structure and distribution patterns of words were studied. Goffman's transition formula proved to be adequate as a starting point for the indexing algorithm, which yielded, in all papers, a concentration zone for semantically loaded terms. The algorithm worked as an uncertainty reducer, leading to the semantically important words.
Keywords
Information retrieval. Automatic derived indexing. Goffman's transition formula
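The abstract above does not state Goffman's transition formula. As a minimal sketch, assuming the form commonly given in the bibliometrics literature, T = (-1 + sqrt(1 + 8 * I1)) / 2, where I1 is the number of words occurring exactly once, the idea of a "concentration zone" around the transition frequency could be illustrated as follows. The function names, the tokenizer, and the `window` parameter are hypothetical choices for illustration and are not taken from the paper.

```python
import math
import re
from collections import Counter


def goffman_transition_point(tokens):
    """Transition frequency T = (-1 + sqrt(1 + 8 * I1)) / 2.

    I1 is the number of distinct words that occur exactly once.
    The formula is assumed from the general literature, not from the abstract.
    """
    freqs = Counter(tokens)
    hapax = sum(1 for count in freqs.values() if count == 1)
    return (-1 + math.sqrt(1 + 8 * hapax)) / 2


def candidate_index_terms(text, window=0.25):
    """Return words whose frequency lies within +/- window * T of T.

    The relative window is a hypothetical way to define the zone around T.
    """
    tokens = re.findall(r"\w+", text.lower())  # Unicode-aware, keeps accents
    freqs = Counter(tokens)
    t = goffman_transition_point(tokens)
    low, high = t * (1 - window), t * (1 + window)
    return sorted(word for word, count in freqs.items() if low <= count <= high)
```

For a text with three words that occur once, the transition point is 2, so words occurring about twice would be proposed as candidate index terms; a real indexer would also filter stop words.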