Report on the XML Mining Track at INEX 2005 and INEX 2006, Categorization and Clustering of XML Documents
This article reports on the two years of the XML Mining track at INEX (2005 and 2006). We focus here on the classification and clustering of XML documents. We detail these two tasks and the corpus used for the challenge, then summarize the methods proposed by the participants. Finally, we compare the results obtained during the two years of the track.
XML documents clustering using a tensor space model
The traditional Vector Space Model (VSM) is not able to represent both the structure and the content of XML documents. This paper introduces a novel method of representing XML documents in a Tensor Space Model (TSM) and then using that representation for clustering. Empirical analysis shows that the proposed method scales to large datasets, and that the factorized matrices it produces help to improve cluster quality through an enriched document representation that captures both structure and content information.
The State-of-the-arts in Focused Search
The continuous influx of text data on the Web requires search engines to improve their ability to retrieve more specific information. The need for results relevant to a user's topic of interest has gone beyond searching for domain-specific or type-specific documents to more focused results (e.g. document fragments or answers to a query). The introduction of XML provides a standard format for data representation, storage, and exchange. It allows focused search to be carried out at different granularities of a structured document with XML markup. This report reviews the state of the art in focused search, particularly techniques for topic-specific document retrieval, passage retrieval, XML retrieval, and entity ranking. It concludes by highlighting open problems.
Use of Wikipedia Categories in Entity Ranking
Wikipedia is a useful source of knowledge that has many applications in
language processing and knowledge representation. The Wikipedia category graph
can be compared with the class hierarchy in an ontology; it has some
characteristics in common as well as some differences. In this paper, we
present our approach for answering entity ranking queries from Wikipedia.
In particular, we explore how to make use of Wikipedia categories to improve
entity ranking effectiveness. Our experiments show that using categories of
example entities works significantly better than using loosely defined target
categories.
Entity Ranking in Wikipedia
The traditional entity extraction problem is that of extracting
named entities from plain text using natural language processing techniques and
intensive training from large document collections. Examples of named entities
include organisations, people, locations, or dates. There are many research
activities involving named entities; we are interested in entity ranking in the
field of information retrieval. In this paper, we describe our approach to
identifying and ranking entities from the INEX Wikipedia document collection.
Wikipedia offers a number of interesting features for entity identification and
ranking that we first introduce. We then describe the principles and the
architecture of our entity ranking system, and introduce our methodology for
evaluation. Our preliminary results show that the use of categories and the
link structure of Wikipedia, together with entity examples, can significantly
improve retrieval effectiveness.
Seven years of INEX interactive retrieval experiments – lessons and challenges
This paper summarizes a major effort in interactive search investigation,
the INEX i-track, a collective effort run over a seven-year period. We present
the experimental conditions, report some of the findings of the participating
groups, and examine the challenges posed by this kind of collective experimental
effort.
Investigating the document structure as a source of evidence for multimedia fragment retrieval
Multimedia objects can be retrieved using their context, for instance the text surrounding them in documents. This text may be either near to or far from the searched objects. Our goal in this paper is to study the impact, in terms of effectiveness, of text position relative to the searched objects. The multimedia objects we consider are described in structured documents such as XML documents. The document structure is therefore exploited to determine this text position. Although structural information has been shown to be an effective source of evidence in textual information retrieval, only a few works have investigated its value for multimedia retrieval. More precisely, the task we address in this paper is the retrieval of multimedia fragments (i.e. XML elements containing at least one multimedia object). Our general approach is built on two steps: we first retrieve XML elements containing multimedia objects, and we then explore the surrounding information to retrieve relevant multimedia fragments. In both cases, we study the impact of the surrounding information using the document structure. Our work is carried out on images, but it can be extended to any other media, since the physical content of multimedia objects is not used. We conducted several experiments in the context of the Multimedia track of the INEX evaluation campaign. Results showed that structural evidence is valuable for tuning the importance of textual context in multimedia retrieval. Moreover, the proposed approach outperforms state-of-the-art approaches.
The effect of granularity and order in XML element retrieval
The article presents an analysis of the effect of granularity and order in an XML-encoded collection of full-text journal articles. 218 sessions of searchers performing simulated work tasks in the collection have been analysed. The results show that searchers prefer to use smaller sections of the article as their source of information. In interaction sessions during which articles are assessed, however, the articles are to a large degree evaluated as more important than the articles' sections and subsections.
BM25t: a BM25 extension for focused information retrieval
This paper addresses the integration of XML tags into a term-weighting function for focused XML Information Retrieval (IR). Our model allows us to consider a certain kind of structural information: tags that represent a logical structure (e.g. title, section, paragraph, etc.) as well as other tags (e.g. bold, italic, center, etc.). We take into account the influence of a tag by estimating the probability for this tag to distinguish relevant terms from the others. Then, these weights are integrated in a term-weighting function. Experiments on a large collection from the INEX 2008 XML IR evaluation campaign showed improvements on focused XML retrieval.