15 research outputs found

    The State-of-the-arts in Focused Search

    The continuous influx of text data on the Web requires search engines to improve their ability to retrieve more specific information. The need for results relevant to a user's topic of interest has moved beyond searching for domain-specific or type-specific documents toward more focused results (e.g. document fragments or answers to a query). XML provides a standard format for data representation, storage, and exchange, and its markup allows focused search to be carried out at different granularities of a structured document. This report reviews the state of the art in focused search, particularly techniques for topic-specific document retrieval, passage retrieval, XML retrieval, and entity ranking. It concludes by highlighting open problems.

    Investigating the document structure as a source of evidence for multimedia fragment retrieval

    Multimedia objects can be retrieved using their context, for instance the text surrounding them in documents. This text may be either near to or far from the searched objects. Our goal in this paper is to study how the position of text relative to the searched objects affects retrieval effectiveness. The multimedia objects we consider are described in structured documents such as XML documents, so the document structure is exploited to determine this text position. Although structural information has been shown to be an effective source of evidence in textual information retrieval, only a few works have investigated its value for multimedia retrieval. More precisely, the task we address is the retrieval of multimedia fragments (i.e. XML elements containing at least one multimedia object). Our general approach has two steps: we first retrieve XML elements containing multimedia objects, and we then explore the surrounding information to retrieve relevant multimedia fragments. In both steps, we study the impact of the surrounding information using the document structure. Our work is carried out on images, but it can be extended to any other media, since the physical content of multimedia objects is not used. We conducted several experiments in the context of the Multimedia track of the INEX evaluation campaign. The results show that structural evidence is of high interest for tuning the importance of textual context in multimedia retrieval. Moreover, the proposed approach outperforms state-of-the-art approaches.

    Semantics and result disambiguation for keyword search on tree data

    Keyword search is a popular technique for searching tree-structured data (e.g., XML, JSON) on the web because it frees the user from learning a complex query language and the structure of the data sources. However, the convenience of keyword search comes with drawbacks. The imprecision of keyword queries usually produces a very large number of results, of which only very few are relevant to the query. Multiple previous approaches have tried to address this problem. Some of them exploit structural and semantic properties of the tree data to filter out irrelevant results, while others use a scoring function to rank the candidate results. These are not easy tasks, though, and in both cases relevant results might be missed and users might spend a significant amount of time searching for their intended result in a plethora of candidates. Another drawback of keyword search on tree data, also due to the inability of keyword queries to express the user intent precisely, is that the query answer may contain different types of meaningful results even though the user is interested in only some of them. Both problems of keyword search on tree data are addressed in this dissertation. First, an original approach for answering keyword queries is proposed. This approach extracts structural patterns of the query matches and reasons with them in order to return meaningful results ranked by their relevance to the query. The proposed semantics performs comparisons between patterns of results by using different types of homomorphisms between the patterns. These comparisons are used to organize the patterns into a graph of patterns, which is leveraged to determine ranking and filtering semantics. The experimental results show that the approach produces query results of higher quality than previous approaches. To address the second problem, an original approach for clustering the keyword search results on tree data is introduced.
The clustered output allows the user to focus on a subset of the results and to save time and effort while looking for the relevant ones. The approach performs clustering at different levels of granularity to group similar results together effectively. The similarity of results and result clusters is decided using relations on structural patterns of the results, defined in terms of homomorphisms between path patterns. An original feature of the clustering approach is that the clusters are ranked at different levels of granularity to guide the user quickly to the relevant result patterns. An efficient stack-based algorithm is presented for generating result patterns and constructing the clustering hierarchy. Extensive experimentation with multiple real datasets shows that the algorithm is fast and scalable. It also shows that the clustering methodology allows users to retrieve their intended results effectively, and that it outperforms a recent state-of-the-art clustering approach. To tackle the second problem from a different angle, diversification of keyword search results is also addressed. Diversification aims to provide users with a ranked list of results that balances the relevance and redundancy of the results. Measures for quantifying the relevance and dissimilarity of result patterns are presented, and a heuristic for generating a diverse set of results using these metrics is introduced.
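
The relevance/redundancy trade-off described in this abstract can be illustrated with a greedy selection loop. The sketch below is only an assumption about what such a heuristic might look like (an MMR-style rule), not the dissertation's algorithm: the function name `diversify`, the `trade_off` weight, the Jaccard-based `dissimilarity`, and the toy patterns are all hypothetical.

```python
from typing import Callable, Dict, FrozenSet, List, Sequence


def diversify(
    candidates: Sequence[str],
    relevance: Dict[str, float],
    dissimilarity: Callable[[str, str], float],
    k: int,
    trade_off: float = 0.5,
) -> List[str]:
    """Greedily pick k results, balancing relevance against redundancy.

    Each step picks the candidate maximizing
    trade_off * relevance + (1 - trade_off) * (min dissimilarity to picks).
    """
    selected: List[str] = []
    remaining = list(candidates)
    while remaining and len(selected) < k:
        def gain(c: str) -> float:
            # Novelty is 1.0 while nothing has been selected yet.
            novelty = min((dissimilarity(c, s) for s in selected), default=1.0)
            return trade_off * relevance[c] + (1 - trade_off) * novelty

        best = max(remaining, key=gain)
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical result patterns, each reduced to the set of labels it contains.
PATTERNS: Dict[str, FrozenSet[str]] = {
    "a": frozenset({"book", "title"}),
    "b": frozenset({"book", "title", "year"}),
    "c": frozenset({"article", "author"}),
}
RELEVANCE = {"a": 0.9, "b": 0.85, "c": 0.5}


def jaccard_dissimilarity(x: str, y: str) -> float:
    """1 - |intersection| / |union| over the label sets (an assumed measure)."""
    px, py = PATTERNS[x], PATTERNS[y]
    return 1.0 - len(px & py) / len(px | py)


diverse = diversify(["a", "b", "c"], RELEVANCE, jaccard_dissimilarity, k=2)
relevance_only = diversify(
    ["a", "b", "c"], RELEVANCE, jaccard_dissimilarity, k=2, trade_off=1.0
)
```

With equal weighting, the second pick is the structurally different pattern `c` rather than the near-duplicate `b`; setting `trade_off=1.0` falls back to pure relevance ranking.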

    Probabilistic retrieval models - relationships, context-specific application, selection and implementation

    Retrieval models are the core components of information retrieval systems; they guide the document and query representations as well as the document ranking schemes. TF-IDF, the binary independence retrieval (BIR) model and language modelling (LM) are three of the most influential contemporary models owing to their stability and performance. The BIR model and LM are grounded in probability theory, whereas TF-IDF is viewed as a heuristic model whose theoretical justification continues to fascinate researchers. This thesis first investigates the parallel derivation of the BIR model, LM and the Poisson model with respect to event spaces, relevance assumptions and ranking rationales. It establishes a bridge between the BIR model and LM, and derives TF-IDF from the probabilistic framework. The thesis then presents the probabilistic logical modelling of the retrieval models. Various ways of estimating and aggregating probabilities, and alternative implementations of non-probabilistic operators, are demonstrated. Typical models have been implemented. The next contribution concerns the use of context-specific frequencies, i.e., frequencies counted over particular element types or within different text scopes. The hypothesis is that they can help to rank the elements in structured document retrieval. The thesis applies context-specific frequencies to the term weighting schemes in these models, and the outcome is a generalised retrieval model for both element and document ranking. The retrieval models behave differently on the same query set: for some queries one model performs better, for other queries another model is superior. Therefore, one idea for improving the overall performance of a retrieval system is to choose, for each query, the model that is likely to perform best.
This thesis proposes and empirically explores a model selection method based on the correlation between query features and query performance, which contributes to the methodology of dynamically choosing a model. In summary, this thesis contributes a study of probabilistic models and their relationships, the probabilistic logical modelling of retrieval models, the use and effect of context-specific frequencies in models, and the selection of retrieval models.
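
For readers unfamiliar with the heuristic that this thesis relates to the probabilistic models, a minimal TF-IDF ranking can be sketched as follows. This is the textbook weighting (raw term frequency times log inverse document frequency) under simplifying assumptions such as whitespace tokenisation; it is not the thesis's derivation or its context-specific variant, and the function name `tf_idf_scores` and the toy documents are illustrative only.

```python
import math
from collections import Counter
from typing import List


def tf_idf_scores(query: str, docs: List[str]) -> List[float]:
    """Score each document against the query with plain TF-IDF.

    score(d, q) = sum over query terms t of tf(t, d) * log(N / df(t)),
    where N is the number of documents and df(t) the number containing t.
    """
    n_docs = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    query_terms = query.lower().split()

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if doc_freq[term] > 0:  # skip terms absent from the collection
                score += term_freq[term] * math.log(n_docs / doc_freq[term])
        scores.append(score)
    return scores


DOCS = [
    "xml retrieval model",
    "language model",
    "xml xml element retrieval",
]
xml_scores = tf_idf_scores("xml", DOCS)
```

A term that occurs in every document gets an inverse document frequency of log(1) = 0 and so contributes nothing to the ranking.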

    Recherche d'information dans les documents XML : prise en compte des liens pour la sélection d'éléments pertinents

    156 p. : ill. ; 30 cm. Our work lies in the field of information retrieval (IR), and more specifically information retrieval in semi-structured documents such as XML. Effective exploitation of the available XML documents must take their structural dimension into account. This dimension has given rise to new challenges in the IR field. Unlike classical IR approaches, which focus on retrieving unstructured content, XML IR combines textual and structural information to carry out various retrieval tasks. Several approaches exploiting these types of evidence have been proposed; they are mainly based on classical IR models adapted to XML documents. The XML structure has been used to provide focused access to documents by returning document components (for example sections or paragraphs) instead of a whole document in response to a user query. In traditional IR, the similarity measure is generally based on textual information. It ranks documents by their degree of relevance using measures such as "term similarity" or "term probability". However, other sources of evidence can be considered when searching for relevant information in documents. For example, hypertext links have been widely exploited in Web IR. Despite their popularity on the Web, few approaches exploiting this source of evidence have been proposed for XML IR. The aim of our work is to propose approaches for using links as a source of evidence in XML information retrieval. This thesis seeks to answer the following research questions: 1.
Can links be considered a source of evidence in the context of XML IR? 2. Does the use of certain link analysis algorithms in XML IR improve the quality of the results, particularly for the Wikipedia collection? 3. Which types of links can best improve the relevance of search results? 4. How should the link score of the different elements returned as search results be computed? Should "document-document" links be considered or, more precisely, "element-element" links? What is the weight of navigational links relative to hierarchical links? 5. What is the impact of using links in a global or a local context? 6. How should the link score be integrated into the computation of the final score of the returned XML elements? 7. What is the impact of the quality of the top results on the behaviour of the proposed formulas? To answer these questions, we conducted a statistical study of the search results returned by the "DALIAN" information retrieval system, using the test collection provided by INEX. The study clearly showed that links are an indicator of element relevance in XML IR. We also implemented three link analysis algorithms (Pagerank, HITS and SALSA), which allowed us to carry out a comparative study showing that "query-dependent" approaches perform better than "global context" approaches. In this thesis we proposed three formulas for computing the link score: the first is called "Topical Pagerank", the second is the "distance-based" formula, and the third is the "weighted links based" formula. We also proposed three combination formulas, namely the linear formula, the Dempster-Shafer formula and the fuzzy-based formula.
Finally, we conducted a series of experiments. All of these experiments showed that: the proposed approaches improved the relevance of the results for the different configurations tested; "query-dependent" approaches perform better than global context approaches; approaches exploiting "element-element" links obtained good results; and the combination formulas that rely on the principle of uncertainty to compute the final scores of XML elements achieve good performance.
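
As a rough illustration of how a link-analysis score can be computed and then combined linearly with a content score, the sketch below runs a standard PageRank power iteration over a small link graph. It is a generic version under stated assumptions and not the thesis's "Topical Pagerank", "distance-based" or "weighted links based" formulas: the damping factor, the iteration count, the weight `alpha`, the function names, and the toy graph are all hypothetical.

```python
from typing import Dict, List


def pagerank(
    links: Dict[str, List[str]], damping: float = 0.85, iterations: int = 50
) -> Dict[str, float]:
    """Plain PageRank by power iteration; ranks sum to 1.

    `links` maps each node to the nodes it points to. Nodes without
    out-links spread their rank uniformly over all nodes.
    """
    nodes = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        dangling = sum(rank[node] for node in nodes if not links.get(node))
        new_rank = {
            node: (1.0 - damping) / n + damping * dangling / n for node in nodes
        }
        for source, targets in links.items():
            if targets:
                share = damping * rank[source] / len(targets)
                for target in targets:
                    new_rank[target] += share
        rank = new_rank
    return rank


def combine(content_score: float, link_score: float, alpha: float = 0.7) -> float:
    """Linear combination of a content score and a link score (alpha assumed)."""
    return alpha * content_score + (1.0 - alpha) * link_score


# Hypothetical element-to-element link graph.
LINKS = {"a": ["b"], "b": ["c"], "c": ["a"], "d": ["a"]}
ranks = pagerank(LINKS)
final_a = combine(content_score=0.8, link_score=ranks["a"])
```

Here `d` has no incoming links, so it keeps only the baseline share (1 - damping) / n, while `a` accumulates rank from both `c` and `d`.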

    Modèle flexible pour la Recherche d'Information dans des corpus de documents semi-structurés

    Structural information contained in semi-structured documents can be used to focus on relevant information. The aim of an Information Retrieval System is then to retrieve relevant information units instead of whole documents. We propose here the XFIRM model (XML Flexible Information Retrieval Model), which is based on: (i) a generic data representation model, allowing the modelling of documents with heterogeneous structures; (ii) a flexible query language that lets users express their needs at many degrees of precision, with or without conditions on the document structure; (iii) a retrieval model based on a relevance propagation method, which aims at finding the most exhaustive and specific information units answering the query. The value of our proposals has been shown with the prototype we developed. The nature of information sources is evolving: traditional flat digital documents containing only text are being enriched with structural and multimedia information. This evolution is accelerated by the expansion of the Web, and semi-structured documents such as XML (eXtensible Markup Language) tend to make up the majority of the digital documents made available to users. Developing automated tools that provide efficient access to this new type of digital information appears to be a necessity. To make the best use of all the available information, existing Information Retrieval (IR) methods must be adapted. The structural information of documents can indeed be used to refine the concept of document granule. The aim of Information Retrieval Systems (IRS) is then to retrieve information units (rather than documents) that are relevant to user queries.
To address this fundamental problem, new models must be built that take the structural information of documents into account at the indexing, querying and retrieval levels. The objective of our work is to propose a model for performing flexible searches in corpora of semi-structured documents. This led us to propose the XFIRM model (XML Flexible Information Retrieval Model), which rests on: (i) a generic data representation model that can model documents with different structures; (ii) a flexible query language that lets users express their needs at various degrees of precision, with or without conditions on the document structure; (iii) a retrieval model based on a relevance propagation method. This model aims to find the most exhaustive and specific information units answering a user query, whether or not the query contains structural conditions. Semi-structured documents can be represented as trees, and the goal is then to find the subtrees of minimal size that answer the query. Content-only searches are performed by taking into account the varying importance of the leaves of the subtrees and by placing the subtrees in their context, that is, by taking the relevance of the document into account. Searches on both content and structure are performed through several relevance propagations in the document tree, in order to achieve a vague match between the document tree and the query tree.
The evaluation of our model with the prototype we developed shows the value of our proposals, both for content-only searches and for searches on content and structure.
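
The idea of propagating relevance from leaves up a document tree can be sketched in a few lines. The recursion below is an assumed simplification, not the XFIRM propagation formula: each node's score is its own leaf score plus a decayed sum of its children's scores, so a leaf at distance d contributes decay^d. The `Node` class, the `decay` value, the assumption that labels are unique, and the toy document are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Node:
    """A document-tree node; only leaves carry a content score here."""
    label: str  # assumed unique within the tree
    leaf_score: float = 0.0
    children: List["Node"] = field(default_factory=list)


def propagate(
    node: Node, decay: float = 0.6, scores: Optional[Dict[str, float]] = None
) -> Dict[str, float]:
    """Return {label: score}, propagating leaf scores upward.

    score(node) = leaf_score(node) + decay * sum(score(child)),
    so a leaf at distance d from a node contributes decay**d of its score.
    """
    if scores is None:
        scores = {}
    total = node.leaf_score
    for child in node.children:
        propagate(child, decay, scores)
        total += decay * scores[child.label]
    scores[node.label] = total
    return scores


# Hypothetical document: an article with two sections and three paragraphs.
document = Node("article", children=[
    Node("sec1", children=[Node("p1", 1.0), Node("p2", 0.5)]),
    Node("sec2", children=[Node("p3", 0.0)]),
])
scores = propagate(document)
best = max(scores, key=scores.get)
```

Because of the decay, a small element that matches the query strongly outranks its larger ancestors, which loosely mirrors the preference for specific information units over whole documents.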

    Un modèle de recherche d'information agrégée basée sur les réseaux bayésiens dans des documents semi-structurés

    We propose an information retrieval model based on Bayesian networks. In this model, the user's query triggers a propagation process that selects the relevant elements. We aim to return an aggregate to the user instead of a list of elements. The aggregate built from a document is considered to be a set of elements, or an information unit (a portion of a document), that best answers the user's query. To qualify as an answer to the query, this aggregate must satisfy three criteria: relevance, non-redundancy and complementarity. The value of the returned aggregates is that they give the user an overview of the informational content of the query in the document collection. To validate our model, we evaluated it in the INEX 2009 evaluation campaign (using more than 2,666,000 XML documents from the online encyclopedia Wikipedia). The experiments show the value of this approach by highlighting the impact of aggregating such elements. The work described in this thesis is concerned with aggregated search on XML elements. We propose new approaches to aggregating and pruning using different sources of evidence (content and structure). We propose a model based on Bayesian networks. The dependency relationships between query and terms, and between terms and elements, are quantified by probability measures. In this model, the user's query triggers a propagation process to find XML elements. We seek to return to the user an aggregate instead of a list of XML elements. The aggregate built from a document is considered an information unit (or a portion of this document) that best meets the user's query.
This aggregate must satisfy three criteria, namely relevance, non-redundancy and complementarity, in order to answer the query. The value of the returned aggregates is that they give the user an overview of the information need in the collection.