21 research outputs found

    Focused Retrieval

    Traditional information retrieval applications, such as Web search, return atomic units of retrieval, which are generically called "documents". Depending on the application, a document may be a Web page, an email message, a journal article, or any similar object. In contrast to this traditional approach, focused retrieval helps users pinpoint their exact information needs by returning results at the sub-document level. These results may consist of predefined document components (such as pages, sections, and paragraphs) or of arbitrary passages, comprising any sub-string of a document. If a document is marked up with XML, a focused retrieval system might return individual XML elements or ranges of elements. This thesis proposes and evaluates a number of approaches to focused retrieval, including methods based on XML markup and methods based on arbitrary passages. It considers the best unit of retrieval, explores methods for efficient sub-document retrieval, and evaluates formulae for sub-document scoring. Focused retrieval is also considered in the specific context of Wikipedia, where methods for automatic vandalism detection and automatic link generation are developed and evaluated.

    Recherche d'information dans les documents XML : prise en compte des liens pour la sélection d'éléments pertinents (Information retrieval in XML documents: using links to select relevant elements)

    156 pp., illustrated, 30 cm. Our work is set in the context of information retrieval (IR), and more specifically information retrieval in semi-structured XML documents. Making effective use of the available XML documents requires taking their structural dimension into account, and this dimension has given rise to new challenges in IR. Unlike classical IR approaches, which focus on searching unstructured content, XML IR combines textual and structural information to carry out different retrieval tasks. Several approaches exploiting these types of evidence have been proposed; they are mainly based on classical IR models adapted to XML documents. The XML structure has been used to provide focused access to documents by returning document components (for example, sections and paragraphs) instead of a whole document in response to a user query. In traditional IR, the similarity measure is generally based on textual information. It ranks documents by their degree of relevance using measures such as "term similarity" or "term probability". However, other sources of evidence can be considered when searching for relevant information in documents. For example, hypertext links have been widely exploited in Web IR. Despite their popularity on the Web, few approaches exploiting this source of evidence have been proposed for XML IR. The goal of our work is to propose approaches that use links as a source of evidence in XML information retrieval. This thesis aims to answer the following research questions: 1. Can links be considered a source of evidence in XML IR? 2. Does the use of certain link analysis algorithms in XML IR improve the quality of results, in particular for the Wikipedia collection? 3. Which types of links best improve the relevance of search results? 4. How should the link score of the elements returned as search results be computed? Should "document-document" links be considered or, more precisely, "element-element" links? What is the weight of navigational links relative to hierarchical links? 5. What is the impact of using links in a global or a local context? 6. How should the link score be integrated into the final score of the returned XML elements? 7. What is the impact of the quality of the top-ranked results on the behaviour of the proposed formulas? To answer these questions, we carried out a statistical study of the search results returned by the "DALIAN" information retrieval system, using the test collection provided by INEX; the study clearly showed that links are an indicator of element relevance in XML IR. We also implemented three link analysis algorithms (PageRank, HITS, and SALSA), which allowed us to carry out a comparative study showing that "query-dependent" approaches outperform "global context" approaches. In this thesis we proposed three formulas for computing the link score: the first is called "Topical Pagerank", the second is "distance-based", and the third is "weighted links based". We also proposed three combination formulas, namely the linear formula, the Dempster-Shafer formula, and the fuzzy-based formula. Finally, we conducted a series of experiments. All of these experiments showed that the proposed approaches improved the relevance of results for the different configurations tested; that "query-dependent" approaches outperform "global context" approaches; that approaches exploiting "element-element" links obtained good results; and that combination formulas based on the principle of uncertainty for computing the final scores of XML elements achieve good performance.
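As an illustration of the ideas in this abstract, the following is a minimal sketch of a query-dependent link score combined linearly with a content score. It is not the thesis's "Topical Pagerank" or any of its combination formulas, whose definitions are not given here. It assumes plain PageRank run only over links among the retrieved elements, max-normalised scores, and a hypothetical mixing weight `alpha`; all function and parameter names are illustrative.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict {node: [outgoing targets]}."""
    nodes = set(links) | {t for targets in links.values() for t in targets}
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling node: spread its rank evenly over all nodes.
                share = damping * rank[node] / n
                for other in nodes:
                    new_rank[other] += share
        rank = new_rank
    return rank


def rerank(content_scores, links, alpha=0.7):
    """Rank retrieved elements by a linear mix of content and link scores.

    Assumption: the link graph is restricted to the retrieved elements
    (a query-dependent, local context), and both scores are divided by
    their maximum before mixing.
    """
    retrieved = set(content_scores)
    local_links = {
        e: [t for t in links.get(e, []) if t in retrieved] for e in retrieved
    }
    link_scores = pagerank(local_links)
    max_content = max(content_scores.values())
    max_link = max(link_scores.values())
    combined = {
        e: alpha * content_scores[e] / max_content
        + (1.0 - alpha) * link_scores[e] / max_link
        for e in retrieved
    }
    return sorted(combined, key=combined.get, reverse=True)
```

In this sketch, an element that many other retrieved elements link to can overtake an element with a slightly higher content score, which is the effect a link-based source of evidence is meant to capture.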

    BM25t: a BM25 extension for focused information retrieval

    25 pages. International audience. This paper addresses the integration of XML tags into a term-weighting function for focused XML Information Retrieval (IR). Our model allows us to consider a certain kind of structural information: tags that represent a logical structure (e.g. title, section, paragraph, etc.) as well as other tags (e.g. bold, italic, center, etc.). We take into account the influence of a tag by estimating the probability for this tag to distinguish relevant terms from the others. Then, these weights are integrated in a term-weighting function. Experiments on a large collection from the INEX 2008 XML IR evaluation campaign showed improvements on focused XML retrieval.
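One way to picture tag weights entering a term-weighting function is sketched below. This is not the BM25t formula from the paper, which the abstract does not give. It assumes a standard BM25 term weight (with the smoothed idf that adds 1 inside the logarithm) and a hypothetical scheme in which each occurrence of a term counts as the product of the weights of its enclosing tags; the tag weights themselves are made-up inputs.

```python
import math


def bm25_term_weight(tf, doc_len, avg_doc_len, n_docs, doc_freq, k1=1.2, b=0.75):
    """Standard BM25 weight of one term in one document (or element)."""
    idf = math.log((n_docs - doc_freq + 0.5) / (doc_freq + 0.5) + 1.0)
    length_norm = k1 * (1.0 - b + b * doc_len / avg_doc_len)
    return idf * tf * (k1 + 1.0) / (tf + length_norm)


def tag_weighted_tf(occurrence_tags, tag_weights):
    """Term frequency in which each occurrence is scaled by its tags.

    occurrence_tags: one list of enclosing tag names per occurrence.
    tag_weights: hypothetical per-tag multipliers; unknown tags count as 1.0.
    """
    total = 0.0
    for tags in occurrence_tags:
        weight = 1.0
        for tag in tags:
            weight *= tag_weights.get(tag, 1.0)
        total += weight
    return total
```

For example, `tag_weighted_tf([["title"], ["p"]], {"title": 2.0})` counts the title occurrence twice as heavily as the paragraph occurrence, and the resulting value can be passed as `tf` to `bm25_term_weight`.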

    XML retrieval using pruned element-index files

    An element-index is a crucial mechanism for supporting content-only (CO) queries over XML collections. A full element-index, which indexes each element along with the content of its descendants, involves high redundancy and reduces query processing efficiency. A direct index, on the other hand, indexes only the content that is directly under each element and disregards the descendants. This results in a smaller index, but possibly at the cost of some reduction in system effectiveness. In this paper, we propose using static index pruning techniques to obtain more compact index files that can still deliver retrieval performance comparable to that of a full index. We also compare the retrieval performance of these pruning-based approaches to some other strategies that make use of a direct element-index. Our experiments, conducted along the lines of the INEX evaluation framework, reveal that pruned index files yield retrieval performance comparable to, or even better than, that of the full index and the direct index for several tasks in the ad hoc track. © 2010 Springer-Verlag Berlin Heidelberg
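To make the idea of static index pruning concrete, here is a minimal sketch of one common term-centric form: for each term, keep only the highest-scoring postings. The abstract does not say which pruning techniques the paper uses, so the top-k rule, the in-memory index layout, and the names `prune_index` and `keep_top_k` are assumptions for illustration only.

```python
def prune_index(index, keep_top_k):
    """Keep the keep_top_k highest-scoring postings of every term.

    index: dict mapping a term to a list of (element_id, score) postings.
    Returns a new, smaller index; the input is left unchanged.
    """
    pruned = {}
    for term, postings in index.items():
        ranked = sorted(postings, key=lambda posting: posting[1], reverse=True)
        pruned[term] = ranked[:keep_top_k]
    return pruned


def index_size(index):
    """Total number of postings, as a rough measure of index size."""
    return sum(len(postings) for postings in index.values())
```

Pruning happens once, at indexing time, so query processing afterwards only ever touches the shorter posting lists.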

    Applying Wikipedia to Interactive Information Retrieval

    There are many opportunities to improve the interactivity of information retrieval systems beyond the ubiquitous search box. One idea is to use knowledge bases (e.g. controlled vocabularies, classification schemes, thesauri and ontologies) to organize, describe and navigate the information space. These resources are popular in libraries and specialist collections, but have proven too expensive and narrow to be applied to everyday web-scale search. Wikipedia has the potential to bring structured knowledge into more widespread use. This online, collaboratively generated encyclopaedia is one of the largest and most consulted reference works in existence. It is broader, deeper and more agile than the knowledge bases put forward to assist retrieval in the past. Rendering this resource machine-readable is a challenging task that has captured the interest of many researchers. Many see it as a key step required to break the knowledge acquisition bottleneck that crippled previous efforts. This thesis claims that the roadblock can be sidestepped: Wikipedia can be applied effectively to open-domain information retrieval with minimal natural language processing or information extraction. The key is to focus on gathering and applying human-readable rather than machine-readable knowledge. To demonstrate this claim, the thesis tackles three separate problems: extracting knowledge from Wikipedia; connecting it to textual documents; and applying it to the retrieval process. First, we demonstrate that a large thesaurus-like structure can be obtained directly from Wikipedia, and that accurate measures of semantic relatedness can be efficiently mined from it. Second, we show that Wikipedia provides the necessary features and training data for existing data mining techniques to accurately detect and disambiguate topics when they are mentioned in plain text. Third, we provide two systems and user studies that demonstrate the utility of the Wikipedia-derived knowledge base for interactive information retrieval.

    Using Explicit Semantic Analysis to Link in Multi-Lingual Document Collections

    Keeping links up to date in quickly growing document collections is problematic, and the problem is exacerbated when the collections are multilingual. We utilize Explicit Semantic Analysis to help identify relevant documents and links across languages without machine translation. We designed and implemented several approaches as part of our prototype link discovery system. Evaluation was conducted on the Chinese, Czech, English and Spanish Wikipedias. We also discuss the evaluation methodology for such systems and assess the agreement between links in different language versions of Wikipedia. In addition, we evaluate properties of Explicit Semantic Analysis that are important for its practical use.
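The following sketch shows, in very reduced form, how concept vectors in the style of Explicit Semantic Analysis can be compared across languages without translating the text. It is not the system described in the abstract. It assumes a precomputed table of term-to-concept weights per language and a table of inter-language links between concepts; both tables and all names here are hypothetical.

```python
import math
from collections import defaultdict


def esa_vector(text, term_concept_weights):
    """Sum the concept weights of every known term in the text."""
    vector = defaultdict(float)
    for term in text.lower().split():
        for concept, weight in term_concept_weights.get(term, {}).items():
            vector[concept] += weight
    return dict(vector)


def map_concepts(vector, language_links):
    """Re-express a concept vector in another language's concept space.

    Concepts without an inter-language link are dropped.
    """
    mapped = defaultdict(float)
    for concept, weight in vector.items():
        if concept in language_links:
            mapped[language_links[concept]] += weight
    return dict(mapped)


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(concept, 0.0) for concept, weight in u.items())
    norm_u = math.sqrt(sum(weight * weight for weight in u.values()))
    norm_v = math.sqrt(sum(weight * weight for weight in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)
```

A Czech text is first turned into a vector over Czech concepts, mapped through the inter-language links into English concepts, and then compared with the vector of an English text by cosine similarity.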

    The Role of Context in Matching and Evaluation of XML Information Retrieval

    With the growth of electronic collections, the spread of everyday searching, and the increasing use of mobile devices, one goal in developing information retrieval methods is to achieve ever more precise search results: the essential content should be pointed out to the searcher precisely, even in long documents. The aim is thus to free the searcher from needless browsing of documents. On the Internet and in other electronic publishing, the parts of documents are often marked up with XML so that documents can be processed automatically. XML markup makes it possible to exploit the internal structure of documents. In other words, this markup can be used when developing precision-oriented (focused) retrieval systems and methods. The dissertation deals with precision-oriented retrieval in which explicit XML markup can be exploited. It has two main themes. The first concerns the development, implementation, and evaluation of the XML retrieval system TRIX (Tampere Retrieval and Indexing for XML). The second concerns empirical evaluation methods for focused retrieval systems. The most significant contribution of the first theme is contextualization, in which the scarcity of textual evidence typical of XML retrieval is compensated for during matching by exploiting the content of higher or sibling parts of the XML hierarchy (i.e. the context). The effectiveness of the method is demonstrated empirically. As a result of this research, the term contextualization has become established in the general international vocabulary of the field. The second theme finds that the methods used to measure the effectiveness of focused retrieval are deficient in many ways. To remedy these deficiencies, the dissertation develops more realistic evaluation methods that take into account the context of the returned retrieval units, the reading order, and the effort that browsing costs the user. The metric developed in the study, T2I(300), has been selected as the official metric of the international INEX (Initiative for the Evaluation of XML Retrieval) initiative, a research forum for XML retrieval founded in 2002.
    This dissertation addresses focused retrieval, especially its sub-concept XML (eXtensible Mark-up Language) information retrieval (XML IR). In XML IR, the retrievable units are either individual elements or sets of elements grouped together, typically by a document. These units are ranked according to their estimated relevance by an XML IR system. In traditional information retrieval, the retrievable unit is an atomic document. Due to this atomicity, many core characteristics of the document retrieval paradigm are not appropriate for XML IR. Of these characteristics, this dissertation explores element indexing, scoring and evaluation methods, which form two main themes: (1) element indexing, scoring, and contextualization; and (2) focused retrieval evaluation. To investigate the first theme, an XML IR system based on structural indices is constructed. The structural indices offer analyzing power for studying element hierarchies. The main finding in the system development is the utilization of surrounding elements as supplementary evidence in element scoring. This method is called contextualization, for which we distinguish three models: vertical, horizontal and ad hoc contextualization. The models are tested with the tools provided by (or derived from) the Initiative for the Evaluation of XML Retrieval (INEX). The results indicate that the evidence from element surroundings improves the scoring effectiveness of XML retrieval. The second theme entails a task where the retrievable elements are grouped by a document. The aim of this theme is to create methods that measure XML IR effectiveness in a credible fashion in a laboratory environment. Credibility is pursued by assuming that a user reads in chronological order and becomes frustrated after reading a certain amount of non-relevant material. Novel metrics are created based on these assumptions. The relative rankings of systems measured with the metrics differ from those delivered by contemporary metrics. In addition, the focused retrieval strategies benefit from the novel metrics over traditional full document retrieval.
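Vertical contextualization, as described above, uses the elements surrounding an element as supplementary evidence for its score. The sketch below is one simple way to express that idea and is not the dissertation's actual model: it assumes the context is the plain average of all ancestor scores and that a hypothetical `context_weight` controls how much the context contributes.

```python
def contextualize(element, scores, parent_of, context_weight=0.5):
    """Mix an element's own score with the average score of its ancestors.

    scores: dict mapping element id to its own (content-based) score.
    parent_of: dict mapping element id to its parent id; the root is absent.
    """
    ancestors = []
    node = parent_of.get(element)
    while node is not None:
        ancestors.append(node)
        node = parent_of.get(node)
    if not ancestors:
        # The root has no context, so its score is left unchanged.
        return scores[element]
    context = sum(scores[a] for a in ancestors) / len(ancestors)
    return (1.0 - context_weight) * scores[element] + context_weight * context
```

With this sketch, a short paragraph that matches the query weakly on its own is lifted when its enclosing section and article match strongly, which compensates for the scarcity of textual evidence in small elements.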