8 research outputs found

    A survey on tree matching and XML retrieval

    With the increasing number of available XML documents, numerous approaches for retrieval have been proposed in the literature. They usually use the tree representation of documents and queries to process them, whether implicitly or explicitly. Although retrieving XML documents can be considered a tree matching problem between the query tree and the document trees, only a few approaches take advantage of the algorithms and methods proposed in graph theory. In this paper, we study the theoretical approaches proposed in the literature for tree matching and examine how these approaches have been adapted to XML querying and retrieval, from both an exact and an approximate matching perspective. This study allows us to highlight theoretical aspects of graph theory that have not yet been explored in XML retrieval.

    The Role of Context in Matching and Evaluation of XML Information Retrieval

    With the growth of electronic collections, the increasingly everyday nature of searching, and the spread of mobile devices, one goal of developing information retrieval methods is to achieve ever more precise search results; even in long documents, the essential content should be pointed out to the searcher precisely. The aim is thus to free the searcher from unnecessary browsing of documents. On the Internet and in other electronic publishing, the parts of documents are often marked up in XML for automatic processing. XML markup makes it possible to exploit the internal structure of documents. In other words, this markup can be exploited when developing precision-oriented (focused) information retrieval systems and methods. The dissertation addresses precision-oriented information retrieval in which explicit XML markup can be exploited. The dissertation has two main themes, the first of which concerns the development, implementation, and evaluation of the XML retrieval system TRIX (Tampere Retrieval and Indexing for XML). The second theme concerns empirical evaluation methods for focused retrieval systems. The most significant contribution of the first theme is contextualization, in which the scarcity of textual evidence typical of XML retrieval is compensated for in matching by exploiting the content of higher-level or sibling parts of the XML hierarchy (i.e., the context). The effectiveness of the method is demonstrated empirically. As a result of the research, the term contextualization has become established in the field's general international vocabulary. The second theme finds that the methods used to measure the effectiveness of focused retrieval are deficient in many ways. To remedy these deficiencies, the dissertation develops more realistic evaluation methods that take into account the context of the returned retrieval units, the reading order, and the effort that browsing costs the user. The metric developed in the study, T2I(300), has been selected as the main metric in the international INEX (Initiative for the Evaluation of XML Retrieval) project, a research forum for XML retrieval founded in 2002.
This dissertation addresses focused retrieval, especially its sub-concept XML (eXtensible Mark-up Language) information retrieval (XML IR). In XML IR, the retrievable units are either individual elements or sets of elements grouped together, typically by a document. An XML IR system ranks these units according to their estimated relevance. In traditional information retrieval, the retrievable unit is an atomic document. Because of this atomicity, many core characteristics of the document retrieval paradigm are not appropriate for XML IR. Of these characteristics, this dissertation explores element indexing, scoring, and evaluation methods, which form two main themes: (1) element indexing, scoring, and contextualization; and (2) focused retrieval evaluation. To investigate the first theme, an XML IR system based on structural indices is constructed. The structural indices offer analyzing power for studying element hierarchies. The main finding in the system development is the utilization of surrounding elements as supplementary evidence in element scoring. This method is called contextualization, for which we distinguish three models: vertical, horizontal, and ad hoc contextualization. The models are tested with the tools provided by (or derived from) the Initiative for the Evaluation of XML Retrieval (INEX). The results indicate that the evidence from element surroundings improves the scoring effectiveness of XML retrieval. The second theme entails a task where the retrievable elements are grouped by a document. The aim of this theme is to create methods that measure XML IR effectiveness credibly in a laboratory environment. 
Credibility is pursued by assuming a chronological reading order together with a point at which the user becomes frustrated after reading a certain amount of non-relevant material. Novel metrics are created based on these assumptions. The relative rankings of systems measured with these metrics differ from those delivered by contemporary metrics. In addition, focused retrieval strategies fare better under the novel metrics than traditional full-document retrieval.

    Ranking for Web Data Search Using On-The-Fly Data Integration

    Ranking - the algorithmic decision on how relevant an information artifact is for a given information need and the sorting of artifacts by their inferred relevance - is an integral part of every search engine. In this book we investigate how structured Web data can be leveraged for ranking with the goal of improving the effectiveness of search. We propose new solutions for ranking using on-the-fly data integration and experimentally analyze and evaluate them against the latest baselines.

    Eight Biennial Report : April 2005 – March 2007

    TopX 2.0 at the INEX 2009 Ad-Hoc and Efficiency Tracks

    This paper presents the results of our INEX 2009 Ad-hoc and Efficiency track experiments. While our scoring model remained almost unchanged in comparison to previous years, we focused on a complete redesign of our XML indexing component to meet the increased need for scalability that came with the new 2009 INEX Wikipedia collection, which is about 10 times larger than the previous INEX collection. TopX now supports a CAS-specific distributed index structure, with a completely parallel execution of all indexing steps, including parsing, sampling of term statistics for our element-specific BM25 ranking model, as well as sorting and compressing the index lists for our final inverted block-index. Overall, TopX ranked among the top 3 systems in both the Ad-hoc and Efficiency tracks, with a maximum value of 0.61 for iP[0.01] and 0.29 for MAiP in focused retrieval mode at the Ad-hoc track. Our fastest runs achieved an average runtime of 72 ms per CO query and 235 ms per CAS query at the Efficiency track.
