2 research outputs found

    Die Sphere-Search-Suchmaschine zur graphbasierten Suche auf heterogenen, semistrukturierten Daten (The SphereSearch engine for graph-based search on heterogeneous, semi-structured data)

    This thesis presents the novel SphereSearch engine, which provides unified ranked retrieval on heterogeneous XML and Web data. Its search capabilities include vague structure and text content conditions, and relevance ranking based on IR statistics and a graph-based data model. Web pages in HTML or PDF are automatically converted into an intermediate XML format, with the option of generating semantic tags by means of linguistic annotation tools. For semi-structured data the graph-based query engine is leveraged to provide very rich search options that cannot be expressed in traditional Web or XML search engines: concept-aware and link-aware querying that takes into account the implicit structure and context of Web pages. The benefits of the SphereSearch engine are demonstrated by experiments with a large and richly tagged but non-schematic open encyclopedia extended with external documents, and with a standard XML benchmark.

    Evaluation of effective XML information retrieval

    XML is being adopted as a common storage format in scientific data repositories, digital libraries, and on the World Wide Web. Accordingly, there is a need for content-oriented XML retrieval systems that can efficiently and effectively store, search and retrieve information from XML document collections. Unlike traditional information retrieval systems where whole documents are usually indexed and retrieved as information units, XML retrieval systems typically index and retrieve document components of varying granularity. To evaluate the effectiveness of such systems, test collections where relevance assessments are provided according to an XML-specific definition of relevance are necessary. Such test collections have been built during four rounds of the INitiative for the Evaluation of XML Retrieval (INEX). There are many different approaches to XML retrieval; most approaches either extend full-text information retrieval systems to handle XML retrieval, or use database technologies that incorporate existing XML standards to handle both XML presentation and retrieval. We present a hybrid approach to XML retrieval that combines text information retrieval features with XML-specific features found in a native XML database. Results from our experiments on the INEX 2003 and 2004 test collections demonstrate the usefulness of applying our hybrid approach to different XML retrieval tasks. A realistic definition of relevance is necessary for meaningful comparison of alternative XML retrieval approaches. The three relevance definitions used by INEX since 2002 comprise two relevance dimensions, each based on topical relevance. We perform an extensive analysis of the two INEX 2004 and 2005 relevance definitions, and show that assessors and users find them difficult to understand. We propose a new definition of relevance for XML retrieval, and demonstrate that a relevance scale based on this definition is useful for XML retrieval experiments. 
Finding the appropriate approach to evaluate XML retrieval effectiveness is the subject of ongoing debate within the XML information retrieval research community. We present an overview of the evaluation methodologies implemented in the current INEX metrics, which reveals that the metrics follow different assumptions and measure different XML retrieval behaviours. We propose a new evaluation metric for XML retrieval and conduct an extensive analysis of the retrieval performance of simulated runs to show what is measured. We compare the evaluation behaviour obtained with the new metric to the behaviours obtained with two of the official INEX 2005 metrics, and demonstrate that the new metric can be used to reliably evaluate XML retrieval effectiveness. To analyse the effectiveness of XML retrieval in different application scenarios, we use evaluation measures in our new metric to investigate the behaviour of XML retrieval approaches under the following two scenarios: the ad-hoc retrieval scenario, exploring the activities carried out as part of the INEX 2005 Ad-hoc track; and the multimedia retrieval scenario, exploring the activities carried out as part of the INEX 2005 Multimedia track. For both application scenarios we show that, although different values for retrieval parameters are needed to achieve the optimal performance, the desired textual or multimedia information can be effectively located using a combination of XML retrieval approaches.