13 research outputs found

    Flexible and efficient IR using array databases

    The Matrix Framework is a recent proposal by IR researchers to flexibly represent all important information retrieval models in a single multi-dimensional array framework. Computational support for exactly this framework is provided by the array database system SRAM (Sparse Relational Array Mapping), which works on top of a DBMS. Information retrieval models can be specified in its comprehension-based array query language in a way that directly corresponds to the underlying mathematical formulas. SRAM efficiently stores sparse arrays in (compressed) relational tables, and translates and optimizes array queries into relational queries. In this work, we describe a number of array query optimization rules and demonstrate their effect on text retrieval in the TREC TeraByte track (TREC-TB) efficiency task, using the Okapi BM25 model as our example. It turns out that these optimization rules enable SRAM to automatically translate the BM25 array queries into the relational equivalent of inverted list processing, including compression, score materialization and quantization, as employed by custom-built IR systems. The use of the high-performance MonetDB/X100 relational backend, which provides transparent database compression, allows the system to achieve very fast response times with good precision and low resource usage.
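
To make the BM25 reference above concrete, the following is a minimal sketch of BM25 scoring over a sparse term-frequency structure. It is not SRAM's array query language or the paper's implementation. The function name `bm25_scores`, the parameter defaults `k1=1.2` and `b=0.75`, and the non-negative IDF variant `log(1 + (N - df + 0.5) / (df + 0.5))` are assumptions chosen for illustration.

```python
import math
from collections import Counter


def bm25_scores(docs, query, k1=1.2, b=0.75):
    """Score each document (a list of tokens) against the query tokens.

    Hypothetical helper: k1, b and the IDF variant are illustrative
    assumptions, not values taken from the abstract above.
    Assumes `docs` is non-empty.
    """
    n_docs = len(docs)
    doc_len = [len(d) for d in docs]
    avg_len = sum(doc_len) / n_docs
    # Sparse term-frequency "array": tf[d][t] is stored only where non-zero.
    tf = [Counter(d) for d in docs]
    # Document frequency df[t]: number of documents containing term t.
    df = Counter(t for counts in tf for t in counts)

    scores = [0.0] * n_docs
    for t in set(query):
        if t not in df:
            continue  # term absent from the collection contributes nothing
        idf = math.log(1 + (n_docs - df[t] + 0.5) / (df[t] + 0.5))
        for d in range(n_docs):
            f = tf[d][t]  # Counter returns 0 for missing terms
            if f == 0:
                continue
            # Length-normalised term-frequency saturation.
            norm = f + k1 * (1 - b + b * doc_len[d] / avg_len)
            scores[d] += idf * f * (k1 + 1) / norm
    return scores


docs = [
    ["array", "database", "query"],
    ["text", "retrieval", "query"],
    ["array", "array", "index"],
]
scores = bm25_scores(docs, ["array"])
```

In this toy collection the second document does not contain the query term and scores zero, while the third document, which repeats the term, scores higher than the first. The inner loop touches only non-zero term frequencies, which is the same sparsity that inverted list processing exploits.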

    Techniques for improving efficiency and scalability for the integration of information retrieval and databases

    This PhD thesis addresses the integration of Information Retrieval (IR) and Databases (DB), with a particular focus on improving the efficiency and scalability of integrated IR and DB technology (IR+DB). The main purpose of this study is to develop efficient and scalable techniques for supporting integrated IR and DB technology, which is a popular approach today for handling complex queries over text and structured data. Our specific interest is how to efficiently handle queries over large-scale text and structured data. The work is based on a technology that integrates probability theory and relational algebra, where retrieval over text and data is expressed in probabilistic logical programs such as probabilistic relational algebra (PRA) or probabilistic Datalog. To support efficient processing of probabilistic logical programs, we propose three optimization techniques covering the logical and physical layers: scoring-driven query optimization using scoring expressions, query processing with a top-k incorporated pipeline, and indexing with a relational inverted index. First, scoring expressions are proposed for expressing the scoring or probabilistic semantics of the scoring functions implied by PRA expressions, so that an efficient query execution plan can be generated by a rule-based, scoring-driven optimizer. Second, to balance efficiency and effectiveness and thereby improve query response time, we study methods for incorporating top-k algorithms into the pipelined query execution engine of IR+DB systems. Third, the proposed relational inverted index integrates an IR-style inverted index and a DB-style tuple-based index, and can be used to support efficient probability estimation and aggregation as well as conventional relational operations. Experiments were carried out to investigate the performance of the proposed techniques. The results show that the efficiency and scalability of an IR+DB prototype were improved, and that the system can handle queries efficiently on considerably large data sets for a number of IR tasks.

    Database support for large-scale multimedia retrieval

    With the increasing proliferation of recording devices and the resulting abundance of multimedia data available nowadays, searching and managing these ever-growing collections becomes more and more difficult. In supporting retrieval tasks within large multimedia collections, not only the sheer size but also the complexity of the data and their associated metadata pose great challenges, in particular from a data management perspective. Conventional approaches to this task have had only limited success, particularly due to the lack of support for the given data and the required query paradigms. In the area of multimedia research, the missing support for efficiently and effectively managing multimedia data and metadata has recently been recognised as a stumbling block that constrains further developments in the field. In this thesis, we bridge the gap between the database and the multimedia retrieval research areas. We approach the problem of providing a data management system geared towards large collections of multimedia data and the corresponding query paradigms. To this end, we identify the necessary building blocks for a multimedia data management system which adopts the relational data model and the vector-space model. In essence, we make the following main contributions towards a holistic model of a database system for multimedia data: We introduce an architectural model describing a data management system for multimedia data from a system architecture perspective. We further present a data model which supports the storage of multimedia data and the corresponding metadata, and provides similarity-based search operations. We describe an extensive query model for a very broad range of different query paradigms, specifying both logical and executional aspects of a query. Moreover, we consider the efficiency and scalability of the system in a distribution and a storage model, and provide a large and diverse set of index structures for high-dimensional data coming from the vector-space model. The developed models crystallise into the scalable multimedia data management system ADAMpro, which has been implemented within the iMotion/vitrivr retrieval stack. We quantitatively evaluate our concepts on collections that exceed the current state of the art. The results underline the benefits of our approach and assist in understanding the role of the introduced concepts. Moreover, the findings provide important implications for future research in the field of multimedia data management.

    Die Sphere-Search-Suchmaschine zur graphbasierten Suche auf heterogenen, semistrukturierten Daten (The SphereSearch search engine for graph-based search on heterogeneous, semi-structured data)

    This thesis presents the novel SphereSearch Engine, which provides unified ranked retrieval on heterogeneous XML and Web data. Its search capabilities include vague structure and text content conditions, and relevance ranking based on IR statistics and a graph-based data model. Web pages in HTML or PDF are automatically converted into an intermediate XML format, with the option of generating semantic tags by means of linguistic annotation tools. For semi-structured data, the graph-based query engine is leveraged to provide very rich search options that cannot be expressed in traditional Web or XML search engines: concept-aware and link-aware querying that takes into account the implicit structure and context of Web pages. The benefits of the SphereSearch engine are demonstrated by experiments with a large and richly tagged but non-schematic open encyclopedia extended with external documents, and with a standard XML benchmark.

    TopX : efficient and versatile top-k query processing for text, structured, and semistructured data

    TopX is a top-k retrieval engine for text and XML data. Unlike Boolean engines, it stops query processing as soon as it can safely determine the k top-ranked result objects according to a monotonic score aggregation function with respect to a multidimensional query. The main contributions of the thesis fall into four areas, each confirmed by previous publications at international conferences or workshops: (1) top-k query processing with probabilistic guarantees; (2) index-access optimized top-k query processing; (3) dynamic and self-tuning, incremental query expansion for top-k query processing; (4) efficient support for ranked XML retrieval and full-text search. Our experiments demonstrate the viability and improved efficiency of our approach compared to existing related work for a broad variety of retrieval scenarios.
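
The early-stopping idea described above can be sketched with the classic threshold algorithm for top-k aggregation. This is a textbook stand-in and not TopX's own algorithm, which the abstract says adds probabilistic guarantees and index-access optimization. The function name `threshold_topk`, the dictionary-based score lists, and the use of summation as the monotonic aggregation function are assumptions for illustration; scores are assumed to be non-negative.

```python
import heapq


def threshold_topk(lists, k):
    """Return (top-k [(total, doc_id)], depth read) under summed scores.

    `lists` holds one {doc_id: score} dict per query term and must be
    non-empty. Scores are assumed non-negative, so a document missing
    from a list contributes 0. Hypothetical helper for illustration.
    """
    # Sorted access: each list in descending score order.
    sorted_lists = [sorted(l.items(), key=lambda kv: -kv[1]) for l in lists]
    top = []      # min-heap of (total_score, doc_id) with the best k so far
    seen = set()
    depth = 0
    max_depth = max(len(s) for s in sorted_lists)
    while depth < max_depth:
        threshold = 0.0
        for s in sorted_lists:
            if depth >= len(s):
                continue  # exhausted list: unseen documents score 0 here
            doc, score = s[depth]
            threshold += score
            if doc not in seen:
                seen.add(doc)
                # Random access: fetch the document's score in every list.
                total = sum(l.get(doc, 0.0) for l in lists)
                if len(top) < k:
                    heapq.heappush(top, (total, doc))
                elif total > top[0][0]:
                    heapq.heapreplace(top, (total, doc))
        depth += 1
        # No unseen document can exceed the threshold, so stop once the
        # k-th best total already reaches it.
        if len(top) == k and top[0][0] >= threshold:
            break
    return sorted(top, reverse=True), depth


term_lists = [
    {"d1": 0.9, "d2": 0.5, "d3": 0.1, "d4": 0.05},
    {"d2": 0.8, "d1": 0.7, "d4": 0.2, "d3": 0.1},
]
result, depth = threshold_topk(term_lists, k=2)
```

On this toy input the loop stops after reading two of the four positions in each list, because the sum of the scores at the current positions bounds the total of any document not yet seen. The stopping test relies on the aggregation function being monotonic, which is the condition the abstract names.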