599 research outputs found

    Content-Aware DataGuides for Indexing Large Collections of XML Documents

    XML is well-suited for modelling structured data with textual content. However, most indexing approaches perform structure and content matching independently, combining the retrieved path and keyword occurrences in a third step. This paper shows that retrieval in XML documents can be accelerated significantly by processing text and structure simultaneously during all retrieval phases. To this end, the Content-Aware DataGuide (CADG) enhances the well-known DataGuide with (1) simultaneous keyword and path matching and (2) a precomputed content/structure join. Extensive experiments show the CADG to be 50-90% faster than the DataGuide for various kinds of queries and documents, including difficult cases such as poorly structured queries and recursive document paths. A new query classification scheme identifies precise query characteristics with a predominant influence on the performance of the individual indices. The experiments show that the CADG is applicable to many real-world applications, in particular large collections of heterogeneously structured XML documents.
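
    The idea of checking keywords and paths against one structure, rather than joining two separate result lists afterwards, can be illustrated with a toy path summary. This is only a minimal sketch under stated assumptions, not the CADG from the paper: the function names `build_summary` and `candidate_paths`, the sample document, and the choice to store a keyword set per label path are all hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def build_summary(xml_text):
    """Map each distinct root-to-element label path to the keywords found
    in the direct text of elements on that path (tail text is ignored)."""
    summary = defaultdict(set)

    def walk(elem, prefix):
        path = prefix + "/" + elem.tag
        summary[path].update((elem.text or "").lower().split())
        for child in elem:
            walk(child, path)

    walk(ET.fromstring(xml_text), "")
    return summary


def candidate_paths(summary, path_suffix, keyword):
    """Return label paths that end with path_suffix AND carry the keyword,
    testing structure and content in the same pass over the summary."""
    return sorted(
        path
        for path, words in summary.items()
        if path.endswith(path_suffix) and keyword.lower() in words
    )


# Hypothetical sample document, used only for illustration.
DOC = (
    "<lib><book><title>XML indexing</title></book>"
    "<article><title>Graph queries</title></article></lib>"
)
summary = build_summary(DOC)
print(candidate_paths(summary, "/title", "xml"))
```

    Because the keyword sets live on the path entries themselves, a path that cannot contain the keyword is discarded without a separate content lookup, which is the intuition behind matching text and structure together.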

    WAQS: a web-based approximate query system

    The Web is often viewed as a gigantic database that holds vast stores of information and provides ubiquitous accessibility to end-users. Since its inception, the Internet has experienced explosive growth both in the number of users and in the amount of content available on it. However, searching for information on the Web has become increasingly difficult. Although query languages have long been part of database management systems, the standard query language, the Structured Query Language (SQL), is not suitable for Web content retrieval. In this dissertation, a new technique for document retrieval on the Web is presented. This technique is designed to allow more detailed retrieval and hence reduce the number of matches returned by typical search engines. The main objective of this technique is to allow the query to be based not just on keywords but also on the location of the keywords within the logical structure of a document. In addition, the technique provides approximate search capabilities based on the notions of Distance and Variable Length Don't Cares. The proposed techniques have been implemented in a system called Web-Based Approximate Query System, which contains an SQL-like query language called Web-Based Approximate Query Language. Web-Based Approximate Query Language has also been integrated with EnviroDaemon, an environmental domain-specific search engine. It provides EnviroDaemon with more detailed searching capabilities than keyword-based search alone. Implementation details, technical results and future work are presented in this dissertation.

    Flexible queries in distributed structured and semistructured documents

    The goal of the thesis is to define a flexible query language for distributed structured and semistructured data and to implement this language. For example, data can be represented as HTML files in the file system, whereby an XML document contains metadata about these files. Hence, these two data sources have to be merged appropriately. In order to satisfy the needs of a heterogeneous group of users, a flexible, adaptable query language is required. Therefore a query model is developed which serves as a basis for an abstract syntax for query languages; this abstract syntax can be translated into concrete textual as well as graphical query languages. The ideas proposed in the thesis are implemented in the area of virtual courses, specifically in the project "Virtual University of Applied Sciences".

    Graph Summarization

    The continuous and rapid growth of highly interconnected datasets, which are both voluminous and complex, calls for the development of adequate processing and analytical techniques. One method for condensing and simplifying such datasets is graph summarization. It denotes a series of application-specific algorithms designed to transform graphs into more compact representations while preserving structural patterns, query answers, or specific property distributions. As this problem is common to several areas studying graph topologies, different approaches, such as clustering, compression, sampling, or influence detection, have been proposed, primarily based on statistical and optimization methods. Our chapter pinpoints the main graph summarization methods, with particular emphasis on the most recent approaches and novel research trends on this topic, not yet covered by previous surveys. Comment: To appear in the Encyclopedia of Big Data Technologies.

    An efficient and scalable algorithm for clustering XML documents by structure


    Querying websites using compact skeletons

    Several commercial applications, such as online comparison shopping and process automation, require integrating information that is scattered across multiple websites or XML documents. Much research has been devoted to this problem, resulting in several research prototypes and commercial implementations. Such systems rely on wrappers that provide relational or other structured interfaces to websites. Traditionally, wrappers have been constructed by hand on a per-website basis, constraining the scalability of the system. We introduce a website structure inference mechanism called compact skeletons that is a step in the direction of automated wrapper generation. Compact skeletons provide a transformation from websites or other hierarchical data, such as XML documents, to relational tables. We study several classes of compact skeletons and provide polynomial-time algorithms and heuristics for automated construction of compact skeletons from websites. Experimental results show that our heuristics work well in practice. We also argue that compact skeletons are a natural extension of commercially deployed techniques for wrapper construction.

    Answering Regular Path Queries Under Approximate Semantics in Lightweight Description Logics

    Classical regular path queries (RPQs) can be too restrictive for some applications, and answering such queries under approximate semantics to relax the query is desirable. While algorithms are available for answering regular path queries over graph databases under approximate semantics, such algorithms are scarce for the ontology-mediated setting. In this paper we extend, in two ways, an approach for answering RPQs over graph databases that uses weighted transducers to approximate paths from the query. The first extension is to answering approximate conjunctive 2-way regular path queries (C2RPQs) over graph databases, and the second is to answering C2RPQs over ELH and DL-Lite_R ontologies. We provide results on the computational complexity of the underlying reasoning problems and devise approximate query answering algorithms.
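
    To make the notion of relaxing a path query by a weighted cost concrete, the following is a deliberately restricted sketch and not the transducer-based algorithm of the paper. It assumes the query is a fixed sequence of edge labels (a special case of a regular expression), charges a cost of 1 for each edge whose label differs from the expected one, and ignores ontologies entirely. The function name `approx_path_answers`, the `max_cost` parameter, and the sample graph are hypothetical.

```python
import heapq


def approx_path_answers(edges, start, labels, max_cost):
    """Return {node: minimal number of label substitutions} for every node
    reachable from start by a path of len(labels) edges within max_cost.

    edges is a list of (source, label, target) triples.
    """
    adjacency = {}
    for src, label, dst in edges:
        adjacency.setdefault(src, []).append((label, dst))

    # Dijkstra over states (node, position in the label sequence).
    best = {(start, 0): 0}
    heap = [(0, start, 0)]
    answers = {}
    while heap:
        cost, node, pos = heapq.heappop(heap)
        if cost > best.get((node, pos), float("inf")):
            continue  # stale heap entry
        if pos == len(labels):
            answers[node] = min(cost, answers.get(node, cost))
            continue
        for label, dst in adjacency.get(node, []):
            new_cost = cost + (0 if label == labels[pos] else 1)
            state = (dst, pos + 1)
            if new_cost <= max_cost and new_cost < best.get(state, float("inf")):
                best[state] = new_cost
                heapq.heappush(heap, (new_cost, dst, pos + 1))
    return answers


# Hypothetical sample graph, used only for illustration.
EDGES = [
    ("a", "knows", "b"),
    ("b", "worksAt", "c"),
    ("b", "studiesAt", "d"),
]
print(approx_path_answers(EDGES, "a", ["knows", "worksAt"], max_cost=1))
```

    With `max_cost=0` the function degenerates to exact matching of the label sequence; raising the bound admits answers that the classical semantics would reject, each annotated with how far it deviates from the query.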