20 research outputs found

    Intuitionistic fuzzy XML query matching and rewriting

    Get PDF
    With the emergence of XML as a standard for data representation, particularly on the web, the need for intelligent query languages that can operate on XML documents with structural heterogeneity has recently gained a lot of popularity. Traditional Information Retrieval and Database approaches have limitations when dealing with such scenarios. Therefore, fuzzy (flexible) approaches have become the predominant. In this thesis, we propose a new approach for approximate XML query matching and rewriting which aims at achieving soft matching of XML queries with XML data sources following different schemas. Unlike traditional querying approaches, which require exact matching, the proposed approach makes use of Intuitionistic Fuzzy Trees to achieve approximate (soft) query matching. Through this new approach, not only the exact answer of a query, but also approximate answers are retrieved. Furthermore, partial results can be obtained from multiple data sources and merged together to produce a single answer to a query. The proposed approach introduced a new tree similarity measure that considers the minimum and maximum degrees of similarity/inclusion of trees that are based on arc matching. New techniques for soft node and arc matching were presented for matching queries against data sources with highly varied structures. A prototype was developed to test the proposed ideas and it proved the ability to achieve approximate matching for pattern queries with a number of XML schemas and rewrite the original query so that it obtain results from the underlying data sources. This has been achieved through several novel algorithms which were tested and proved efficiency and low CPU/Memory cost even for big number of data sources

    Content based dissemination of XML data

    Get PDF
    Ph.DDOCTOR OF PHILOSOPH

    Semantics and efficient evaluation of partial tree-pattern queries on XML

    Get PDF
    Current applications export and exchange XML data on the web. Usually, XML data are queried using keyword queries or using the standard structured query language XQuery the core of which consists of the navigational query language XPath. In this context, one major challenge is the querying of the data when the structure of the data sources is complex or not fully known to the user. Another challenge is the integration of multiple data sources that export data with structural differences and irregularities. In this dissertation, a query language for XML called Partial Tree-Pattern Query (PTPQ) language is considered. PTPQs generalize and strictly contain Tree-Pattern Queries (TPQs) and can express a broad structural fragment of XPath. Because of their expressive power and flexibility, they are useful for querying XML documents the structure of which is complex or not fully known to the user, and for integrating XML data sources with different structures. The dissertation focuses on three issues. The first one is the design of efficient non-main-memory evaluation methods for PTPQs. The second one is the assignment of semantics to PTPQs so that they return meaningful answers. The third one is the development of techniques for answering TPQs using materialized views. Non-main-memory XML query evaluation can be done in two modes (which also define two evaluation models). In the first mode, data is preprocessed and indexes, called inverted lists, are built for it. In the second mode, data are unindexed and arrives continuously in the form of a stream. Existing algorithms cannot be used directly or indirectly to efficiently compute PTPQs in either mode. Initially, the problem of efficiently evaluating partial path queries in the inverted lists model has been addressed. Partial path queries form a subclass of PTPQs which is not contained in the class of TPQs. Three novel algorithms for evaluating partial path queries including a holistic one have been designed. The analytical and experimental results show that the holistic algorithm outperforms the other two. These results have been extended into holistic and non-holistic approaches for PTPQs in the inverted lists model. The experiments show again the superiority of the holistic approach. The dissertation has also addressed the problem of evaluating PTPQs in the streaming model, and two original efficient streaming algorithms for PTPQs have been designed. Compared to the only known streaming algorithm that supports an extension of TPQs, the experimental results show that the proposed algorithms perform better by orders of magnitude while consuming a much smaller fraction of memory space. An original approach for assigning semantics to PTPQs has also been devised. The novel semantics seamlessly applies to keyword queries and to queries with structural restrictions. In contrast to previous approaches that operate locally on data, the proposed approach operates globally on structural summaries of data to extract tree patterns. Compared to previous approaches, an experimental evaluation shows that our approach has a perfect recall both for XML documents with complete and with incomplete data. It also shows better precision compared to approaches with similar recall. Finally, the dissertation has addressed the problem of answering XML queries using exclusively materialized views. An original approach for materializing views in the context of the inverted lists model has been suggested. Necessary and sufficient conditions have been provided for tree-pattern query answerability in terms of view-to-query homomorphisms. A time and space efficient algorithm was designed for deciding query answerability and a technique for computing queries over view materializations using stack- based holistic algorithms was developed. Further, optimizations were developed which (a) minimize the storage space and avoid redundancy by materializing views as bitmaps, and (b) optimize the evaluation of the queries over the views by applying bitwise operations on view materializations. The experimental results show that the proposed approach obtains largely higher hit rates than previous approaches, speeds up significantly the evaluation of queries without using views, and scales very smoothly in terms of storage space and computational overhead

    Child Prime Label Approaches to Evaluate XML Structured Queries

    Get PDF
    The adoption of the eXtensible Markup Language (XML) as the standard format to store and exchange semi-structure data has been gaining momentum. The growing number of XML documents leads to the need for appropriate XML querying algorithms which are able to retrieve XML data efficiently. Due to the importance of twig pattern matching in XML retrieval systems, finding all matching occurrences of a tree pattern query in an XML document is often considered as a specific task for XML databases as well as a core operation in XML query processing. This thesis presents a design and implementation of a new indexing technique, called the Child Prime Label (CPL) which exploits the property of prime numbers to identify Parent-Child (P-C) edges in twig pattern queries (TPQs) during query evaluation. The CPL approach can be incorporated efficiently within the existing labelling schemes. The major contributions of this thesis can be seen as a set of novel twig matching algorithms which apply the CPL approach and focus on reducing the overhead of storing useless elements and performing unnecessary computations during the output enumeration. The research presented here is the first to provide an efficient and general solution for TPQs containing ordering constraints and positional predicates specified by the XML query languages. To evaluate the CPL approaches, the holistic model was implemented as an experimental prototype in which the approaches proposed are compared against state-of-the-art holistic twig algorithms. Extensive performance studies on various real-world and artificial datasets were conducted to demonstrate the significant improvement of the CPL approaches over the previous indexing and querying methods. The experimental results demonstrate the validity and improvements of the new algorithms over other related methods on common various subclasses of TPQs. Moreover, the scalability tests reveal that the new algorithms are more suitable for processing large XML datasets

    Keyword-Based Querying for the Social Semantic Web

    Get PDF
    Enabling non-experts to publish data on the web is an important achievement of the social web and one of the primary goals of the social semantic web. Making the data easily accessible in turn has received only little attention, which is problematic from the point of view of incentives: users are likely to be less motivated to participate in the creation of content if the use of this content is mostly reserved to experts. Querying in semantic wikis, for example, is typically realized in terms of full text search over the textual content and a web query language such as SPARQL for the annotations. This approach has two shortcomings that limit the extent to which data can be leveraged by users: combined queries over content and annotations are not possible, and users either are restricted to expressing their query intent using simple but vague keyword queries or have to learn a complex web query language. The work presented in this dissertation investigates a more suitable form of querying for semantic wikis that consolidates two seemingly conflicting characteristics of query languages, ease of use and expressiveness. This work was carried out in the context of the semantic wiki KiWi, but the underlying ideas apply more generally to the social semantic and social web. We begin by defining a simple modular conceptual model for the KiWi wiki that enables rich and expressive knowledge representation. A component of this model are structured tags, an annotation formalism that is simple yet flexible and expressive, and aims at bridging the gap between atomic tags and RDF. The viability of the approach is confirmed by a user study, which finds that structured tags are suitable for quickly annotating evolving knowledge and are perceived well by the users. The main contribution of this dissertation is the design and implementation of KWQL, a query language for semantic wikis. KWQL combines keyword search and web querying to enable querying that scales with user experience and information need: basic queries are easy to express; as the search criteria become more complex, more expertise is needed to formulate the corresponding query. A novel aspect of KWQL is that it combines both paradigms in a bottom-up fashion. It treats neither of the two as an extension to the other, but instead integrates both in one framework. The language allows for rich combined queries of full text, metadata, document structure, and informal to formal semantic annotations. KWilt, the KWQL query engine, provides the full expressive power of first-order queries, but at the same time can evaluate basic queries at almost the speed of the underlying search engine. KWQL is accompanied by the visual query language visKWQL, and an editor that displays both the textual and visual form of the current query and reflects changes to either representation in the other. A user study shows that participants quickly learn to construct KWQL and visKWQL queries, even when given only a short introduction. KWQL allows users to sift the wealth of structure and annotations in an information system for relevant data. If relevant data constitutes a substantial fraction of all data, ranking becomes important. To this end, we propose PEST, a novel ranking method that propagates relevance among structurally related or similarly annotated data. Extensive experiments, including a user study on a real life wiki, show that pest improves the quality of the ranking over a range of existing ranking approaches

    Seventh Biennial Report : June 2003 - March 2005

    No full text

    Distributed XML Query Processing

    Get PDF
    While centralized query processing over collections of XML data stored at a single site is a well understood problem, centralized query evaluation techniques are inherently limited in their scalability when presented with large collections (or a single, large document) and heavy query workloads. In the context of relational query processing, similar scalability challenges have been overcome by partitioning data collections, distributing them across the sites of a distributed system, and then evaluating queries in a distributed fashion, usually in a way that ensures locality between (sub-)queries and their relevant data. This thesis presents a suite of query evaluation techniques for XML data that follow a similar approach to address the scalability problems encountered by XML query evaluation. Due to the significant differences in data and query models between relational and XML query processing, it is not possible to directly apply distributed query evaluation techniques designed for relational data to the XML scenario. Instead, new distributed query evaluation techniques need to be developed. Thus, in this thesis, an end-to-end solution to the scalability problems encountered by XML query processing is proposed. Based on a data partitioning model that supports both horizontal and vertical fragmentation steps (or any combination of the two), XML collections are fragmented and distributed across the sites of a distributed system. Then, a suite of distributed query evaluation strategies is proposed. These query evaluation techniques ensure locality between each fragment of the collection and the parts of the query corresponding to the data in this fragment. Special attention is paid to scalability and query performance, which is achieved by ensuring a high degree of parallelism during distributed query evaluation and by avoiding access to irrelevant portions of the data. For maximum flexibility, the suite of distributed query evaluation techniques proposed in this thesis provides several alternative approaches for evaluating a given query over a given distributed collection. Thus, to achieve the best performance, it is necessary to predict and compare the expected performance of each of these alternatives. In this work, this is accomplished through a query optimization technique based on a distribution-aware cost model. The same cost model is also used to fine-tune the way a collection is fragmented to the demands of the query workload evaluated over this collection. To evaluate the performance impact of the distributed query evaluation techniques proposed in this thesis, the techniques were implemented within a production-quality XML database system. Based on this implementation, a thorough experimental evaluation was performed. The results of this evaluation confirm that the distributed query evaluation techniques introduced here lead to significant improvements in query performance and scalability both when compared to centralized techniques and when compared to existing distributed query evaluation techniques

    Query Evaluation in the Presence of Fine-grained Access Control

    Get PDF
    Access controls are mechanisms to enhance security by protecting data from unauthorized accesses. In contrast to traditional access controls that grant access rights at the granularity of the whole tables or views, fine-grained access controls specify access controls at finer granularity, e.g., individual nodes in XML databases and individual tuples in relational databases. While there is a voluminous literature on specifying and modeling fine-grained access controls, less work has been done to address the performance issues of database systems with fine-grained access controls. This thesis addresses the performance issues of fine-grained access controls and proposes corresponding solutions. In particular, the following issues are addressed: effective storage of massive access controls, efficient query planning for secure query evaluation, and accurate cardinality estimation for access controlled data. Because fine-grained access controls specify access rights from each user to each piece of data in the system, they are effectively a massive matrix of the size as the product of the number of users and the size of data. Therefore, fine-grained access controls require a very compact encoding to be feasible. The proposed storage system in this thesis achieves an unprecedented level of compactness by leveraging the high correlation of access controls found in real system data. This correlation comes from two sides: the structural similarity of access rights between data, and the similarity of access patterns from different users. This encoding can be embedded into a linearized representation of XML data such that a query evaluation framework is able to compute the answer to the access controlled query with minimal disk I/O to the access controls. Query optimization is a crucial component for database systems. This thesis proposes an intelligent query plan caching mechanism that has lower amortized cost for query planning in the presence of fine-grained access controls. The rationale behind this query plan caching mechanism is that the queries, customized by different access controls from different users, may share common upper-level join trees in their optimal query plans. Since join plan generation is an expensive step in query optimization, reusing the upper-level join trees will reduce query optimization significantly. The proposed caching mechanism is able to match efficient query plans to access controlled query plans with minimal runtime cost. In case of a query plan cache miss, the optimizer needs to optimize an access controlled query from scratch. This depends on accurate cardinality estimation on the size of the intermediate query results. This thesis proposes a novel sampling scheme that has better accuracy than traditional cardinality estimation techniques

    Rank-aware, Approximate Query Processing on the Semantic Web

    Get PDF
    Search over the Semantic Web corpus frequently leads to queries having large result sets. So, in order to discover relevant data elements, users must rely on ranking techniques to sort results according to their relevance. At the same time, applications oftentimes deal with information needs, which do not require complete and exact results. In this thesis, we face the problem of how to process queries over Web data in an approximate and rank-aware fashion

    The Graph Pattern Matching Problem through Parameterized Matching

    Get PDF
    We propose a new approach to solve graph isomorphism using parameterized matching. Parameterized matching is a string matching problem where two strings parameterized-match if there exists a bijective function, on the symbols of the alphabet, that maps one of the strings into the other. Given that parameterized matching is defined for linear structures, we define the concept of graph linearization to represent the topology of a graph as a walk on it. Then, our approach to determine whether two graphs are isomorphic consists of determining whether there exists a walk in one of the graphs that parameterized-matches a linearization of the other graph. Our solution has two main steps: linearization and matching. We develop an efficient linearization algorithm, that generates short linearizations with an approximation guarantee, and develop a graph matching algorithm. We show that this solution also works for subgraph isomorphism, which is the problem of determining whether an input graph H is isomorphic to a subgraph of another input graph G. We evaluate our approach experimentally on graphs of different types and sizes, and compare to the performance of VF2, which is a prominent algorithm for graph isomorphism. Our empirical measurements show that graph linearization finds a matching graph faster than VF2 in many cases, especially in Miyazaki-constructed graphs which are known to be one of the hardest cases for graph isomorphism algorithms. We extend this approach to query attributed graphs. An attributed graph is a graph data structure, in which nodes and edges may have identifiers, types and other attributes. Attributed graphs are used in many application domains, for example to model social networks in which nodes represent people, photos, and postings and edges represent friendship, person-tagged-in-photo and mentioned-in-post relationships. Queries are used to extract information from such graphs. Several graph queries are expressed as graph pattern matching, which is the problem of finding all instances of pattern match query P in a larger attributed graph G. A pattern match query may specify both a graph structure and predicates on the attributes of the graph elements. Clearly, this problem is associated to subgraph isomorphism. Furthermore, we define a more general class of graph queries called generalized pattern queries on attributed multigraphs. The goal of this class is to find paths and subgraphs that satisfy query reachability and predicates. The query language is expressive: It allows (i) using regular expression operators (e.g., Kleene star and union); (ii) specifying structural predicates on graph nodes and edges; and (iii) using attribute predicates on nodes and edges. Pattern match queries, reachability queries, their combination, and even more queries can be expressed through generalized pattern queries. We use our approach to solve this new type of queries. The proposed technique has two phases. First, the query is linearized, i.e., represented as a graph walk that covers all nodes and edges. There are several linearizations for a given query; we derive heuristics to produce a good linearization that is short and places selective predicates early in the linearization. Second, we search for a bijective function that maps each element of the query to an element of the attributed multigraph that satisfies the reachability requirements and the predicates. Specifically, we develop an algorithm that matches the linearization by traversing the attributed graph in a manner similar to a breadth first traversal constrained by the linearization. We evaluate our solution experimentally using a real graph (the DBLP citation network) to assess its practicality and efficiency. Our results show that our techniques and optimizations are effective in querying attributed graphs, offering several factors of reduction in query response time when graph statistics are utilized.Resumen. En esta tesis se propone un nuevo enfoque de solución para resolver el problema de isomorfismo de grafos usando búsqueda parametrizada. La búsqueda parametrizada es un problema de búsqueda de cadenas de texto en el cual dos cadenas coinciden si existe una biyección que mapee los símbolos de una cadena en los símbolos de la otra. Dado que la búsqueda parametrizada está definida para estructuras lineales, se define el concepto de linearización de grafos para representar la topología de un grafo como un camino sobre este. Entonces, la solución para determinar si dos grafos son isomorfos consiste en determinar si existe un camino en uno de los grafos que haga coincidencia parametrizada con la linearización del otro grafo. La solución propuesta tiene dos pasos: linearización y búsqueda. Se presenta un algoritmo eficiente que genera linearizaciones aproximadamente óptimas en longitud, y un algoritmo de búsqueda. Se demuestra que esta solución también resuelve el problema de isomorfismo de subgrafos, en el cual se determina si un grafo H es isomorfo a un subgrafo de otro grafo G. Se evaluó experimentalmente la solución con grafos de diferentes tipos y tamaños. Se comparó su desempeño con el de VF2, el cual es un algoritmo competitivo de isomorfismo de grafos. Los resultados experimentales muestran que la solución propuesta es más eficiente que VF2 en varios casos, en especial en grafos Miyazaki, los cuales se caracterizan por constituir uno de los casos más difíciles para los algoritmos de isomorfismo de grafos. Este enfoque de solución se extiende para resolver consultas sobre grafos semánticos. Un grafo semántico es un grafo en el cual los nodes y arcos pueden tener identificadores, tipos y otros atributos. Estos grafos tienen aplicaciones importantes en diversas áreas, como por ejemplo para modelar redes sociales en las que los nodos representan personas, fotos y publicaciones y los arcos representan relaciones de amistad, etiquetado y mención. Se usan consultas para extraer información de estos grafos. Varias de estas consultas se expresan como búsqueda de patrones, la cual consiste en encontrar las coincidencias del grafo patrón P en un grafo semántico G. El grafo patrón especifica tanto la estructura de las coincidencias a encontrar, como los predicados sobre los atributos que deben cumplir los nodos y los arcos de las mismas. Claramente, este problema está asociado al isomorfismo de subgrafos. Además, se define un tipo de consultas más general sobre grafos semánticos. Estos nuevos patrones se denominan grafos patrón generalizados. El objetivo de estos es encontrar caminos y subgrafos que satisfagan ciertos requisitos semánticos, de estructura y de alcance. Estos patrones son expresivos, pues permiten (i) usar operadores de expresiones regulares (e.g., la estrella de Kleene y la unión); (ii) especificar predicados estructurales en los nodos y arcos; y (iii) evaluar predicados sobre los atributos de los nodos y arcos. Los patrones grafo tradicionales, las consultas de alcance, la combinación de estos y más se pueden representar a través de grafos patrón generalizados. Se usa el enfoque de solución propuesto para resolver los grafos patrón generalizados. La solución tiene dos fases. Primero, el patrón es linearizado, i.e., representado como un camino que incluye todos sus nodos y arcos. Hay muchas linearizaciones para un patrón dado; se proponen heurísticas para producir linearizaciones cortas que ubican los predicados selectivos al comienzo. Segundo, se busca una función biyectiva que mapee cada nodo en el patrón a un nodo en el grafo semántico que satisfaga los requisitos de alcance y los predicados. Específicamente, se propone un algoritmo de búsqueda que recorre el grafo semántico siguiendo una búsqueda en amplitud restringida por la linearización. La solución se evaluó experimentalmente usando un grafo semántico real (la red de citaciones DBLP) para evaluar su practicidad y eficiencia. Los resultados experimentales muestran que las técnicas y optimizaciones propuestas son efectivas en consultar grafos semánticos, ofreciendo un alto factor de reducción en el tiempo de ejecución cuando se utilizan las estadísticas del grafo semántico.Doctorad
    corecore