1,078 research outputs found

    Prime Number-Based Hierarchical Data Labeling Scheme for Relational Databases

    Get PDF
    Hierarchical data structures are an important aspect of many computer science fields including data mining, terrain modeling, and image analysis. A good representation of such data accurately captures the parent-child and ancestor-descendent relationships between nodes. There exist a number of different ways to capture and manage hierarchical data while preserving such relationships. For instance, one may use a custom system designed for a specific kind of hierarchy. Object oriented databases may also be used to model hierarchical data. Relational database systems, on the other hand, add an additional benefit of mature mathematical theory, reliable implementations, superior functionality and scalability. Relational databases were not originally designed with hierarchical data management in mind. As a result, abstract information can not be natively stored in database relations. Database labeling schemes resolve this issue by labeling all nodes in a way that reveals their relationships. Labels usually encode the node's position in a hierarchy as a number or a string that can be stored, indexed, searched, and retrieved from a database. Many different labeling schemes have been developed in the past. All of them may be classified into three broad categories: recursive expansion, materialized path, and nested sets. Each model has its strengths and weaknesses. Each model implementation attempts to reduce the number of weaknesses inherent to the respective model. One of the most prominent implementations of the materialized path model uses the unique characteristics of prime numbers for its labeling purposes. However, the performance and space utilization of this prime number labeling scheme could be significantly improved. This research introduces a new scheme called reusable prime number labeling (rPNL) that reduces the effects of the mentioned weaknesses. The proposed scheme advantage is discussed in detail, proven mathematically, and experimentally confirmed

    A Survey on Mapping Semi-Structured Data and Graph Data to Relational Data

    Get PDF
    The data produced by various services should be stored and managed in an appropriate format for gaining valuable knowledge conveniently. This leads to the emergence of various data models, including relational, semi-structured, and graph models, and so on. Considering the fact that the mature relational databases established on relational data models are still predominant in today's market, it has fueled interest in storing and processing semi-structured data and graph data in relational databases so that mature and powerful relational databases' capabilities can all be applied to these various data. In this survey, we review existing methods on mapping semi-structured data and graph data into relational tables, analyze their major features, and give a detailed classification of those methods. We also summarize the merits and demerits of each method, introduce open research challenges, and present future research directions. With this comprehensive investigation of existing methods and open problems, we hope this survey can motivate new mapping approaches through drawing lessons from eachmodel's mapping strategies, aswell as a newresearch topic - mapping multi-model data into relational tables.Peer reviewe

    Level based labeling scheme for extensible markup language (XML) data processing

    Get PDF
    Thesis (Master)--Izmir Institute of Technology, Computer Engineering, Izmir, 2010Includes bibliographical references (leaves: 56-57)Text in English; Abstract: Turkish and Englishx, 70 leavesWith the continuous growth of data in businesses and the increasing demand for reaching that data immediately, raised the need of having real time data warehouses. In order to provide such a system, the ETL mechanism will need to be very efficient on updating data. From the literature surveys, it has been observed that there are many studies performed on efficient update of the relational data, while there is limited amount of study on updating the XML data. With the extensible structure and effective performance on data exchange, the usage of XML data structure is increasing day by day. Like relational databases, real time XML databases also need to be updated continuously. The hierarchic characteristic of XML required the usage of tree representations for indexing the data since they provide necessary means to capture different relationships between the nodes. The principal purpose of this study is to define and compare algorithms which label the XML tree with an effective update mechanism. Proposed labeling algorithms aim to provide a mechanism to query and update the XML data by defining all relations between the nodes. In the experimental evaluation part of this thesis, all algorithms is examined and tested with an existing labeling algorithm

    A Bi-Labeling Based XPath Processing System

    Get PDF
    We present BLAS, a Bi-LAbeling based XPath processing System. BLAS uses two labeling schemes to speed up query processing: P-labeling for processing consecutive child (or parent) axis traversals, and D-labeling for processing descendant (or ancestor) axis traversals. XML data are stored in labeled form and indexed. Algorithms are presented for translating XPath queries to SQL expressions. BLAS reduces the number of joins in the SQL query translated from a given XPath query and reduces the number of disk accesses required to execute the SQL query compared with the traditional XPath processing using D-labeling alone. We also propose an approximate P-labeling scheme and the corresponding query translation algorithm to handle XML data trees that contain a large number of distinct tag names, and/or are very deep. This extension captures a spectrum of XPath-to-SQL query translation schemes, ranging from existing schemes that do not use P-labels to the one that uses exact P-labels. Experimental results demonstrate the efficiency of the BLAS system

    Efficiency and effectiveness of XML keyword search using a full element index

    Get PDF
    Ankara : The Department of Computer Engineering and the Institute of Engineering and Science of Bilkent University, 2010.Thesis (Master's) -- Bilkent University, 2010.Includes bibliographical references leaves 63-67.In the last decade, both the academia and industry proposed several techniques to allow keyword search on XML databases and document collections. A common data structure employed in most of these approaches is an inverted index, which is the state-of-the-art for conducting keyword search over large volumes of textual data, such as world wide web. In particular, a full element-index considers (and indexes) each XML element as a separate document, which is formed of the text directly contained in it and the textual content of all of its descendants. A major criticism for a full element-index is the high degree of redundancy in the index (due to the nested structure of XML documents), which diminishes its usage for large-scale XML retrieval scenarios. As the rst contribution of this thesis, we investigate the e ciency and e ectiveness of using a full element-index for XML keyword search. First, we suggest that lossless index compression methods can signi cantly reduce the size of a full element-index so that query processing strategies, such as those employed in a typical search engine, can e ciently operate on it. We show that once the most essential problem of a full element-index, i.e., its size, is remedied, using such an index can improve both the result quality (e ectiveness) and query execution performance (e ciency) in comparison to other recently proposed techniques in the literature. Moreover, using a full element-index also allows generating query results in di erent forms, such as a ranked list of documents (as expected by a search engine user) or a complete list of elements that include all of the query terms (as expected by a DBMS user), in a uni ed framework. As a second contribution of this thesis, we propose to use a lossy approach, static index pruning, to further reduce the size of a full element-index. In this way, we aim to eliminate the repetition of an element's terms at upper levels in an adaptive manner considering the element's textual content and search system's ranking function. That is, we attempt to remove the repetitions in the index only when we expect that removal of them would not reduce the result quality. We conduct a well-crafted set of experiments and show that pruned index les are comparable or even superior to the full element-index up to very high pruning levels for various ad hoc tasks in terms of retrieval e ectiveness. As a nal contribution of this thesis, we propose to apply index pruning strategies to reduce the size of the document vectors in an XML collection to improve the clustering performance of the collection. Our experiments show that for certain cases, it is possible to prune up to 70% of the collection (or, more speci cally, underlying document vectors) and still generate a clustering structure that yields the same quality with that of the original collection, in terms of a set of evaluation metrics.Atılgan, DuyguM.S

    Clustering-based Labelling Scheme - A Hybrid Approach for Efficient Querying and Updating XML Documents

    Get PDF
    Extensible Markup Language (XML) has become a dominant technology for transferring data through the worldwide web. The XML labelling schemes play a key role in handling XML data efficiently and robustly. Thus, many labelling schemes have been proposed. However, these labelling schemes have limitations and shortcomings. Thus, the aim of this research was to investigate the existing XML labelling schemes and their limitations in order to address the issue of efficiency of XML query performance. This thesis investigated the existing labelling schemes and classified them into three categories based on certain criteria, in order to identify the limitations and challenges of these labelling schemes. Based on the outcomes of this investigation, this thesis proposed a state-of-theart labelling scheme, called clustering-based labelling scheme, to resolve or improve the key limitations such as the efficiency of the XML query processing, labelling XML nodes, and XML updates cost. This thesis argued that using certain existing labelling schemes to label nodes, and using the clustering-based techniques can improve query and labelling nodes efficiency. Theoretically, the proposed scheme is based on dividing the nodes of an XML document into clusters. Two existing labelling schemes, which are the Dewey and LLS labelling schemes, were selected for labelling these clusters and their nodes. Subsequently, the proposed scheme was designed and implemented. In addition, the Dewey and LLS labelling scheme were implemented for the purpose of evaluating the proposed scheme. Subsequently, four experiments were designed in order to test the proposed scheme against the Dewey and LLS labelling schemes. The results of these experiments suggest that the proposed scheme achieved better results than the Dewey and LLS schemes. Consequently, the research hypothesis was accepted overall with few exceptions, and the proposed scheme showed an improvement in the performance and all the targeted features and aspects

    XML retrieval using pruned element-index files

    Get PDF
    An element-index is a crucial mechanism for supporting content-only (CO) queries over XML collections. A full element-index that indexes each element along with the content of its descendants involves a high redundancy and reduces query processing efficiency. A direct index, on the other hand, only indexes the content that is directly under each element and disregards the descendants. This results in a smaller index, but possibly in return to some reduction in system effectiveness. In this paper, we propose using static index pruning techniques for obtaining more compact index files that can still result in comparable retrieval performance to that of a full index. We also compare the retrieval performance of these pruning based approaches to some other strategies that make use of a direct element-index. Our experiments conducted along with the lines of INEX evaluation framework reveal that pruned index files yield comparable to or even better retrieval performance than the full index and direct index, for several tasks in the ad hoc track. © 2010 Springer-Verlag Berlin Heidelberg

    Order based labeling scheme for dynamic XML (extensible markup language) query processing

    Get PDF
    Thesis (Master)--Izmir Institute of Technology, Computer Engineering, Izmir, 2012Includes bibliographical references (leaves: 43-46)Text in English; Abstract: Turkish and Englishix, 55 leavesNeed for robust and high performance XML database systems increased due to growing XML data produced by today’s applications. Like indexes in relational databases, XML labeling is the key to XML querying. Assigning unique labels to nodes of a dynamic XML tree in which the labels encode all structural relationships between the nodes is a challenging problem. Early labeling schemes designed for static XML document generate short labels; however, their performance degrades in update intensive environments due to the need for relabeling. On the other hand, dynamic labeling schemes achieve dynamicity at the cost of large label size or complexity which results in poor query performance. This thesis presents OrderBased labeling scheme which is dynamic, simple and compact yet able to identify structural relationships among nodes. A set of performance tests show promising labeling, querying, update performance and optimum label size

    End-to-End Entity Resolution for Big Data: A Survey

    Get PDF
    One of the most important tasks for improving data quality and the reliability of data analytics results is Entity Resolution (ER). ER aims to identify different descriptions that refer to the same real-world entity, and remains a challenging problem. While previous works have studied specific aspects of ER (and mostly in traditional settings), in this survey, we provide for the first time an end-to-end view of modern ER workflows, and of the novel aspects of entity indexing and matching methods in order to cope with more than one of the Big Data characteristics simultaneously. We present the basic concepts, processing steps and execution strategies that have been proposed by different communities, i.e., database, semantic Web and machine learning, in order to cope with the loose structuredness, extreme diversity, high speed and large scale of entity descriptions used by real-world applications. Finally, we provide a synthetic discussion of the existing approaches, and conclude with a detailed presentation of open research directions
    corecore