82 research outputs found

    Intuitionistic fuzzy XML query matching and rewriting

    With the emergence of XML as a standard for data representation, particularly on the web, the need for intelligent query languages that can operate on structurally heterogeneous XML documents has grown considerably. Traditional Information Retrieval and Database approaches have limitations when dealing with such scenarios, so fuzzy (flexible) approaches have become predominant. In this thesis, we propose a new approach for approximate XML query matching and rewriting which aims at achieving soft matching of XML queries with XML data sources following different schemas. Unlike traditional querying approaches, which require exact matching, the proposed approach makes use of Intuitionistic Fuzzy Trees to achieve approximate (soft) query matching. Through this new approach, not only the exact answer to a query but also approximate answers are retrieved. Furthermore, partial results can be obtained from multiple data sources and merged to produce a single answer to a query. The proposed approach introduces a new tree similarity measure, based on arc matching, that considers the minimum and maximum degrees of similarity/inclusion of trees. New techniques for soft node and arc matching are presented for matching queries against data sources with highly varied structures. A prototype was developed to test the proposed ideas; it demonstrated the ability to achieve approximate matching of pattern queries against a number of XML schemas and to rewrite the original query so that it obtains results from the underlying data sources. This was achieved through several novel algorithms, which were tested and shown to be efficient, with low CPU and memory cost even for a large number of data sources.
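
To make the idea of a lower and an upper degree of inclusion based on arc matching more concrete, here is a minimal sketch. It is not the thesis's Intuitionistic Fuzzy Tree measure, whose definition is not given in the abstract. The tuple-based tree encoding, the function names, and the choice of "exact parent-child arcs" as the lower bound and "ancestor-descendant pairs" as the upper bound are all assumptions made purely for illustration.

```python
# Illustrative only: trees are (label, [children]) tuples. This encoding and
# both bounds are assumptions, not the measure defined in the thesis.

def labels(tree):
    """All labels in the subtree, root first."""
    label, children = tree
    out = [label]
    for child in children:
        out.extend(labels(child))
    return out


def arcs(tree):
    """Set of (parent_label, child_label) arcs in the tree."""
    label, children = tree
    result = set()
    for child in children:
        result.add((label, child[0]))
        result |= arcs(child)
    return result


def ancestor_pairs(tree):
    """Set of (ancestor_label, descendant_label) pairs in the tree."""
    label, children = tree
    result = set()
    for child in children:
        for descendant in labels(child):
            result.add((label, descendant))
        result |= ancestor_pairs(child)
    return result


def inclusion_bounds(query, data):
    """Return (lower, upper) fractions of query arcs found in the data tree.

    lower counts only exact parent-child arcs; upper also accepts arcs that
    are stretched into ancestor-descendant paths in the data tree.
    """
    query_arcs = arcs(query)
    if not query_arcs:
        return 1.0, 1.0
    lower = len(query_arcs & arcs(data)) / len(query_arcs)
    upper = len(query_arcs & ancestor_pairs(data)) / len(query_arcs)
    return lower, upper


query = ("book", [("title", []), ("author", [])])
data = ("book", [("title", []), ("authors", [("author", [])])])
print(inclusion_bounds(query, data))  # (0.5, 1.0)
```

In this toy case the data source nests `author` under an extra `authors` element, so only one of the two query arcs matches exactly, while both match once ancestor-descendant paths are allowed.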

    A survey on tree matching and XML retrieval

    With the increasing number of available XML documents, numerous approaches for retrieval have been proposed in the literature. They usually use the tree representation of documents and queries to process them, whether implicitly or explicitly. Although retrieving XML documents can be considered a tree matching problem between the query tree and the document trees, only a few approaches take advantage of the algorithms and methods offered by graph theory. In this paper, we study the theoretical approaches proposed in the literature for tree matching and examine how these approaches have been adapted to XML querying and retrieval, from both an exact and an approximate matching perspective. This study allows us to highlight theoretical aspects of graph theory that have not yet been explored in XML retrieval.

    Solving the intractable problem: optimal performance for worst case scenarios in XML twig pattern matching

    In the history of databases, eXtensible Markup Language (XML) has been thought of as the standard format to store and exchange semi-structured data. With the advent of IoT, XML technologies can play an important role in addressing the issue of processing the massive amount of data generated by heterogeneous devices. As the number and complexity of such datasets increase, there is a need for algorithms that can index and retrieve XML data efficiently even for complex queries. In this context, twig pattern matching, i.e. finding all occurrences of a twig pattern query (TPQ), is a core operation in XML query processing. Until now, holistic joins have been considered the state-of-the-art TPQ processing algorithms, but they fail to guarantee an optimal evaluation except at the expense of excessive storage costs, which limit their scope for large datasets. In this article, we introduce a new approach which significantly outperforms earlier methods in terms of both the size of the intermediate storage and query running time. The approach presented here uses Child Prime Labels (Alsubai & North, 2018) to improve the filtering phase of bottom-up twig matching algorithms, together with a novel algorithm which avoids the use of stacks, thus improving TPQ processing efficiency. Several experiments were conducted on common benchmarks such as the DBLP, XMark and TreeBank datasets to study the performance of the new approach. Multiple analyses on a range of twig pattern queries are presented to demonstrate the statistical significance of the improvements.
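
For readers unfamiliar with twig pattern queries, the following is a minimal sketch of what "finding all occurrences of a twig pattern" means. It is a naive recursive baseline, not the holistic-join or Child Prime Label approach discussed in the abstract. The `TwigNode` class and the function names are hypothetical, and the sketch reports only the elements at which the pattern root can be embedded rather than every full tuple of matched nodes.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field


@dataclass
class TwigNode:
    """One node of a twig pattern (hypothetical helper type)."""
    tag: str
    # Each branch is (axis, subpattern); axis is "/" (child) or "//" (descendant).
    branches: list = field(default_factory=list)


def matches_at(elem, pattern):
    """True if the twig pattern can be embedded with its root at elem."""
    if elem.tag != pattern.tag:
        return False
    for axis, sub in pattern.branches:
        if axis == "/":
            candidates = list(elem)             # direct children only
        else:
            candidates = list(elem.iter())[1:]  # iter() yields elem itself first
        if not any(matches_at(c, sub) for c in candidates):
            return False
    return True


def find_twig(root, pattern):
    """All elements in document order at which the pattern root matches."""
    return [e for e in root.iter() if matches_at(e, pattern)]


doc = ET.fromstring(
    "<dblp>"
    "<article><author>A</author><title>T1</title></article>"
    "<article><title>T2</title></article>"
    "<book><chapter><author>B</author></chapter><title>T3</title></book>"
    "</dblp>"
)

# Roughly the XPath article[author][title]
article_twig = TwigNode("article", [("/", TwigNode("author")), ("/", TwigNode("title"))])
# Roughly the XPath book[.//author][title]
book_twig = TwigNode("book", [("//", TwigNode("author")), ("/", TwigNode("title"))])

print(len(find_twig(doc, article_twig)))  # 1
print(len(find_twig(doc, book_twig)))     # 1
```

This baseline re-scans subtrees for every candidate node, which is exactly the kind of redundant work that labelling schemes and holistic algorithms are designed to avoid.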

    Optimized Indexes for Data Structured Retrieval

    The aim of this work is to present a novel index structure based on a suffix array and a ternary search tree, combined with rank and select succinct data structures. Suffix arrays were originally developed to reduce memory consumption compared to suffix trees, and ternary search trees combine the time efficiency of digital tries with the space efficiency of binary search trees. The rank of a symbol at a given position equals the number of times the symbol appears in the corresponding prefix of the sequence. Select is the inverse, retrieving the positions of the symbol's occurrences. These operations are widely used in information retrieval and management, being the basis of several data structures and algorithms for text collections, graphs, trees, etc. The resulting structure is faster than hashing for many typical search problems and supports a broader range of useful operations. We therefore implement a path index based on these data structures, which is shown to be highly efficient when dealing with digital collections consisting of structured documents. We describe how the index architecture works, compare its search algorithms with others, and finally report experiments showing that it outperforms earlier approaches.
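
The rank and select definitions above can be illustrated with a short sketch. This version stores plain per-symbol position lists, so it is not succinct and is not the index described in the abstract; it only demonstrates the semantics of the two operations. The class name `RankSelect` and the conventions used (prefix of length `i` for rank, 1-based `k` for select, 0-based positions) are assumptions of this example.

```python
from bisect import bisect_left
from collections import defaultdict


class RankSelect:
    """Naive rank/select over a sequence (illustrative, not succinct)."""

    def __init__(self, sequence):
        # Sorted list of positions for every symbol in the sequence.
        self.positions = defaultdict(list)
        for i, symbol in enumerate(sequence):
            self.positions[symbol].append(i)

    def rank(self, symbol, i):
        """Number of occurrences of symbol in the prefix sequence[0:i]."""
        return bisect_left(self.positions.get(symbol, []), i)

    def select(self, symbol, k):
        """0-based position of the k-th occurrence of symbol (k starts at 1)."""
        occurrences = self.positions.get(symbol, [])
        if not 1 <= k <= len(occurrences):
            raise ValueError(f"{symbol!r} does not occur {k} times")
        return occurrences[k - 1]


rs = RankSelect("abracadabra")
print(rs.rank("a", 4))    # 2: "abra" contains two 'a's
print(rs.select("a", 3))  # 5: the third 'a' sits at index 5
```

The two operations are inverse in the sense that `rs.rank(c, rs.select(c, k)) == k - 1` for every valid `k`: exactly `k - 1` occurrences of `c` precede the k-th one.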

    Grade And Exact In Order Of Textual Substance

    Ranking and returning the most relevant results for a query is probably the most popular form of XML query processing. To address this issue, we first propose an elegant framework for query relaxation to support difficult XML queries. The solutions on which this framework is based are not required to satisfy the precisely defined query syntax, as they can instead be based on properties that can be deduced from the initial query. It does not have the power to elegantly combine structures and content to answer comfortable questions. In our solution, we classify nodes into two groups, categorical nodes and statistical nodes, and use pattern-based approaches to assess the similarity relationship of categorical nodes and statistical nodes. We then use a comprehensive set of experiments to demonstrate the effectiveness of our proposed approach in terms of accuracy and recovery of values. Querying XML data often becomes difficult in practical applications because the hierarchical structure of XML documents can be heterogeneous, so any slight misunderstanding of the document structure can increase the risk of unsatisfactory queries. This is very difficult, especially given that such queries produce empty solutions even if there are no translation errors. In addition, we design a non-periodic evidence-based vector diagram to create and adjust the weakening of the structure, and develop an inefficient evaluation parameter to evaluate the similarity relationship on structures. Finally, we design a new top-k approach that can intelligently create the most promising solutions in a linked order using the ranking scale.

    Efficient similarity computations on parallel machines using data shaping

    Similarity computation is a fundamental operation in all forms of data. Big Data is typically characterized by attributes such as volume, velocity, variety, veracity, etc. In general, Big Data variety appears in structured, semi-structured or unstructured forms. The volume of Big Data in general, and of semi-structured data in particular, is increasing at a phenomenal rate. The Big Data phenomenon poses a new set of challenges to similarity computation problems occurring in semi-structured data. Technology and processor architecture trends suggest very strongly that future processors will have tens of thousands of cores (hardware threads). Another crucial trend is that the ratio of on-chip and off-chip memory to core count is decreasing. State-of-the-art parallel computing platforms such as General Purpose Graphics Processors (GPUs) and MICs are promising for high performance as well as high throughput computing. However, processing the semi-structured component of Big Data efficiently using parallel computing systems (e.g. GPUs) is challenging. The reason is that most of the emerging platforms (e.g. GPUs) are organized as Single Instruction Multiple Thread/Data machines, which are highly structured, with several cores (streaming processors) operating in lock-step, or they require a high degree of task-level parallelism. We argue that effective and efficient solutions to key similarity computation problems need to operate in a synergistic manner with the underlying computing hardware. Moreover, semi-structured input data needs to be shaped or reorganized with the goal of exploiting the enormous computing power of state-of-the-art highly threaded architectures such as GPUs. For example, shaping input data (via encoding) with minimal data dependence can facilitate flexible and concurrent computations on high throughput accelerators/co-processors such as GPUs, MICs, etc.
We consider various instances of traditional and futuristic problems occurring at the intersection of semi-structured data and data analytics. Preprocessing is an operation common to the initial stages of data processing pipelines. Typically, preprocessing involves operations such as data extraction, data selection, etc. In the context of semi-structured data, twig filtering is used to identify (and extract) data of interest. Duplicate detection and record linkage operations are useful in preprocessing tasks such as data cleaning and data fusion, and also in data mining, in order to find similar tree objects. Likewise, tree edit distance is a fundamental metric used in the context of tree problems, and similarity computation between trees is another key problem in the context of Big Data. This dissertation makes a case for platform-centric data shaping as a potent mechanism to tackle the data- and architecture-borne issues in the context of semi-structured data processing on GPUs and GPU-like parallel architectures. In this dissertation, we propose several data shaping techniques for tree matching problems occurring in semi-structured data. We experiment with real-world datasets. The experimental results reveal that the proposed platform-centric data shaping approach is effective for computing similarities between tree objects using GPGPUs. The proposed techniques result in performance gains of up to three orders of magnitude, depending on the problem and platform.

    31st International Conference on Very Large Data Bases

    A news report on the 31st International Conference on Very Large Data Bases, which has been held.