Search CORE

16 research outputs found

TopX : efficient and versatile top-k query processing for text, structured, and semistructured data

Author: Theobald Martin
Publication venue: Sonstige Einrichtungen. Sonstige Einrichtungen
Publication date: 01/01/2006
Field of study

TopX is a top-k retrieval engine for text and XML data. Unlike Boolean engines, it stops query processing as soon as it can safely determine the k top-ranked result objects according to a monotonous score aggregation function with respect to a multidimensional query. The main contributions of the thesis unfold into four main points, confirmed by previous publications at international conferences or workshops: • Top-k query processing with probabilistic guarantees. • Index-access optimized top-k query processing. • Dynamic and self-tuning, incremental query expansion for top-k query processing. • Efficient support for ranked XML retrieval and full-text search. Our experiments demonstrate the viability and improved efficiency of our approach compared to existing related work for a broad variety of retrieval scenarios.TopX ist eine Top-k Suchmaschine für Text und XML Daten. Im Gegensatz zu Boole\u27; schen Suchmaschinen terminiert TopX die Anfragebearbeitung, sobald die k besten Ergebnisobjekte im Hinblick auf eine mehrdimensionale Anfrage gefunden wurden. Die Hauptbeiträge dieser Arbeit teilen sich in vier Schwerpunkte basierend auf vorherigen Veröffentlichungen bei internationalen Konferenzen oder Workshops: • Top-k Anfragebearbeitung mit probabilistischen Garantien. • Zugriffsoptimierte Top-k Anfragebearbeitung. • Dynamische und selbstoptimierende, inkrementelle Anfrageexpansion für Top-k Anfragebearbeitung. • Effiziente Unterstützung für XML-Anfragen und Volltextsuche. Unsere Experimente bestätigen die Vielseitigkeit und gesteigerte Effizienz unserer Verfahren gegenüber existierenden, führenden Ansätzen für eine weite Bandbreite von Anwendungen in der Informationssuche

CiteSeerX

Universaar

Acronym

Reasoning & Querying – State of the Art

Author: Bry François
Furche Tim
Weiand Klara
Publication venue
Publication date: 31/08/2008
Field of study

Various query languages for Web and Semantic Web data, both for practical use and as an area of research in the scientific community, have emerged in recent years. At the same time, the broad adoption of the internet where keyword search is used in many applications, e.g. search engines, has familiarized casual users with using keyword queries to retrieve information on the internet. Unlike this easy-to-use querying, traditional query languages require knowledge of the language itself as well as of the data to be queried. Keyword-based query languages for XML and RDF bridge the gap between the two, aiming at enabling simple querying of semi-structured data, which is relevant e.g. in the context of the emerging Semantic Web. This article presents an overview of the field of keyword querying for XML and RDF

Open Access LMU

Using semantics in XML query processing

Author: WU HUAYU
Publication venue
Publication date: 21/03/2011
Field of study

Ph.DDOCTOR OF PHILOSOPH

ScholarBank@NUS

A Labeling DOM-Based Tree Walking Algorithm for Mapping XML Documents into Relational Databases

Author: Fath EL Rhman Seif El.Duola
Publication venue: UOFK
Publication date
Field of study

XML has emerged as the standard format for representing and exchanging data on the World Wide Web. For practical purposes, it is found to be critical to have efficient mechanisms to store and query XML data, as well as to exploit the full power of this new technology. Several researchers have proposed to use relational databases to store and query XML data. With the understanding the limitations of current approaches, this thesis aims to propose an algorithm for automatic mapping XML documents to RDBMS with XML-API as a database utility. The algorithm uses best fit auto mapping technique, and dynamic shredding, of a specified selected XML document type (datacentric, document-centric, and mixed documents).e. The propose algorithm use DOM(Data Object Model) as a warehouse and stack as a data structure to mapping the XML document into relational database and reconstructing the XML document from the relational database. The experiment study show that the algorithm mapping document and reconstructing it again well. Finally, the algorithm compare with other algorithms the result is good in time and efficiency, also the algorithm complexity is O(11n+2)

KhartoumSpace

An experimental study and evaluation of a new architecture for clinical decision support - integrating the openEHR specifications for the Electronic Health Record with Bayesian Networks

Author: Arikan SS
Publication venue: UCL (University College London)
Publication date: 28/06/2016
Field of study

Healthcare informatics still lacks wide-scale adoption of intelligent decision support methods, despite continuous increases in computing power and methodological advances in scalable computation and machine learning, over recent decades. The potential has long been recognised, as evidenced in the literature of the domain, which is extensively reviewed. The thesis identifies and explores key barriers to adoption of clinical decision support, through computational experiments encompassing a number of technical platforms. Building on previous research, it implements and tests a novel platform architecture capable of processing and reasoning with clinical data. The key components of this platform are the now widely implemented openEHR electronic health record specifications and Bayesian Belief Networks. Substantial software implementations are used to explore the integration of these components, guided and supplemented by input from clinician experts and using clinical data models derived in hospital settings at Moorfields Eye Hospital. Data quality and quantity issues are highlighted. Insights thus gained are used to design and build a novel graph-based representation and processing model for the clinical data, based on the openEHR specifications. The approach can be implemented using diverse modern database and platform technologies. Computational experiments with the platform, using data from two clinical domains – a preliminary study with published thyroid metabolism data and a substantial study of cataract surgery – explore fundamental barriers that must be overcome in intelligent healthcare systems developments for clinical settings. These have often been neglected, or misunderstood as implementation procedures of secondary importance. The results confirm that the methods developed have the potential to overcome a number of these barriers. The findings lead to proposals for improvements to the openEHR specifications, in the context of machine learning applications, and in particular for integrating them with Bayesian Networks. The thesis concludes with a roadmap for future research, building on progress and findings to date

UCL Discovery

Child Prime Label Approaches to Evaluate XML Structured Queries

Author: Alsubai Shtwai
Publication venue: 'University of Sheffield Conference Proceedings'
Publication date: 05/03/2018
Field of study

The adoption of the eXtensible Markup Language (XML) as the standard format to store and exchange semi-structure data has been gaining momentum. The growing number of XML documents leads to the need for appropriate XML querying algorithms which are able to retrieve XML data efficiently. Due to the importance of twig pattern matching in XML retrieval systems, finding all matching occurrences of a tree pattern query in an XML document is often considered as a specific task for XML databases as well as a core operation in XML query processing. This thesis presents a design and implementation of a new indexing technique, called the Child Prime Label (CPL) which exploits the property of prime numbers to identify Parent-Child (P-C) edges in twig pattern queries (TPQs) during query evaluation. The CPL approach can be incorporated efficiently within the existing labelling schemes. The major contributions of this thesis can be seen as a set of novel twig matching algorithms which apply the CPL approach and focus on reducing the overhead of storing useless elements and performing unnecessary computations during the output enumeration. The research presented here is the first to provide an efficient and general solution for TPQs containing ordering constraints and positional predicates specified by the XML query languages. To evaluate the CPL approaches, the holistic model was implemented as an experimental prototype in which the approaches proposed are compared against state-of-the-art holistic twig algorithms. Extensive performance studies on various real-world and artificial datasets were conducted to demonstrate the significant improvement of the CPL approaches over the previous indexing and querying methods. The experimental results demonstrate the validity and improvements of the new algorithms over other related methods on common various subclasses of TPQs. Moreover, the scalability tests reveal that the new algorithms are more suitable for processing large XML datasets

White Rose E-theses Online

Towards an effective processing of XML keyword query

Author: BAO ZHIFENG
Publication venue
Publication date: 29/11/2010
Field of study

Ph.DDOCTOR OF PHILOSOPH

ScholarBank@NUS

Algorithms for XML stream processing : massive data, external memory and scalable performance

Author: Alrammal Muath
Publication venue: HAL CCSD
Publication date: 01/01/2011
Field of study

Many modern applications require processing of massive streams of XML data, creating difficult technical challenges. Among these, there is the design and implementation of applications to optimize the processing of XPath queries and to provide an accurate cost estimation for these queries processed on a massive steam of XML data. In this thesis, we propose a novel performance prediction model which a priori estimates the cost (in terms of space used and time spent) for any structural query belonging to Forward XPath. In doing so, we perform an experimental study to confirm the linear relationship between stream-processing and data-access resources. Therefore, we introduce a mathematical model (linear regression functions) to predict the cost for a given XPath query. Moreover, we introduce a new selectivity estimation technique. It consists of two elements. The first one is the path tree structure synopsis: a concise, accurate, and convenient summary of the structure of an XML document. The second one is the selectivity estimation algorithm: an efficient streamquerying algorithm to traverse the path tree synopsis for estimating the values of cost-parameters. Those parameters are used by the mathematical model to determine the cost of a given XPath query. We compare the performance of our model with existing approaches. Furthermore, we present a use case for an online stream-querying system. The system uses our performance predicate model to estimate the cost for a given XPath query in terms of time/memory. Moreover, it provides an accurate answer for the query's sender. This use case illustrates the practical advantages of performance management with our techniques.Plusieurs applications modernes nécessitent un traitement de flux massifs de données XML, cela crée de défis techniques. Parmi ces derniers, il y a la conception et la mise en ouvre d'outils pour optimiser le traitement des requêtes XPath et fournir une estimation précise des coûts de ces requêtes traitées sur un flux massif de données XML. Dans cette thèse, nous proposons un nouveau modèle de prédiction de performance qui estime a priori le coût (en termes d'espace utilisé et de temps écoulé) pour les requêtes structurelles de Forward XPath. Ce faisant, nous réalisons une étude expérimentale pour confirmer la relation linéaire entre le traitement de flux, et les ressources d'accès aux données. Par conséquent, nous présentons un modèle mathématique (fonctions de régression linéaire) pour prévoir le coût d'une requête XPath donnée. En outre, nous présentons une technique nouvelle d'estimation de sélectivité. Elle se compose de deux éléments. Le premier est le résumé path tree: une présentation concise et précise de la structure d'un document XML. Le second est l'algorithme d'estimation de sélectivité: un algorithme efficace de flux pour traverser le synopsis path tree pour estimer les valeurs des paramètres de coût. Ces paramètres sont utilisés par le modèle mathématique pour déterminer le coût d'une requête XPath donnée. Nous comparons les performances de notre modèle avec les approches existantes. De plus, nous présentons un cas d'utilisation d'un système en ligne appelé "online stream-querying system". Le système utilise notre modèle de prédiction de performance pour estimer le coût (en termes de temps / mémoire) d'une requête XPath donnée. En outre, il fournit une réponse précise à l'auteur de la requête. Ce cas d'utilisation illustre les avantages pratiques de gestion de performance avec nos technique

IDEAS-1997-2021-Final-Programs

Author: Desai Bipin C.
Publication venue
Publication date: 31/08/2021
Field of study

This document records the final program for each of the 26 meetings of the International Database and Engineering Application Symposium from 1997 through 2021. These meetings were organized in various locations on three continents. Most of the papers published during these years are in the digital libraries of IEEE(1997-2007) or ACM(2008-2021)

Concordia University Research Repository

간결한 자료구조를 활용한 반구조화된 문서 형식들의 공간 효율적 표현법

Author: 이준희
Publication venue: 서울대학교 대학원
Publication date: 01/02/2021
Field of study

학위논문 (박사) -- 서울대학교 대학원 : 공과대학 전기·컴퓨터공학부, 2021. 2. Srinivasa Rao Satti.Numerous big data are generated from a plethora of sources. Most of the data stored as files contain a non-fixed type of schema, so that the files are suitable to be maintained as semi-structured document formats. A number of those formats, such as XML (eXtensible Markup Language), JSON (JavaScript Object Notation), and YAML (YAML Ain't Markup Language) are suggested to sustain hierarchy in the original corpora of data. Several data models structuring the gathered data - including RDF (Resource Description Framework) - depend on the semi-structured document formats to be serialized and transferred for future processing. Since the semi-structured document formats focus on readability and verbosity, redundant space is required to organize and maintain the document. Even though general-purpose compression schemes are widely used to compact the documents, applying those algorithms hinder future handling of the corpora, owing to loss of internal structures. The area of succinct data structures is widely investigated and researched in theory, to provide answers to the queries while the encoded data occupy space close to the information-theoretic lower bound. Bit vectors and trees are the notable succinct data structures. Nevertheless, there were few attempts to apply the idea of succinct data structures to represent the semi-structured documents in space-efficient manner. In this dissertation we propose a unified, space-efficient representation of various semi-structured document formats. The core functionality of this representation is its compactness and query-ability derived from enriched functions of succinct data structures. Incorporation of (a) bit indexed arrays, (b) succinct ordinal trees, and (c) compression techniques engineers the compact representation. We implement this representation in practice, and show by experiments that construction of this representation decreases the disk usage by up to 60% while occupying 90% less RAM. We also allow processing a document in partial manner, to allow processing of larger corpus of big data even in the constrained environment. In parallel to establishing the aforementioned compact semi-structured document representation, we provide and reinforce some of the existing compression schemes in this dissertation. We first suggest an idea to encode an array of integers that is not necessarily sorted. This compaction scheme improves upon the existing universal code systems, by assistance of succinct bit vector structure. We show that our suggested algorithm reduces space usage by up to 44% while consuming 15% less time than the original code system, while the algorithm additionally supports random access of elements upon the encoded array. We also reinforce the SBH bitmap index compression algorithm. The main strength of this scheme is the use of intermediate super-bucket during operations, giving better performance on querying through a combination of compressed bitmap indexes. Inspired from splits done during the intermediate process of the SBH algorithm, we give an improved compression mechanism supporting parallelism that could be utilized in both CPUs and GPUs. We show by experiments that this CPU parallel processing optimization diminishes compression and decompression times by up to 38% in a 4-core machine without modifying the bitmap compressed form. For GPUs, the new algorithm gives 48% faster query processing time in the experiments, compared to the previous existing bitmap index compression schemes.셀 수 없는 빅 데이터가 다양한 원본로부터 생성되고 있다. 이들 데이터의 대부분은 고정되지 않은 종류의 스키마를 포함한 파일 형태로 저장되는데, 이로 인하여 반구조화된 문서 형식을 이용하여 파일을 유지하는 것이 적합하다. XML, JSON 및 YAML과 같은 종류의 반구조화된 문서 형식이 데이터에 내재하는 구조를 유지하기 위하여 제안되었다. 수집된 데이터를 구조화하는 RDF와 같은 여러 데이터 모델들은 사후 처리를 위한 저장 및 전송을 위하여 반구조화된 문서 형식에 의존한다. 반구조화된 문서 형식은 가독성과 다변성에 집중하기 때문에, 문서를 구조화하고 유지하기 위하여 추가적인 공간을 필요로 한다. 문서를 압축시키기 위하여 일반적인 압축 기법들이 널리 사용되고 있으나, 이들 기법들을 적용하게 되면 문서의 내부 구조의 손실로 인하여 데이터의 사후 처리가 어렵게 된다. 데이터를 정보이론적 하한에 가까운 공간만을 사용하여 저장을 가능하게 하면서 질의에 대한 응답을 제공하는 간결한 자료구조는 이론적으로 널리 연구되고 있는 분야이다. 비트열과 트리가 널리 알려진 간결한 자료구조들이다. 그러나 반구조화된 문서들을 저장하는 데 간결한 자료구조의 아이디어를 적용한 연구는 거의 진행되지 않았다. 본 학위논문을 통해 우리는 다양한 종류의 반구조화된 문서 형식을 통일되게 표현하는 공간 효율적 표현법을 제시한다. 이 기법의 주요한 기능은 간결한 자료구조가 강점으로 가지는 특성에 기반한 간결성과 질의 가능성이다. 비트열로 인덱싱된 배열, 간결한 순서 있는 트리 및 다양한 압축 기법을 통합하여 해당 표현법을 고안하였다. 이 기법은 실재적으로 구현되었고, 실험을 통하여 이 기법을 적용한 반구조화된 문서들은 최대 60% 적은 디스크 공간과 90% 적은 메모리 공간을 통해 표현될 수 있다는 것을 보인다. 더불어 본 학위논문에서 반구조화된 문서들은 분할적으로 표현이 가능함을 보이고, 이를 통하여 제한된 환경에서도 빅 데이터를 표현한 문서들을 처리할 수 있다는 것을 보인다. 앞서 언급한 공간 효율적 반구조화된 문서 표현법을 구축함과 동시에, 본 학위논문에서 이미 존재하는 압축 기법 중 일부를 추가적으로 개선한다. 첫째로, 본 학위논문에서는 정렬 여부에 관계없는 정수 배열을 부호화하는 아이디어를 제시한다. 이 기법은 이미 존재하는 범용 코드 시스템을 개선한 형태로, 간결한 비트열 자료구조를 이용한다. 제안된 알고리즘은 기존 범용 코드 시스템에 비해 최대 44\% 적은 공간을 사용할 뿐만 아니라 15\% 적은 부호화 시간을 필요로 하며, 기존 시스템에서 제공하지 않는 부호화된 배열에서의 임의 접근을 지원한다. 또한 본 학위논문에서는 비트맵 인덱스 압축에 사용되는 SBH 알고리즘을 개선시킨다. 해당 기법의 주된 강점은 부호화와 복호화 진행 시 중간 매개인 슈퍼버켓을 사용함으로써 여러 압축된 비트맵 인덱스에 대한 질의 성능을 개선시키는 것이다. 위 압축 알고리즘의 중간 과정에서 진행되는 분할에서 영감을 얻어, 본 학위논문에서 CPU 및 GPU에 적용 가능한 개선된 병렬화 압축 매커니즘을 제시한다. 실험을 통해 CPU 병렬 최적화가 이루어진 알고리즘은 압축된 형태의 변형 없이 4코어 컴퓨터에서 최대 38\%의 압축 및 해제 시간을 감소시킨다는 것을 보인다. GPU 병렬 최적화는 기존에 존재하는 GPU 비트맵 압축 기법에 비해 48\% 빠른 질의 처리 시간을 필요로 함을 확인한다.Chapter 1 Introduction 1 1.1 Contribution 3 1.2 Organization 5 Chapter 2 Background 6 2.1 Model of Computation 6 2.2 Succinct Data Structures 7 Chapter 3 Space-efficient Representation of Integer Arrays 9 3.1 Introduction 9 3.2 Preliminaries 10 3.2.1 Universal Code System 10 3.2.2 Bit Vector 13 3.3 Algorithm Description 13 3.3.1 Main Principle 14 3.3.2 Optimization in the Implementation 16 3.4 Experimental Results 16 Chapter 4 Space-efficient Parallel Compressed Bitmap Index Processing 19 4.1 Introduction 19 4.2 Related Work 23 4.2.1 Byte-aligned Bitmap Code (BBC) 24 4.2.2 Word-Aligned Hybrid (WAH) 27 4.2.3 WAH-derived Algorithms 28 4.2.4 GPU-based WAH Algorithms 31 4.2.5 Super Byte-aligned Hybrid (SBH) 33 4.3 Parallelizing SBH 38 4.3.1 CPU Parallelism 38 4.3.2 GPU Parallelism 39 4.4 Experimental Results 40 4.4.1 Plain Version 41 4.4.2 Parallelized Version 46 4.4.3 Summary 49 Chapter 5 Space-efficient Representation of Semi-structured Document Formats 50 5.1 Preliminaries 50 5.1.1 Semi-structured Document Formats 50 5.1.2 Resource Description Framework 57 5.1.3 Succinct Ordinal Tree Representations 60 5.1.4 String Compression Schemes 64 5.2 Representation 66 5.2.1 Bit String Indexed Array 67 5.2.2 Main Structure 68 5.2.3 Single Document as a Collection of Chunks 72 5.2.4 Supporting Queries 73 5.3 Experimental Results 75 5.3.1 Datasets 76 5.3.2 Construction Time 78 5.3.3 RAM Usage during Construction 80 5.3.4 Disk Usage and Serialization Time 83 5.3.5 Chunk Division 83 5.3.6 String Compression 88 5.3.7 Query Time 89 Chapter 6 Conclusion 94 Bibliography 96 요약 109 Acknowledgements 111Docto

SNU Open Repository and Archive