16 research outputs found

    TopX : efficient and versatile top-k query processing for text, structured, and semistructured data

    Get PDF
    TopX is a top-k retrieval engine for text and XML data. Unlike Boolean engines, it stops query processing as soon as it can safely determine the k top-ranked result objects according to a monotonous score aggregation function with respect to a multidimensional query. The main contributions of the thesis unfold into four main points, confirmed by previous publications at international conferences or workshops: โ€ข Top-k query processing with probabilistic guarantees. โ€ข Index-access optimized top-k query processing. โ€ข Dynamic and self-tuning, incremental query expansion for top-k query processing. โ€ข Efficient support for ranked XML retrieval and full-text search. Our experiments demonstrate the viability and improved efficiency of our approach compared to existing related work for a broad variety of retrieval scenarios.TopX ist eine Top-k Suchmaschine fรผr Text und XML Daten. Im Gegensatz zu Boole\u27; schen Suchmaschinen terminiert TopX die Anfragebearbeitung, sobald die k besten Ergebnisobjekte im Hinblick auf eine mehrdimensionale Anfrage gefunden wurden. Die Hauptbeitrรคge dieser Arbeit teilen sich in vier Schwerpunkte basierend auf vorherigen Verรถffentlichungen bei internationalen Konferenzen oder Workshops: โ€ข Top-k Anfragebearbeitung mit probabilistischen Garantien. โ€ข Zugriffsoptimierte Top-k Anfragebearbeitung. โ€ข Dynamische und selbstoptimierende, inkrementelle Anfrageexpansion fรผr Top-k Anfragebearbeitung. โ€ข Effiziente Unterstรผtzung fรผr XML-Anfragen und Volltextsuche. Unsere Experimente bestรคtigen die Vielseitigkeit und gesteigerte Effizienz unserer Verfahren gegenรผber existierenden, fรผhrenden Ansรคtzen fรผr eine weite Bandbreite von Anwendungen in der Informationssuche

    Reasoning & Querying โ€“ State of the Art

    Get PDF
    Various query languages for Web and Semantic Web data, both for practical use and as an area of research in the scientific community, have emerged in recent years. At the same time, the broad adoption of the internet where keyword search is used in many applications, e.g. search engines, has familiarized casual users with using keyword queries to retrieve information on the internet. Unlike this easy-to-use querying, traditional query languages require knowledge of the language itself as well as of the data to be queried. Keyword-based query languages for XML and RDF bridge the gap between the two, aiming at enabling simple querying of semi-structured data, which is relevant e.g. in the context of the emerging Semantic Web. This article presents an overview of the field of keyword querying for XML and RDF

    Using semantics in XML query processing

    Get PDF
    Ph.DDOCTOR OF PHILOSOPH

    A Labeling DOM-Based Tree Walking Algorithm for Mapping XML Documents into Relational Databases

    Get PDF
    XML has emerged as the standard format for representing and exchanging data on the World Wide Web. For practical purposes, it is found to be critical to have efficient mechanisms to store and query XML data, as well as to exploit the full power of this new technology. Several researchers have proposed to use relational databases to store and query XML data. With the understanding the limitations of current approaches, this thesis aims to propose an algorithm for automatic mapping XML documents to RDBMS with XML-API as a database utility. The algorithm uses best fit auto mapping technique, and dynamic shredding, of a specified selected XML document type (datacentric, document-centric, and mixed documents).e. The propose algorithm use DOM(Data Object Model) as a warehouse and stack as a data structure to mapping the XML document into relational database and reconstructing the XML document from the relational database. The experiment study show that the algorithm mapping document and reconstructing it again well. Finally, the algorithm compare with other algorithms the result is good in time and efficiency, also the algorithm complexity is O(11n+2)

    An experimental study and evaluation of a new architecture for clinical decision support - integrating the openEHR specifications for the Electronic Health Record with Bayesian Networks

    Get PDF
    Healthcare informatics still lacks wide-scale adoption of intelligent decision support methods, despite continuous increases in computing power and methodological advances in scalable computation and machine learning, over recent decades. The potential has long been recognised, as evidenced in the literature of the domain, which is extensively reviewed. The thesis identifies and explores key barriers to adoption of clinical decision support, through computational experiments encompassing a number of technical platforms. Building on previous research, it implements and tests a novel platform architecture capable of processing and reasoning with clinical data. The key components of this platform are the now widely implemented openEHR electronic health record specifications and Bayesian Belief Networks. Substantial software implementations are used to explore the integration of these components, guided and supplemented by input from clinician experts and using clinical data models derived in hospital settings at Moorfields Eye Hospital. Data quality and quantity issues are highlighted. Insights thus gained are used to design and build a novel graph-based representation and processing model for the clinical data, based on the openEHR specifications. The approach can be implemented using diverse modern database and platform technologies. Computational experiments with the platform, using data from two clinical domains โ€“ a preliminary study with published thyroid metabolism data and a substantial study of cataract surgery โ€“ explore fundamental barriers that must be overcome in intelligent healthcare systems developments for clinical settings. These have often been neglected, or misunderstood as implementation procedures of secondary importance. The results confirm that the methods developed have the potential to overcome a number of these barriers. The findings lead to proposals for improvements to the openEHR specifications, in the context of machine learning applications, and in particular for integrating them with Bayesian Networks. The thesis concludes with a roadmap for future research, building on progress and findings to date

    Child Prime Label Approaches to Evaluate XML Structured Queries

    Get PDF
    The adoption of the eXtensible Markup Language (XML) as the standard format to store and exchange semi-structure data has been gaining momentum. The growing number of XML documents leads to the need for appropriate XML querying algorithms which are able to retrieve XML data efficiently. Due to the importance of twig pattern matching in XML retrieval systems, finding all matching occurrences of a tree pattern query in an XML document is often considered as a specific task for XML databases as well as a core operation in XML query processing. This thesis presents a design and implementation of a new indexing technique, called the Child Prime Label (CPL) which exploits the property of prime numbers to identify Parent-Child (P-C) edges in twig pattern queries (TPQs) during query evaluation. The CPL approach can be incorporated efficiently within the existing labelling schemes. The major contributions of this thesis can be seen as a set of novel twig matching algorithms which apply the CPL approach and focus on reducing the overhead of storing useless elements and performing unnecessary computations during the output enumeration. The research presented here is the first to provide an efficient and general solution for TPQs containing ordering constraints and positional predicates specified by the XML query languages. To evaluate the CPL approaches, the holistic model was implemented as an experimental prototype in which the approaches proposed are compared against state-of-the-art holistic twig algorithms. Extensive performance studies on various real-world and artificial datasets were conducted to demonstrate the significant improvement of the CPL approaches over the previous indexing and querying methods. The experimental results demonstrate the validity and improvements of the new algorithms over other related methods on common various subclasses of TPQs. Moreover, the scalability tests reveal that the new algorithms are more suitable for processing large XML datasets

    Towards an effective processing of XML keyword query

    Get PDF
    Ph.DDOCTOR OF PHILOSOPH

    Algorithms for XML stream processing : massive data, external memory and scalable performance

    Get PDF
    Many modern applications require processing of massive streams of XML data, creating difficult technical challenges. Among these, there is the design and implementation of applications to optimize the processing of XPath queries and to provide an accurate cost estimation for these queries processed on a massive steam of XML data. In this thesis, we propose a novel performance prediction model which a priori estimates the cost (in terms of space used and time spent) for any structural query belonging to Forward XPath. In doing so, we perform an experimental study to confirm the linear relationship between stream-processing and data-access resources. Therefore, we introduce a mathematical model (linear regression functions) to predict the cost for a given XPath query. Moreover, we introduce a new selectivity estimation technique. It consists of two elements. The first one is the path tree structure synopsis: a concise, accurate, and convenient summary of the structure of an XML document. The second one is the selectivity estimation algorithm: an efficient streamquerying algorithm to traverse the path tree synopsis for estimating the values of cost-parameters. Those parameters are used by the mathematical model to determine the cost of a given XPath query. We compare the performance of our model with existing approaches. Furthermore, we present a use case for an online stream-querying system. The system uses our performance predicate model to estimate the cost for a given XPath query in terms of time/memory. Moreover, it provides an accurate answer for the query's sender. This use case illustrates the practical advantages of performance management with our techniques.Plusieurs applications modernes nรฉcessitent un traitement de flux massifs de donnรฉes XML, cela crรฉe de dรฉfis techniques. Parmi ces derniers, il y a la conception et la mise en ouvre d'outils pour optimiser le traitement des requรชtes XPath et fournir une estimation prรฉcise des coรปts de ces requรชtes traitรฉes sur un flux massif de donnรฉes XML. Dans cette thรจse, nous proposons un nouveau modรจle de prรฉdiction de performance qui estime a priori le coรปt (en termes d'espace utilisรฉ et de temps รฉcoulรฉ) pour les requรชtes structurelles de Forward XPath. Ce faisant, nous rรฉalisons une รฉtude expรฉrimentale pour confirmer la relation linรฉaire entre le traitement de flux, et les ressources d'accรจs aux donnรฉes. Par consรฉquent, nous prรฉsentons un modรจle mathรฉmatique (fonctions de rรฉgression linรฉaire) pour prรฉvoir le coรปt d'une requรชte XPath donnรฉe. En outre, nous prรฉsentons une technique nouvelle d'estimation de sรฉlectivitรฉ. Elle se compose de deux รฉlรฉments. Le premier est le rรฉsumรฉ path tree: une prรฉsentation concise et prรฉcise de la structure d'un document XML. Le second est l'algorithme d'estimation de sรฉlectivitรฉ: un algorithme efficace de flux pour traverser le synopsis path tree pour estimer les valeurs des paramรจtres de coรปt. Ces paramรจtres sont utilisรฉs par le modรจle mathรฉmatique pour dรฉterminer le coรปt d'une requรชte XPath donnรฉe. Nous comparons les performances de notre modรจle avec les approches existantes. De plus, nous prรฉsentons un cas d'utilisation d'un systรจme en ligne appelรฉ "online stream-querying system". Le systรจme utilise notre modรจle de prรฉdiction de performance pour estimer le coรปt (en termes de temps / mรฉmoire) d'une requรชte XPath donnรฉe. En outre, il fournit une rรฉponse prรฉcise ร  l'auteur de la requรชte. Ce cas d'utilisation illustre les avantages pratiques de gestion de performance avec nos technique

    IDEAS-1997-2021-Final-Programs

    Get PDF
    This document records the final program for each of the 26 meetings of the International Database and Engineering Application Symposium from 1997 through 2021. These meetings were organized in various locations on three continents. Most of the papers published during these years are in the digital libraries of IEEE(1997-2007) or ACM(2008-2021)

    ๊ฐ„๊ฒฐํ•œ ์ž๋ฃŒ๊ตฌ์กฐ๋ฅผ ํ™œ์šฉํ•œ ๋ฐ˜๊ตฌ์กฐํ™”๋œ ๋ฌธ์„œ ํ˜•์‹๋“ค์˜ ๊ณต๊ฐ„ ํšจ์œจ์  ํ‘œํ˜„๋ฒ•

    Get PDF
    ํ•™์œ„๋…ผ๋ฌธ (๋ฐ•์‚ฌ) -- ์„œ์šธ๋Œ€ํ•™๊ต ๋Œ€ํ•™์› : ๊ณต๊ณผ๋Œ€ํ•™ ์ „๊ธฐยท์ปดํ“จํ„ฐ๊ณตํ•™๋ถ€, 2021. 2. Srinivasa Rao Satti.Numerous big data are generated from a plethora of sources. Most of the data stored as files contain a non-fixed type of schema, so that the files are suitable to be maintained as semi-structured document formats. A number of those formats, such as XML (eXtensible Markup Language), JSON (JavaScript Object Notation), and YAML (YAML Ain't Markup Language) are suggested to sustain hierarchy in the original corpora of data. Several data models structuring the gathered data - including RDF (Resource Description Framework) - depend on the semi-structured document formats to be serialized and transferred for future processing. Since the semi-structured document formats focus on readability and verbosity, redundant space is required to organize and maintain the document. Even though general-purpose compression schemes are widely used to compact the documents, applying those algorithms hinder future handling of the corpora, owing to loss of internal structures. The area of succinct data structures is widely investigated and researched in theory, to provide answers to the queries while the encoded data occupy space close to the information-theoretic lower bound. Bit vectors and trees are the notable succinct data structures. Nevertheless, there were few attempts to apply the idea of succinct data structures to represent the semi-structured documents in space-efficient manner. In this dissertation we propose a unified, space-efficient representation of various semi-structured document formats. The core functionality of this representation is its compactness and query-ability derived from enriched functions of succinct data structures. Incorporation of (a) bit indexed arrays, (b) succinct ordinal trees, and (c) compression techniques engineers the compact representation. We implement this representation in practice, and show by experiments that construction of this representation decreases the disk usage by up to 60% while occupying 90% less RAM. We also allow processing a document in partial manner, to allow processing of larger corpus of big data even in the constrained environment. In parallel to establishing the aforementioned compact semi-structured document representation, we provide and reinforce some of the existing compression schemes in this dissertation. We first suggest an idea to encode an array of integers that is not necessarily sorted. This compaction scheme improves upon the existing universal code systems, by assistance of succinct bit vector structure. We show that our suggested algorithm reduces space usage by up to 44% while consuming 15% less time than the original code system, while the algorithm additionally supports random access of elements upon the encoded array. We also reinforce the SBH bitmap index compression algorithm. The main strength of this scheme is the use of intermediate super-bucket during operations, giving better performance on querying through a combination of compressed bitmap indexes. Inspired from splits done during the intermediate process of the SBH algorithm, we give an improved compression mechanism supporting parallelism that could be utilized in both CPUs and GPUs. We show by experiments that this CPU parallel processing optimization diminishes compression and decompression times by up to 38% in a 4-core machine without modifying the bitmap compressed form. For GPUs, the new algorithm gives 48% faster query processing time in the experiments, compared to the previous existing bitmap index compression schemes.์…€ ์ˆ˜ ์—†๋Š” ๋น… ๋ฐ์ดํ„ฐ๊ฐ€ ๋‹ค์–‘ํ•œ ์›๋ณธ๋กœ๋ถ€ํ„ฐ ์ƒ์„ฑ๋˜๊ณ  ์žˆ๋‹ค. ์ด๋“ค ๋ฐ์ดํ„ฐ์˜ ๋Œ€๋ถ€๋ถ„์€ ๊ณ ์ •๋˜์ง€ ์•Š์€ ์ข…๋ฅ˜์˜ ์Šคํ‚ค๋งˆ๋ฅผ ํฌํ•จํ•œ ํŒŒ์ผ ํ˜•ํƒœ๋กœ ์ €์žฅ๋˜๋Š”๋ฐ, ์ด๋กœ ์ธํ•˜์—ฌ ๋ฐ˜๊ตฌ์กฐํ™”๋œ ๋ฌธ์„œ ํ˜•์‹์„ ์ด์šฉํ•˜์—ฌ ํŒŒ์ผ์„ ์œ ์ง€ํ•˜๋Š” ๊ฒƒ์ด ์ ํ•ฉํ•˜๋‹ค. XML, JSON ๋ฐ YAML๊ณผ ๊ฐ™์€ ์ข…๋ฅ˜์˜ ๋ฐ˜๊ตฌ์กฐํ™”๋œ ๋ฌธ์„œ ํ˜•์‹์ด ๋ฐ์ดํ„ฐ์— ๋‚ด์žฌํ•˜๋Š” ๊ตฌ์กฐ๋ฅผ ์œ ์ง€ํ•˜๊ธฐ ์œ„ํ•˜์—ฌ ์ œ์•ˆ๋˜์—ˆ๋‹ค. ์ˆ˜์ง‘๋œ ๋ฐ์ดํ„ฐ๋ฅผ ๊ตฌ์กฐํ™”ํ•˜๋Š” RDF์™€ ๊ฐ™์€ ์—ฌ๋Ÿฌ ๋ฐ์ดํ„ฐ ๋ชจ๋ธ๋“ค์€ ์‚ฌํ›„ ์ฒ˜๋ฆฌ๋ฅผ ์œ„ํ•œ ์ €์žฅ ๋ฐ ์ „์†ก์„ ์œ„ํ•˜์—ฌ ๋ฐ˜๊ตฌ์กฐํ™”๋œ ๋ฌธ์„œ ํ˜•์‹์— ์˜์กดํ•œ๋‹ค. ๋ฐ˜๊ตฌ์กฐํ™”๋œ ๋ฌธ์„œ ํ˜•์‹์€ ๊ฐ€๋…์„ฑ๊ณผ ๋‹ค๋ณ€์„ฑ์— ์ง‘์ค‘ํ•˜๊ธฐ ๋•Œ๋ฌธ์—, ๋ฌธ์„œ๋ฅผ ๊ตฌ์กฐํ™”ํ•˜๊ณ  ์œ ์ง€ํ•˜๊ธฐ ์œ„ํ•˜์—ฌ ์ถ”๊ฐ€์ ์ธ ๊ณต๊ฐ„์„ ํ•„์š”๋กœ ํ•œ๋‹ค. ๋ฌธ์„œ๋ฅผ ์••์ถ•์‹œํ‚ค๊ธฐ ์œ„ํ•˜์—ฌ ์ผ๋ฐ˜์ ์ธ ์••์ถ• ๊ธฐ๋ฒ•๋“ค์ด ๋„๋ฆฌ ์‚ฌ์šฉ๋˜๊ณ  ์žˆ์œผ๋‚˜, ์ด๋“ค ๊ธฐ๋ฒ•๋“ค์„ ์ ์šฉํ•˜๊ฒŒ ๋˜๋ฉด ๋ฌธ์„œ์˜ ๋‚ด๋ถ€ ๊ตฌ์กฐ์˜ ์†์‹ค๋กœ ์ธํ•˜์—ฌ ๋ฐ์ดํ„ฐ์˜ ์‚ฌํ›„ ์ฒ˜๋ฆฌ๊ฐ€ ์–ด๋ ต๊ฒŒ ๋œ๋‹ค. ๋ฐ์ดํ„ฐ๋ฅผ ์ •๋ณด์ด๋ก ์  ํ•˜ํ•œ์— ๊ฐ€๊นŒ์šด ๊ณต๊ฐ„๋งŒ์„ ์‚ฌ์šฉํ•˜์—ฌ ์ €์žฅ์„ ๊ฐ€๋Šฅํ•˜๊ฒŒ ํ•˜๋ฉด์„œ ์งˆ์˜์— ๋Œ€ํ•œ ์‘๋‹ต์„ ์ œ๊ณตํ•˜๋Š” ๊ฐ„๊ฒฐํ•œ ์ž๋ฃŒ๊ตฌ์กฐ๋Š” ์ด๋ก ์ ์œผ๋กœ ๋„๋ฆฌ ์—ฐ๊ตฌ๋˜๊ณ  ์žˆ๋Š” ๋ถ„์•ผ์ด๋‹ค. ๋น„ํŠธ์—ด๊ณผ ํŠธ๋ฆฌ๊ฐ€ ๋„๋ฆฌ ์•Œ๋ ค์ง„ ๊ฐ„๊ฒฐํ•œ ์ž๋ฃŒ๊ตฌ์กฐ๋“ค์ด๋‹ค. ๊ทธ๋Ÿฌ๋‚˜ ๋ฐ˜๊ตฌ์กฐํ™”๋œ ๋ฌธ์„œ๋“ค์„ ์ €์žฅํ•˜๋Š” ๋ฐ ๊ฐ„๊ฒฐํ•œ ์ž๋ฃŒ๊ตฌ์กฐ์˜ ์•„์ด๋””์–ด๋ฅผ ์ ์šฉํ•œ ์—ฐ๊ตฌ๋Š” ๊ฑฐ์˜ ์ง„ํ–‰๋˜์ง€ ์•Š์•˜๋‹ค. ๋ณธ ํ•™์œ„๋…ผ๋ฌธ์„ ํ†ตํ•ด ์šฐ๋ฆฌ๋Š” ๋‹ค์–‘ํ•œ ์ข…๋ฅ˜์˜ ๋ฐ˜๊ตฌ์กฐํ™”๋œ ๋ฌธ์„œ ํ˜•์‹์„ ํ†ต์ผ๋˜๊ฒŒ ํ‘œํ˜„ํ•˜๋Š” ๊ณต๊ฐ„ ํšจ์œจ์  ํ‘œํ˜„๋ฒ•์„ ์ œ์‹œํ•œ๋‹ค. ์ด ๊ธฐ๋ฒ•์˜ ์ฃผ์š”ํ•œ ๊ธฐ๋Šฅ์€ ๊ฐ„๊ฒฐํ•œ ์ž๋ฃŒ๊ตฌ์กฐ๊ฐ€ ๊ฐ•์ ์œผ๋กœ ๊ฐ€์ง€๋Š” ํŠน์„ฑ์— ๊ธฐ๋ฐ˜ํ•œ ๊ฐ„๊ฒฐ์„ฑ๊ณผ ์งˆ์˜ ๊ฐ€๋Šฅ์„ฑ์ด๋‹ค. ๋น„ํŠธ์—ด๋กœ ์ธ๋ฑ์‹ฑ๋œ ๋ฐฐ์—ด, ๊ฐ„๊ฒฐํ•œ ์ˆœ์„œ ์žˆ๋Š” ํŠธ๋ฆฌ ๋ฐ ๋‹ค์–‘ํ•œ ์••์ถ• ๊ธฐ๋ฒ•์„ ํ†ตํ•ฉํ•˜์—ฌ ํ•ด๋‹น ํ‘œํ˜„๋ฒ•์„ ๊ณ ์•ˆํ•˜์˜€๋‹ค. ์ด ๊ธฐ๋ฒ•์€ ์‹ค์žฌ์ ์œผ๋กœ ๊ตฌํ˜„๋˜์—ˆ๊ณ , ์‹คํ—˜์„ ํ†ตํ•˜์—ฌ ์ด ๊ธฐ๋ฒ•์„ ์ ์šฉํ•œ ๋ฐ˜๊ตฌ์กฐํ™”๋œ ๋ฌธ์„œ๋“ค์€ ์ตœ๋Œ€ 60% ์ ์€ ๋””์Šคํฌ ๊ณต๊ฐ„๊ณผ 90% ์ ์€ ๋ฉ”๋ชจ๋ฆฌ ๊ณต๊ฐ„์„ ํ†ตํ•ด ํ‘œํ˜„๋  ์ˆ˜ ์žˆ๋‹ค๋Š” ๊ฒƒ์„ ๋ณด์ธ๋‹ค. ๋”๋ถˆ์–ด ๋ณธ ํ•™์œ„๋…ผ๋ฌธ์—์„œ ๋ฐ˜๊ตฌ์กฐํ™”๋œ ๋ฌธ์„œ๋“ค์€ ๋ถ„ํ• ์ ์œผ๋กœ ํ‘œํ˜„์ด ๊ฐ€๋Šฅํ•จ์„ ๋ณด์ด๊ณ , ์ด๋ฅผ ํ†ตํ•˜์—ฌ ์ œํ•œ๋œ ํ™˜๊ฒฝ์—์„œ๋„ ๋น… ๋ฐ์ดํ„ฐ๋ฅผ ํ‘œํ˜„ํ•œ ๋ฌธ์„œ๋“ค์„ ์ฒ˜๋ฆฌํ•  ์ˆ˜ ์žˆ๋‹ค๋Š” ๊ฒƒ์„ ๋ณด์ธ๋‹ค. ์•ž์„œ ์–ธ๊ธ‰ํ•œ ๊ณต๊ฐ„ ํšจ์œจ์  ๋ฐ˜๊ตฌ์กฐํ™”๋œ ๋ฌธ์„œ ํ‘œํ˜„๋ฒ•์„ ๊ตฌ์ถ•ํ•จ๊ณผ ๋™์‹œ์—, ๋ณธ ํ•™์œ„๋…ผ๋ฌธ์—์„œ ์ด๋ฏธ ์กด์žฌํ•˜๋Š” ์••์ถ• ๊ธฐ๋ฒ• ์ค‘ ์ผ๋ถ€๋ฅผ ์ถ”๊ฐ€์ ์œผ๋กœ ๊ฐœ์„ ํ•œ๋‹ค. ์ฒซ์งธ๋กœ, ๋ณธ ํ•™์œ„๋…ผ๋ฌธ์—์„œ๋Š” ์ •๋ ฌ ์—ฌ๋ถ€์— ๊ด€๊ณ„์—†๋Š” ์ •์ˆ˜ ๋ฐฐ์—ด์„ ๋ถ€ํ˜ธํ™”ํ•˜๋Š” ์•„์ด๋””์–ด๋ฅผ ์ œ์‹œํ•œ๋‹ค. ์ด ๊ธฐ๋ฒ•์€ ์ด๋ฏธ ์กด์žฌํ•˜๋Š” ๋ฒ”์šฉ ์ฝ”๋“œ ์‹œ์Šคํ…œ์„ ๊ฐœ์„ ํ•œ ํ˜•ํƒœ๋กœ, ๊ฐ„๊ฒฐํ•œ ๋น„ํŠธ์—ด ์ž๋ฃŒ๊ตฌ์กฐ๋ฅผ ์ด์šฉํ•œ๋‹ค. ์ œ์•ˆ๋œ ์•Œ๊ณ ๋ฆฌ์ฆ˜์€ ๊ธฐ์กด ๋ฒ”์šฉ ์ฝ”๋“œ ์‹œ์Šคํ…œ์— ๋น„ํ•ด ์ตœ๋Œ€ 44\% ์ ์€ ๊ณต๊ฐ„์„ ์‚ฌ์šฉํ•  ๋ฟ๋งŒ ์•„๋‹ˆ๋ผ 15\% ์ ์€ ๋ถ€ํ˜ธํ™” ์‹œ๊ฐ„์„ ํ•„์š”๋กœ ํ•˜๋ฉฐ, ๊ธฐ์กด ์‹œ์Šคํ…œ์—์„œ ์ œ๊ณตํ•˜์ง€ ์•Š๋Š” ๋ถ€ํ˜ธํ™”๋œ ๋ฐฐ์—ด์—์„œ์˜ ์ž„์˜ ์ ‘๊ทผ์„ ์ง€์›ํ•œ๋‹ค. ๋˜ํ•œ ๋ณธ ํ•™์œ„๋…ผ๋ฌธ์—์„œ๋Š” ๋น„ํŠธ๋งต ์ธ๋ฑ์Šค ์••์ถ•์— ์‚ฌ์šฉ๋˜๋Š” SBH ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ๊ฐœ์„ ์‹œํ‚จ๋‹ค. ํ•ด๋‹น ๊ธฐ๋ฒ•์˜ ์ฃผ๋œ ๊ฐ•์ ์€ ๋ถ€ํ˜ธํ™”์™€ ๋ณตํ˜ธํ™” ์ง„ํ–‰ ์‹œ ์ค‘๊ฐ„ ๋งค๊ฐœ์ธ ์Šˆํผ๋ฒ„์ผ“์„ ์‚ฌ์šฉํ•จ์œผ๋กœ์จ ์—ฌ๋Ÿฌ ์••์ถ•๋œ ๋น„ํŠธ๋งต ์ธ๋ฑ์Šค์— ๋Œ€ํ•œ ์งˆ์˜ ์„ฑ๋Šฅ์„ ๊ฐœ์„ ์‹œํ‚ค๋Š” ๊ฒƒ์ด๋‹ค. ์œ„ ์••์ถ• ์•Œ๊ณ ๋ฆฌ์ฆ˜์˜ ์ค‘๊ฐ„ ๊ณผ์ •์—์„œ ์ง„ํ–‰๋˜๋Š” ๋ถ„ํ• ์—์„œ ์˜๊ฐ์„ ์–ป์–ด, ๋ณธ ํ•™์œ„๋…ผ๋ฌธ์—์„œ CPU ๋ฐ GPU์— ์ ์šฉ ๊ฐ€๋Šฅํ•œ ๊ฐœ์„ ๋œ ๋ณ‘๋ ฌํ™” ์••์ถ• ๋งค์ปค๋‹ˆ์ฆ˜์„ ์ œ์‹œํ•œ๋‹ค. ์‹คํ—˜์„ ํ†ตํ•ด CPU ๋ณ‘๋ ฌ ์ตœ์ ํ™”๊ฐ€ ์ด๋ฃจ์–ด์ง„ ์•Œ๊ณ ๋ฆฌ์ฆ˜์€ ์••์ถ•๋œ ํ˜•ํƒœ์˜ ๋ณ€ํ˜• ์—†์ด 4์ฝ”์–ด ์ปดํ“จํ„ฐ์—์„œ ์ตœ๋Œ€ 38\%์˜ ์••์ถ• ๋ฐ ํ•ด์ œ ์‹œ๊ฐ„์„ ๊ฐ์†Œ์‹œํ‚จ๋‹ค๋Š” ๊ฒƒ์„ ๋ณด์ธ๋‹ค. GPU ๋ณ‘๋ ฌ ์ตœ์ ํ™”๋Š” ๊ธฐ์กด์— ์กด์žฌํ•˜๋Š” GPU ๋น„ํŠธ๋งต ์••์ถ• ๊ธฐ๋ฒ•์— ๋น„ํ•ด 48\% ๋น ๋ฅธ ์งˆ์˜ ์ฒ˜๋ฆฌ ์‹œ๊ฐ„์„ ํ•„์š”๋กœ ํ•จ์„ ํ™•์ธํ•œ๋‹ค.Chapter 1 Introduction 1 1.1 Contribution 3 1.2 Organization 5 Chapter 2 Background 6 2.1 Model of Computation 6 2.2 Succinct Data Structures 7 Chapter 3 Space-efficient Representation of Integer Arrays 9 3.1 Introduction 9 3.2 Preliminaries 10 3.2.1 Universal Code System 10 3.2.2 Bit Vector 13 3.3 Algorithm Description 13 3.3.1 Main Principle 14 3.3.2 Optimization in the Implementation 16 3.4 Experimental Results 16 Chapter 4 Space-efficient Parallel Compressed Bitmap Index Processing 19 4.1 Introduction 19 4.2 Related Work 23 4.2.1 Byte-aligned Bitmap Code (BBC) 24 4.2.2 Word-Aligned Hybrid (WAH) 27 4.2.3 WAH-derived Algorithms 28 4.2.4 GPU-based WAH Algorithms 31 4.2.5 Super Byte-aligned Hybrid (SBH) 33 4.3 Parallelizing SBH 38 4.3.1 CPU Parallelism 38 4.3.2 GPU Parallelism 39 4.4 Experimental Results 40 4.4.1 Plain Version 41 4.4.2 Parallelized Version 46 4.4.3 Summary 49 Chapter 5 Space-efficient Representation of Semi-structured Document Formats 50 5.1 Preliminaries 50 5.1.1 Semi-structured Document Formats 50 5.1.2 Resource Description Framework 57 5.1.3 Succinct Ordinal Tree Representations 60 5.1.4 String Compression Schemes 64 5.2 Representation 66 5.2.1 Bit String Indexed Array 67 5.2.2 Main Structure 68 5.2.3 Single Document as a Collection of Chunks 72 5.2.4 Supporting Queries 73 5.3 Experimental Results 75 5.3.1 Datasets 76 5.3.2 Construction Time 78 5.3.3 RAM Usage during Construction 80 5.3.4 Disk Usage and Serialization Time 83 5.3.5 Chunk Division 83 5.3.6 String Compression 88 5.3.7 Query Time 89 Chapter 6 Conclusion 94 Bibliography 96 ์š”์•ฝ 109 Acknowledgements 111Docto
    corecore