364 research outputs found

    Fully-Functional Suffix Trees and Optimal Text Searching in BWT-runs Bounded Space

    Get PDF
    Indexing highly repetitive texts - such as genomic databases, software repositories and versioned text collections - has become an important problem since the turn of the millennium. A relevant compressibility measure for repetitive texts is r, the number of runs in their Burrows-Wheeler Transforms (BWTs). One of the earliest indexes for repetitive collections, the Run-Length FM-index, used O(r) space and was able to efficiently count the number of occurrences of a pattern of length m in the text (in loglogarithmic time per pattern symbol, with current techniques). However, it was unable to locate the positions of those occurrences efficiently within a space bounded in terms of r. In this paper we close this long-standing problem, showing how to extend the Run-Length FM-index so that it can locate the occ occurrences efficiently within O(r) space (in loglogarithmic time each), and reaching optimal time, O(m + occ), within O(r log log w ({\sigma} + n/r)) space, for a text of length n over an alphabet of size {\sigma} on a RAM machine with words of w = {\Omega}(log n) bits. Within that space, our index can also count in optimal time, O(m). Multiplying the space by O(w/ log {\sigma}), we support count and locate in O(dm log({\sigma})/we) and O(dm log({\sigma})/we + occ) time, which is optimal in the packed setting and had not been obtained before in compressed space. We also describe a structure using O(r log(n/r)) space that replaces the text and extracts any text substring of length ` in almost-optimal time O(log(n/r) + ` log({\sigma})/w). Within that space, we similarly provide direct access to suffix array, inverse suffix array, and longest common prefix array cells, and extend these capabilities to full suffix tree functionality, typically in O(log(n/r)) time per operation.Comment: submitted version; optimal count and locate in smaller space: O(r log log_w(n/r + sigma)

    Optimal-Time Text Indexing in BWT-runs Bounded Space

    Full text link
    Indexing highly repetitive texts --- such as genomic databases, software repositories and versioned text collections --- has become an important problem since the turn of the millennium. A relevant compressibility measure for repetitive texts is rr, the number of runs in their Burrows-Wheeler Transform (BWT). One of the earliest indexes for repetitive collections, the Run-Length FM-index, used O(r)O(r) space and was able to efficiently count the number of occurrences of a pattern of length mm in the text (in loglogarithmic time per pattern symbol, with current techniques). However, it was unable to locate the positions of those occurrences efficiently within a space bounded in terms of rr. Since then, a number of other indexes with space bounded by other measures of repetitiveness --- the number of phrases in the Lempel-Ziv parse, the size of the smallest grammar generating the text, the size of the smallest automaton recognizing the text factors --- have been proposed for efficiently locating, but not directly counting, the occurrences of a pattern. In this paper we close this long-standing problem, showing how to extend the Run-Length FM-index so that it can locate the occocc occurrences efficiently within O(r)O(r) space (in loglogarithmic time each), and reaching optimal time O(m+occ)O(m+occ) within O(rlogโก(n/r))O(r\log(n/r)) space, on a RAM machine of w=ฮฉ(logโกn)w=\Omega(\log n) bits. Within O(rlogโก(n/r))O(r\log (n/r)) space, our index can also count in optimal time O(m)O(m). Raising the space to O(rwlogโกฯƒ(n/r))O(r w\log_\sigma(n/r)), we support count and locate in O(mlogโก(ฯƒ)/w)O(m\log(\sigma)/w) and O(mlogโก(ฯƒ)/w+occ)O(m\log(\sigma)/w+occ) time, which is optimal in the packed setting and had not been obtained before in compressed space. We also describe a structure using O(rlogโก(n/r))O(r\log(n/r)) space that replaces the text and extracts any text substring of length โ„“\ell in almost-optimal time O(logโก(n/r)+โ„“logโก(ฯƒ)/w)O(\log(n/r)+\ell\log(\sigma)/w). (...continues...

    Compressed Text Indexes:From Theory to Practice!

    Full text link
    A compressed full-text self-index represents a text in a compressed form and still answers queries efficiently. This technology represents a breakthrough over the text indexing techniques of the previous decade, whose indexes required several times the size of the text. Although it is relatively new, this technology has matured up to a point where theoretical research is giving way to practical developments. Nonetheless this requires significant programming skills, a deep engineering effort, and a strong algorithmic background to dig into the research results. To date only isolated implementations and focused comparisons of compressed indexes have been reported, and they missed a common API, which prevented their re-use or deployment within other applications. The goal of this paper is to fill this gap. First, we present the existing implementations of compressed indexes from a practitioner's point of view. Second, we introduce the Pizza&Chili site, which offers tuned implementations and a standardized API for the most successful compressed full-text self-indexes, together with effective testbeds and scripts for their automatic validation and test. Third, we show the results of our extensive experiments on these codes with the aim of demonstrating the practical relevance of this novel and exciting technology

    Compressed Full-Text Indexes for Highly Repetitive Collections

    Get PDF
    This thesis studies problems related to compressed full-text indexes. A full-text index is a data structure for indexing textual (sequence) data, so that the occurrences of any query string in the data can be found efficiently. While most full-text indexes require much more space than the sequences they index, recent compressed indexes have overcome this limitation. These compressed indexes combine a compressed representation of the index with some extra information that allows decompressing any part of the data efficiently. This way, they provide similar functionality as the uncompressed indexes, while using only slightly more space than the compressed data. The efficiency of data compression is usually measured in terms of entropy. While entropy-based estimates predict the compressed size of most texts accurately, they fail with highly repetitive collections of texts. Examples of such collections include different versions of a document and the genomes of a number of individuals from the same population. While the entropy of a highly repetitive collection is usually similar to that of a text of the same kind, the collection can often be compressed much better than the entropy-based estimate. Most compressed full-text indexes are based on the Burrows-Wheeler transform (BWT). Originally intended for data compression, the BWT has deep connections with full-text indexes such as the suffix tree and the suffix array. With some additional information, these indexes can be simulated with the Burrows-Wheeler transform. The first contribution of this thesis is the first BWT-based index that can compress highly repetitive collections efficiently. Compressed indexes allow us to handle much larger data sets than the corresponding uncompressed indexes. To take full advantage of this, we need algorithms for constructing the compressed index directly, instead of first constructing an uncompressed index and then compressing it. The second contribution of this thesis is an algorithm for merging the BWT-based indexes of two text collections. By using this algorithm, we can derive better space-efficient construction algorithms for BWT-based indexes. The basic BWT-based indexes provide similar functionality as the suffix array. With some additional structures, the functionality can be extended to that of the suffix tree. One of the structures is an array storing the lengths of the longest common prefixes of lexicographically adjacent suffixes of the text. The third contribution of this thesis is a space-efficient algorithm for constructing this array, and a new compressed representation of the array. In the case of individual genomes, the highly repetitive collection can be considered a sample from a larger collection. This collection consists of a reference sequence and a set of possible differences from the reference, so that each sequence contains a subset of the differences. The fourth contribution of this thesis is a BWT-based index that extrapolates the larger collection from the sample and indexes it.Tรคssรค vรคitรถskirjassa kรคsitellรครคn tiivistettyjรค kokotekstihakemistoja tekstimuotoisille aineistoille. Kokotekstihakemistot ovat tietorakenteita, jotka mahdollistavat mielivaltaisten hahmojen esiintymien lรถytรคmisen tekstistรค tehokkaasti. Perinteiset kokotekstihakemistot, kuten loppuosapuut ja -taulukot, vievรคt moninkertaisesti tilaa itse aineistoon nรคhden. Viime aikoina on kuitenkin kehitetty tiivistettyjรค hakemistorakenteita, jotka tarjoavat vastaavan toiminnallisuuden alkuperรคistรค tekstiรค pienemmรคssรค tilassa. Tรคmรค on mahdollistanut aikaisempaa suurempien aineistojen kรคsittelyn. Tekstin tiivistyvyyttรค mitataan yleensรค suhteessa sen entropiaan. Vaikka entropiaan perustuvat arviot ovat useimmilla aineistoilla varsin tarkkoja, aliarvioivat ne vahvasti toisteisien aineistojen tiivistyvyyttรค. Esimerkkejรค tรคllaisista aineistoista ovat kokoelmat saman populaation yksilรถiden genomeita tai saman dokumentin eri versioita. Siinรค missรค tรคllaisen kokoelman entropia suhteessa aineiston kokoon on vastaava kuin yksittรคisellรค samaa tyyppiรค olevalla tekstillรค, tiivistyy kokoelma yleensรค huomattavasti paremmin kuin entropian perusteella voisi odottaa. Useimmat tiivistetyt kokotekstihakemistot perustuvat Burrows-Wheeler-muunnokseen (BWT), joka kehitettiin alun perin tekstimuotoisten aineistojen tiivistรคmiseen. Pian kuitenkin havaittiin, ettรค koska BWT muistuttaa rakenteeltaan loppuosapuuta ja -taulukkoa, voidaan sitรค kรคyttรครค niissรค tehtรคvien hakujen simulointiin. Tรคssรค vรคitรถskirjassa esitetรครคn ensimmรคinen BWT-pohjainen kokotekstihakemisto, joka pystyy tiivistรคmรครคn vahvasti toisteiset aineistot tehokkaasti. Tiivistettyjen tietorakenteiden kรคyttรถ mahdollistaa suurempien aineistoiden kรคsittelemisen kuin tavallisia tietorakenteita kรคytettรคessรค. Tรคmรค etu kuitenkin menetetรครคn, jos tiivistetty tietorakenne muodostetaan luomalla ensin vastaava tavallinen tietorakenne ja tiivistรคmรคllรค se. Tรคssรค vรคitรถskirjassa esitetรครคn aikaisempaa vรคhemmรคn muistia kรคyttรคviรค algoritmeja BWT-pohjaisten kokotekstihakemistojen muodostamiseen. Kokoelma yksilรถiden genomeita voidaan kรคsittรครค otokseksi suuremmasta kokoelmasta, joka koostuu populaation kaikkien yksilรถiden sekรค niiden hypoteettisten jรคlkelรคisten genomeista. Tรคllainen kokoelma voidaan esittรครค รครคrellisenรค automaattina, joka muodostuu referenssigenomista ja yksilรถiden genomeissa esiintyvistรค poikkeamista referenssistรค. Tรคssรค vรคitรถskirjassa esitetรครคn BWT-pohjaisten kokotekstihakemistojen yleistys, joka mahdollistaa tรคllaisten automaattien indeksoinnin

    Storage and retrieval of individual genomes

    Get PDF
    Volume: 5541A repetitive sequence collection is one where portions of a base sequence of length n are repeated many times with small variations, forming a collection of total length N. Examples of such collections are version control data and genome sequences of individuals, where the differences can be expressed by lists of basic edit operations. Flexible and efficient data analysis on a such typically huge collection is plausible using suffix trees. However, suffix tree occupies O(N log N) bits, which very soon inhibits in-memory analyses. Recent advances in full-text self-indexing reduce the space of suffix tree to O(N log ฯƒ) bits, where ฯƒ is the alphabet size. In practice, the space reduction is more than 10-fold, for example on suffix tree of Human Genome. However, this reduction factor remains constant when more sequences are added to the collection. We develop a new family of self-indexes suited for the repetitive sequence collection setting. Their expected space requirement depends only on the length n of the base sequence and the number s of variations in its repeated copies. That is, the space reduction factor is no longer constant, but depends on N / n. We believe the structures developed in this work will provide a fundamental basis for storage and retrieval of individual genomes as they become available due to rapid progress in the sequencing technologies.Peer reviewe

    Fast, Small and Exact: Infinite-order Language Modelling with Compressed Suffix Trees

    Get PDF
    Efficient methods for storing and querying are critical for scaling high-order n-gram language models to large corpora. We propose a language model based on compressed suffix trees, a representation that is highly compact and can be easily held in memory, while supporting queries needed in computing language model probabilities on-the-fly. We present several optimisations which improve query runtimes up to 2500x, despite only incurring a modest increase in construction time and memory usage. For large corpora and high Markov orders, our method is highly competitive with the state-of-the-art KenLM package. It imposes much lower memory requirements, often by orders of magnitude, and has runtimes that are either similar (for training) or comparable (for querying).Comment: 14 pages in Transactions of the Association for Computational Linguistics (TACL) 201

    ๊ฐ„๊ฒฐํ•œ ์ž๋ฃŒ๊ตฌ์กฐ๋ฅผ ํ™œ์šฉํ•œ ๋ฐ˜๊ตฌ์กฐํ™”๋œ ๋ฌธ์„œ ํ˜•์‹๋“ค์˜ ๊ณต๊ฐ„ ํšจ์œจ์  ํ‘œํ˜„๋ฒ•

    Get PDF
    ํ•™์œ„๋…ผ๋ฌธ (๋ฐ•์‚ฌ) -- ์„œ์šธ๋Œ€ํ•™๊ต ๋Œ€ํ•™์› : ๊ณต๊ณผ๋Œ€ํ•™ ์ „๊ธฐยท์ปดํ“จํ„ฐ๊ณตํ•™๋ถ€, 2021. 2. Srinivasa Rao Satti.Numerous big data are generated from a plethora of sources. Most of the data stored as files contain a non-fixed type of schema, so that the files are suitable to be maintained as semi-structured document formats. A number of those formats, such as XML (eXtensible Markup Language), JSON (JavaScript Object Notation), and YAML (YAML Ain't Markup Language) are suggested to sustain hierarchy in the original corpora of data. Several data models structuring the gathered data - including RDF (Resource Description Framework) - depend on the semi-structured document formats to be serialized and transferred for future processing. Since the semi-structured document formats focus on readability and verbosity, redundant space is required to organize and maintain the document. Even though general-purpose compression schemes are widely used to compact the documents, applying those algorithms hinder future handling of the corpora, owing to loss of internal structures. The area of succinct data structures is widely investigated and researched in theory, to provide answers to the queries while the encoded data occupy space close to the information-theoretic lower bound. Bit vectors and trees are the notable succinct data structures. Nevertheless, there were few attempts to apply the idea of succinct data structures to represent the semi-structured documents in space-efficient manner. In this dissertation we propose a unified, space-efficient representation of various semi-structured document formats. The core functionality of this representation is its compactness and query-ability derived from enriched functions of succinct data structures. Incorporation of (a) bit indexed arrays, (b) succinct ordinal trees, and (c) compression techniques engineers the compact representation. We implement this representation in practice, and show by experiments that construction of this representation decreases the disk usage by up to 60% while occupying 90% less RAM. We also allow processing a document in partial manner, to allow processing of larger corpus of big data even in the constrained environment. In parallel to establishing the aforementioned compact semi-structured document representation, we provide and reinforce some of the existing compression schemes in this dissertation. We first suggest an idea to encode an array of integers that is not necessarily sorted. This compaction scheme improves upon the existing universal code systems, by assistance of succinct bit vector structure. We show that our suggested algorithm reduces space usage by up to 44% while consuming 15% less time than the original code system, while the algorithm additionally supports random access of elements upon the encoded array. We also reinforce the SBH bitmap index compression algorithm. The main strength of this scheme is the use of intermediate super-bucket during operations, giving better performance on querying through a combination of compressed bitmap indexes. Inspired from splits done during the intermediate process of the SBH algorithm, we give an improved compression mechanism supporting parallelism that could be utilized in both CPUs and GPUs. We show by experiments that this CPU parallel processing optimization diminishes compression and decompression times by up to 38% in a 4-core machine without modifying the bitmap compressed form. For GPUs, the new algorithm gives 48% faster query processing time in the experiments, compared to the previous existing bitmap index compression schemes.์…€ ์ˆ˜ ์—†๋Š” ๋น… ๋ฐ์ดํ„ฐ๊ฐ€ ๋‹ค์–‘ํ•œ ์›๋ณธ๋กœ๋ถ€ํ„ฐ ์ƒ์„ฑ๋˜๊ณ  ์žˆ๋‹ค. ์ด๋“ค ๋ฐ์ดํ„ฐ์˜ ๋Œ€๋ถ€๋ถ„์€ ๊ณ ์ •๋˜์ง€ ์•Š์€ ์ข…๋ฅ˜์˜ ์Šคํ‚ค๋งˆ๋ฅผ ํฌํ•จํ•œ ํŒŒ์ผ ํ˜•ํƒœ๋กœ ์ €์žฅ๋˜๋Š”๋ฐ, ์ด๋กœ ์ธํ•˜์—ฌ ๋ฐ˜๊ตฌ์กฐํ™”๋œ ๋ฌธ์„œ ํ˜•์‹์„ ์ด์šฉํ•˜์—ฌ ํŒŒ์ผ์„ ์œ ์ง€ํ•˜๋Š” ๊ฒƒ์ด ์ ํ•ฉํ•˜๋‹ค. XML, JSON ๋ฐ YAML๊ณผ ๊ฐ™์€ ์ข…๋ฅ˜์˜ ๋ฐ˜๊ตฌ์กฐํ™”๋œ ๋ฌธ์„œ ํ˜•์‹์ด ๋ฐ์ดํ„ฐ์— ๋‚ด์žฌํ•˜๋Š” ๊ตฌ์กฐ๋ฅผ ์œ ์ง€ํ•˜๊ธฐ ์œ„ํ•˜์—ฌ ์ œ์•ˆ๋˜์—ˆ๋‹ค. ์ˆ˜์ง‘๋œ ๋ฐ์ดํ„ฐ๋ฅผ ๊ตฌ์กฐํ™”ํ•˜๋Š” RDF์™€ ๊ฐ™์€ ์—ฌ๋Ÿฌ ๋ฐ์ดํ„ฐ ๋ชจ๋ธ๋“ค์€ ์‚ฌํ›„ ์ฒ˜๋ฆฌ๋ฅผ ์œ„ํ•œ ์ €์žฅ ๋ฐ ์ „์†ก์„ ์œ„ํ•˜์—ฌ ๋ฐ˜๊ตฌ์กฐํ™”๋œ ๋ฌธ์„œ ํ˜•์‹์— ์˜์กดํ•œ๋‹ค. ๋ฐ˜๊ตฌ์กฐํ™”๋œ ๋ฌธ์„œ ํ˜•์‹์€ ๊ฐ€๋…์„ฑ๊ณผ ๋‹ค๋ณ€์„ฑ์— ์ง‘์ค‘ํ•˜๊ธฐ ๋•Œ๋ฌธ์—, ๋ฌธ์„œ๋ฅผ ๊ตฌ์กฐํ™”ํ•˜๊ณ  ์œ ์ง€ํ•˜๊ธฐ ์œ„ํ•˜์—ฌ ์ถ”๊ฐ€์ ์ธ ๊ณต๊ฐ„์„ ํ•„์š”๋กœ ํ•œ๋‹ค. ๋ฌธ์„œ๋ฅผ ์••์ถ•์‹œํ‚ค๊ธฐ ์œ„ํ•˜์—ฌ ์ผ๋ฐ˜์ ์ธ ์••์ถ• ๊ธฐ๋ฒ•๋“ค์ด ๋„๋ฆฌ ์‚ฌ์šฉ๋˜๊ณ  ์žˆ์œผ๋‚˜, ์ด๋“ค ๊ธฐ๋ฒ•๋“ค์„ ์ ์šฉํ•˜๊ฒŒ ๋˜๋ฉด ๋ฌธ์„œ์˜ ๋‚ด๋ถ€ ๊ตฌ์กฐ์˜ ์†์‹ค๋กœ ์ธํ•˜์—ฌ ๋ฐ์ดํ„ฐ์˜ ์‚ฌํ›„ ์ฒ˜๋ฆฌ๊ฐ€ ์–ด๋ ต๊ฒŒ ๋œ๋‹ค. ๋ฐ์ดํ„ฐ๋ฅผ ์ •๋ณด์ด๋ก ์  ํ•˜ํ•œ์— ๊ฐ€๊นŒ์šด ๊ณต๊ฐ„๋งŒ์„ ์‚ฌ์šฉํ•˜์—ฌ ์ €์žฅ์„ ๊ฐ€๋Šฅํ•˜๊ฒŒ ํ•˜๋ฉด์„œ ์งˆ์˜์— ๋Œ€ํ•œ ์‘๋‹ต์„ ์ œ๊ณตํ•˜๋Š” ๊ฐ„๊ฒฐํ•œ ์ž๋ฃŒ๊ตฌ์กฐ๋Š” ์ด๋ก ์ ์œผ๋กœ ๋„๋ฆฌ ์—ฐ๊ตฌ๋˜๊ณ  ์žˆ๋Š” ๋ถ„์•ผ์ด๋‹ค. ๋น„ํŠธ์—ด๊ณผ ํŠธ๋ฆฌ๊ฐ€ ๋„๋ฆฌ ์•Œ๋ ค์ง„ ๊ฐ„๊ฒฐํ•œ ์ž๋ฃŒ๊ตฌ์กฐ๋“ค์ด๋‹ค. ๊ทธ๋Ÿฌ๋‚˜ ๋ฐ˜๊ตฌ์กฐํ™”๋œ ๋ฌธ์„œ๋“ค์„ ์ €์žฅํ•˜๋Š” ๋ฐ ๊ฐ„๊ฒฐํ•œ ์ž๋ฃŒ๊ตฌ์กฐ์˜ ์•„์ด๋””์–ด๋ฅผ ์ ์šฉํ•œ ์—ฐ๊ตฌ๋Š” ๊ฑฐ์˜ ์ง„ํ–‰๋˜์ง€ ์•Š์•˜๋‹ค. ๋ณธ ํ•™์œ„๋…ผ๋ฌธ์„ ํ†ตํ•ด ์šฐ๋ฆฌ๋Š” ๋‹ค์–‘ํ•œ ์ข…๋ฅ˜์˜ ๋ฐ˜๊ตฌ์กฐํ™”๋œ ๋ฌธ์„œ ํ˜•์‹์„ ํ†ต์ผ๋˜๊ฒŒ ํ‘œํ˜„ํ•˜๋Š” ๊ณต๊ฐ„ ํšจ์œจ์  ํ‘œํ˜„๋ฒ•์„ ์ œ์‹œํ•œ๋‹ค. ์ด ๊ธฐ๋ฒ•์˜ ์ฃผ์š”ํ•œ ๊ธฐ๋Šฅ์€ ๊ฐ„๊ฒฐํ•œ ์ž๋ฃŒ๊ตฌ์กฐ๊ฐ€ ๊ฐ•์ ์œผ๋กœ ๊ฐ€์ง€๋Š” ํŠน์„ฑ์— ๊ธฐ๋ฐ˜ํ•œ ๊ฐ„๊ฒฐ์„ฑ๊ณผ ์งˆ์˜ ๊ฐ€๋Šฅ์„ฑ์ด๋‹ค. ๋น„ํŠธ์—ด๋กœ ์ธ๋ฑ์‹ฑ๋œ ๋ฐฐ์—ด, ๊ฐ„๊ฒฐํ•œ ์ˆœ์„œ ์žˆ๋Š” ํŠธ๋ฆฌ ๋ฐ ๋‹ค์–‘ํ•œ ์••์ถ• ๊ธฐ๋ฒ•์„ ํ†ตํ•ฉํ•˜์—ฌ ํ•ด๋‹น ํ‘œํ˜„๋ฒ•์„ ๊ณ ์•ˆํ•˜์˜€๋‹ค. ์ด ๊ธฐ๋ฒ•์€ ์‹ค์žฌ์ ์œผ๋กœ ๊ตฌํ˜„๋˜์—ˆ๊ณ , ์‹คํ—˜์„ ํ†ตํ•˜์—ฌ ์ด ๊ธฐ๋ฒ•์„ ์ ์šฉํ•œ ๋ฐ˜๊ตฌ์กฐํ™”๋œ ๋ฌธ์„œ๋“ค์€ ์ตœ๋Œ€ 60% ์ ์€ ๋””์Šคํฌ ๊ณต๊ฐ„๊ณผ 90% ์ ์€ ๋ฉ”๋ชจ๋ฆฌ ๊ณต๊ฐ„์„ ํ†ตํ•ด ํ‘œํ˜„๋  ์ˆ˜ ์žˆ๋‹ค๋Š” ๊ฒƒ์„ ๋ณด์ธ๋‹ค. ๋”๋ถˆ์–ด ๋ณธ ํ•™์œ„๋…ผ๋ฌธ์—์„œ ๋ฐ˜๊ตฌ์กฐํ™”๋œ ๋ฌธ์„œ๋“ค์€ ๋ถ„ํ• ์ ์œผ๋กœ ํ‘œํ˜„์ด ๊ฐ€๋Šฅํ•จ์„ ๋ณด์ด๊ณ , ์ด๋ฅผ ํ†ตํ•˜์—ฌ ์ œํ•œ๋œ ํ™˜๊ฒฝ์—์„œ๋„ ๋น… ๋ฐ์ดํ„ฐ๋ฅผ ํ‘œํ˜„ํ•œ ๋ฌธ์„œ๋“ค์„ ์ฒ˜๋ฆฌํ•  ์ˆ˜ ์žˆ๋‹ค๋Š” ๊ฒƒ์„ ๋ณด์ธ๋‹ค. ์•ž์„œ ์–ธ๊ธ‰ํ•œ ๊ณต๊ฐ„ ํšจ์œจ์  ๋ฐ˜๊ตฌ์กฐํ™”๋œ ๋ฌธ์„œ ํ‘œํ˜„๋ฒ•์„ ๊ตฌ์ถ•ํ•จ๊ณผ ๋™์‹œ์—, ๋ณธ ํ•™์œ„๋…ผ๋ฌธ์—์„œ ์ด๋ฏธ ์กด์žฌํ•˜๋Š” ์••์ถ• ๊ธฐ๋ฒ• ์ค‘ ์ผ๋ถ€๋ฅผ ์ถ”๊ฐ€์ ์œผ๋กœ ๊ฐœ์„ ํ•œ๋‹ค. ์ฒซ์งธ๋กœ, ๋ณธ ํ•™์œ„๋…ผ๋ฌธ์—์„œ๋Š” ์ •๋ ฌ ์—ฌ๋ถ€์— ๊ด€๊ณ„์—†๋Š” ์ •์ˆ˜ ๋ฐฐ์—ด์„ ๋ถ€ํ˜ธํ™”ํ•˜๋Š” ์•„์ด๋””์–ด๋ฅผ ์ œ์‹œํ•œ๋‹ค. ์ด ๊ธฐ๋ฒ•์€ ์ด๋ฏธ ์กด์žฌํ•˜๋Š” ๋ฒ”์šฉ ์ฝ”๋“œ ์‹œ์Šคํ…œ์„ ๊ฐœ์„ ํ•œ ํ˜•ํƒœ๋กœ, ๊ฐ„๊ฒฐํ•œ ๋น„ํŠธ์—ด ์ž๋ฃŒ๊ตฌ์กฐ๋ฅผ ์ด์šฉํ•œ๋‹ค. ์ œ์•ˆ๋œ ์•Œ๊ณ ๋ฆฌ์ฆ˜์€ ๊ธฐ์กด ๋ฒ”์šฉ ์ฝ”๋“œ ์‹œ์Šคํ…œ์— ๋น„ํ•ด ์ตœ๋Œ€ 44\% ์ ์€ ๊ณต๊ฐ„์„ ์‚ฌ์šฉํ•  ๋ฟ๋งŒ ์•„๋‹ˆ๋ผ 15\% ์ ์€ ๋ถ€ํ˜ธํ™” ์‹œ๊ฐ„์„ ํ•„์š”๋กœ ํ•˜๋ฉฐ, ๊ธฐ์กด ์‹œ์Šคํ…œ์—์„œ ์ œ๊ณตํ•˜์ง€ ์•Š๋Š” ๋ถ€ํ˜ธํ™”๋œ ๋ฐฐ์—ด์—์„œ์˜ ์ž„์˜ ์ ‘๊ทผ์„ ์ง€์›ํ•œ๋‹ค. ๋˜ํ•œ ๋ณธ ํ•™์œ„๋…ผ๋ฌธ์—์„œ๋Š” ๋น„ํŠธ๋งต ์ธ๋ฑ์Šค ์••์ถ•์— ์‚ฌ์šฉ๋˜๋Š” SBH ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ๊ฐœ์„ ์‹œํ‚จ๋‹ค. ํ•ด๋‹น ๊ธฐ๋ฒ•์˜ ์ฃผ๋œ ๊ฐ•์ ์€ ๋ถ€ํ˜ธํ™”์™€ ๋ณตํ˜ธํ™” ์ง„ํ–‰ ์‹œ ์ค‘๊ฐ„ ๋งค๊ฐœ์ธ ์Šˆํผ๋ฒ„์ผ“์„ ์‚ฌ์šฉํ•จ์œผ๋กœ์จ ์—ฌ๋Ÿฌ ์••์ถ•๋œ ๋น„ํŠธ๋งต ์ธ๋ฑ์Šค์— ๋Œ€ํ•œ ์งˆ์˜ ์„ฑ๋Šฅ์„ ๊ฐœ์„ ์‹œํ‚ค๋Š” ๊ฒƒ์ด๋‹ค. ์œ„ ์••์ถ• ์•Œ๊ณ ๋ฆฌ์ฆ˜์˜ ์ค‘๊ฐ„ ๊ณผ์ •์—์„œ ์ง„ํ–‰๋˜๋Š” ๋ถ„ํ• ์—์„œ ์˜๊ฐ์„ ์–ป์–ด, ๋ณธ ํ•™์œ„๋…ผ๋ฌธ์—์„œ CPU ๋ฐ GPU์— ์ ์šฉ ๊ฐ€๋Šฅํ•œ ๊ฐœ์„ ๋œ ๋ณ‘๋ ฌํ™” ์••์ถ• ๋งค์ปค๋‹ˆ์ฆ˜์„ ์ œ์‹œํ•œ๋‹ค. ์‹คํ—˜์„ ํ†ตํ•ด CPU ๋ณ‘๋ ฌ ์ตœ์ ํ™”๊ฐ€ ์ด๋ฃจ์–ด์ง„ ์•Œ๊ณ ๋ฆฌ์ฆ˜์€ ์••์ถ•๋œ ํ˜•ํƒœ์˜ ๋ณ€ํ˜• ์—†์ด 4์ฝ”์–ด ์ปดํ“จํ„ฐ์—์„œ ์ตœ๋Œ€ 38\%์˜ ์••์ถ• ๋ฐ ํ•ด์ œ ์‹œ๊ฐ„์„ ๊ฐ์†Œ์‹œํ‚จ๋‹ค๋Š” ๊ฒƒ์„ ๋ณด์ธ๋‹ค. GPU ๋ณ‘๋ ฌ ์ตœ์ ํ™”๋Š” ๊ธฐ์กด์— ์กด์žฌํ•˜๋Š” GPU ๋น„ํŠธ๋งต ์••์ถ• ๊ธฐ๋ฒ•์— ๋น„ํ•ด 48\% ๋น ๋ฅธ ์งˆ์˜ ์ฒ˜๋ฆฌ ์‹œ๊ฐ„์„ ํ•„์š”๋กœ ํ•จ์„ ํ™•์ธํ•œ๋‹ค.Chapter 1 Introduction 1 1.1 Contribution 3 1.2 Organization 5 Chapter 2 Background 6 2.1 Model of Computation 6 2.2 Succinct Data Structures 7 Chapter 3 Space-efficient Representation of Integer Arrays 9 3.1 Introduction 9 3.2 Preliminaries 10 3.2.1 Universal Code System 10 3.2.2 Bit Vector 13 3.3 Algorithm Description 13 3.3.1 Main Principle 14 3.3.2 Optimization in the Implementation 16 3.4 Experimental Results 16 Chapter 4 Space-efficient Parallel Compressed Bitmap Index Processing 19 4.1 Introduction 19 4.2 Related Work 23 4.2.1 Byte-aligned Bitmap Code (BBC) 24 4.2.2 Word-Aligned Hybrid (WAH) 27 4.2.3 WAH-derived Algorithms 28 4.2.4 GPU-based WAH Algorithms 31 4.2.5 Super Byte-aligned Hybrid (SBH) 33 4.3 Parallelizing SBH 38 4.3.1 CPU Parallelism 38 4.3.2 GPU Parallelism 39 4.4 Experimental Results 40 4.4.1 Plain Version 41 4.4.2 Parallelized Version 46 4.4.3 Summary 49 Chapter 5 Space-efficient Representation of Semi-structured Document Formats 50 5.1 Preliminaries 50 5.1.1 Semi-structured Document Formats 50 5.1.2 Resource Description Framework 57 5.1.3 Succinct Ordinal Tree Representations 60 5.1.4 String Compression Schemes 64 5.2 Representation 66 5.2.1 Bit String Indexed Array 67 5.2.2 Main Structure 68 5.2.3 Single Document as a Collection of Chunks 72 5.2.4 Supporting Queries 73 5.3 Experimental Results 75 5.3.1 Datasets 76 5.3.2 Construction Time 78 5.3.3 RAM Usage during Construction 80 5.3.4 Disk Usage and Serialization Time 83 5.3.5 Chunk Division 83 5.3.6 String Compression 88 5.3.7 Query Time 89 Chapter 6 Conclusion 94 Bibliography 96 ์š”์•ฝ 109 Acknowledgements 111Docto
    • โ€ฆ
    corecore