Fully-Functional Suffix Trees and Optimal Text Searching in BWT-runs Bounded Space
Indexing highly repetitive texts - such as genomic databases, software
repositories and versioned text collections - has become an important problem
since the turn of the millennium. A relevant compressibility measure for
repetitive texts is r, the number of runs in their Burrows-Wheeler Transforms
(BWTs). One of the earliest indexes for repetitive collections, the Run-Length
FM-index, used O(r) space and was able to efficiently count the number of
occurrences of a pattern of length m in the text (in loglogarithmic time per
pattern symbol, with current techniques). However, it was unable to locate the
positions of those occurrences efficiently within a space bounded in terms of
r. In this paper we close this long-standing problem, showing how to extend the
Run-Length FM-index so that it can locate the occ occurrences efficiently
within O(r) space (in loglogarithmic time each), and reaching optimal time, O(m
+ occ), within O(r log log_w(σ + n/r)) space, for a text of length n
over an alphabet of size σ on a RAM machine with words of w =
Ω(log n) bits. Within that space, our index can also count in optimal
time, O(m). Multiplying the space by O(w / log σ), we support count and
locate in O(⌈m log(σ)/w⌉) and O(⌈m log(σ)/w⌉ + occ) time, which
is optimal in the packed setting and had not been obtained before in compressed
space. We also describe a structure using O(r log(n/r)) space that replaces the
text and extracts any text substring of length ℓ in almost-optimal time
O(log(n/r) + ℓ log(σ)/w). Within that space, we similarly provide direct
access to suffix array, inverse suffix array, and longest common prefix array
cells, and extend these capabilities to full suffix tree functionality,
typically in O(log(n/r)) time per operation.
Comment: submitted version; optimal count and locate in smaller space: O(r log
log_w(n/r + sigma))
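To make the measure r concrete, here is a minimal Python sketch that builds a BWT by sorting rotations and counts its runs. It is a toy construction for short strings only; the function names and the "$" sentinel are assumptions made for illustration and are not taken from the paper.

```python
def bwt(text: str, sentinel: str = "$") -> str:
    """Burrows-Wheeler Transform via sorted rotations (toy, quadratic space)."""
    if sentinel in text:
        raise ValueError("sentinel must not occur in the text")
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    # The BWT is the last column of the sorted rotation matrix.
    return "".join(rot[-1] for rot in rotations)


def count_runs(s: str) -> int:
    """Number of maximal runs of equal consecutive symbols (the measure r)."""
    return sum(1 for i, c in enumerate(s) if i == 0 or c != s[i - 1])


if __name__ == "__main__":
    for text in ("banana", "abc" * 20):
        transformed = bwt(text)
        print(len(text), count_runs(transformed), transformed)
```

For the repeated string "abc" * 20 the transform groups equal symbols together, so r stays small while n grows. That gap is what an O(r)-space index exploits.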
Optimal-Time Text Indexing in BWT-runs Bounded Space
Indexing highly repetitive texts --- such as genomic databases, software
repositories and versioned text collections --- has become an important problem
since the turn of the millennium. A relevant compressibility measure for
repetitive texts is r, the number of runs in their Burrows-Wheeler Transform
(BWT). One of the earliest indexes for repetitive collections, the Run-Length
FM-index, used O(r) space and was able to efficiently count the number of
occurrences of a pattern of length m in the text (in loglogarithmic time per
pattern symbol, with current techniques). However, it was unable to locate the
positions of those occurrences efficiently within a space bounded in terms of
r. Since then, a number of other indexes with space bounded by other measures
of repetitiveness --- the number of phrases in the Lempel-Ziv parse, the size
of the smallest grammar generating the text, the size of the smallest automaton
recognizing the text factors --- have been proposed for efficiently locating,
but not directly counting, the occurrences of a pattern. In this paper we close
this long-standing problem, showing how to extend the Run-Length FM-index so
that it can locate the occ occurrences efficiently within O(r) space (in
loglogarithmic time each), and reaching optimal time O(m + occ) within
O(r log(n/r)) space, on a RAM machine of w = Ω(log n) bits. Within
O(r log(n/r)) space, our index can also count in optimal time O(m).
Raising the space to O(r w log_σ(n/r)), we support count and locate in
O(m log(σ)/w) and O(m log(σ)/w + occ) time, which is optimal in the
packed setting and had not been obtained before in compressed space. We also
describe a structure using O(r log(n/r)) space that replaces the text and
extracts any text substring of length ℓ in almost-optimal time
O(log(n/r) + ℓ log(σ)/w). (...continues...
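The counting operation mentioned in the abstract can be illustrated with classic FM-index backward search, which processes the pattern in m steps from right to left. The sketch below is a simplification under stated assumptions: it works on a plain, uncompressed BWT and computes rank by scanning, so it shows the search logic only, not the O(r)-space run-length structure or its time bounds. All names are hypothetical.

```python
from collections import Counter


def bwt(text: str, sentinel: str = "$") -> str:
    """Toy BWT via sorted rotations; fine for short strings only."""
    s = text + sentinel
    return "".join(rot[-1] for rot in sorted(s[i:] + s[:i] for i in range(len(s))))


def count_occurrences(bwt_str: str, pattern: str) -> int:
    """Count pattern occurrences in the original text using backward search."""
    # C[c] = number of symbols in the text that are strictly smaller than c.
    counts = Counter(bwt_str)
    C, total = {}, 0
    for c in sorted(counts):
        C[c] = total
        total += counts[c]

    lo, hi = 0, len(bwt_str)  # half-open suffix-array interval [lo, hi)
    for c in reversed(pattern):
        if c not in C:
            return 0
        # rank(c, i) is computed naively here as a prefix count.
        lo = C[c] + bwt_str[:lo].count(c)
        hi = C[c] + bwt_str[:hi].count(c)
        if lo >= hi:
            return 0
    return hi - lo


if __name__ == "__main__":
    index = bwt("banana")
    print(count_occurrences(index, "ana"))
```

A practical index replaces the prefix counts with rank structures over the BWT. The Run-Length FM-index stores those structures over the r runs rather than over all n symbols.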
Compressed Text Indexes: From Theory to Practice!
A compressed full-text self-index represents a text in a compressed form and
still answers queries efficiently. This technology represents a breakthrough
over the text indexing techniques of the previous decade, whose indexes
required several times the size of the text. Although it is relatively new,
this technology has matured up to a point where theoretical research is giving
way to practical developments. Nonetheless this requires significant
programming skills, a deep engineering effort, and a strong algorithmic
background to dig into the research results. To date only isolated
implementations and focused comparisons of compressed indexes have been
reported, and they lacked a common API, which prevented their re-use or
deployment within other applications.
The goal of this paper is to fill this gap. First, we present the existing
implementations of compressed indexes from a practitioner's point of view.
Second, we introduce the Pizza&Chili site, which offers tuned implementations
and a standardized API for the most successful compressed full-text
self-indexes, together with effective testbeds and scripts for their automatic
validation and test. Third, we show the results of our extensive experiments on
these codes with the aim of demonstrating the practical relevance of this novel
and exciting technology.
Compressed Full-Text Indexes for Highly Repetitive Collections
This thesis studies problems related to compressed full-text indexes. A full-text index is a data structure for indexing textual (sequence) data, so that the occurrences of any query string in the data can be found efficiently. While most full-text indexes require much more space than the sequences they index, recent compressed indexes have overcome this limitation. These compressed indexes combine a compressed representation of the index with some extra information that allows decompressing any part of the data efficiently. This way, they provide similar functionality as the uncompressed indexes, while using only slightly more space than the compressed data.
The efficiency of data compression is usually measured in terms of entropy. While entropy-based estimates predict the compressed size of most texts accurately, they fail with highly repetitive collections of texts. Examples of such collections include different versions of a document and the genomes of a number of individuals from the same population. While the entropy of a highly repetitive collection is usually similar to that of a text of the same kind, the collection can often be compressed much better than the entropy-based estimate.
Most compressed full-text indexes are based on the Burrows-Wheeler transform (BWT). Originally intended for data compression, the BWT has deep connections with full-text indexes such as the suffix tree and the suffix array. With some additional information, these indexes can be simulated with the Burrows-Wheeler transform. The first contribution of this thesis is the first BWT-based index that can compress highly repetitive collections efficiently.
Compressed indexes allow us to handle much larger data sets than the corresponding uncompressed indexes. To take full advantage of this, we need algorithms for constructing the compressed index directly, instead of first constructing an uncompressed index and then compressing it. The second contribution of this thesis is an algorithm for merging the BWT-based indexes of two text collections. By using this algorithm, we can derive better space-efficient construction algorithms for BWT-based indexes.
The basic BWT-based indexes provide similar functionality as the suffix array. With some additional structures, the functionality can be extended to that of the suffix tree. One of the structures is an array storing the lengths of the longest common prefixes of lexicographically adjacent suffixes of the text. The third contribution of this thesis is a space-efficient algorithm for constructing this array, and a new compressed representation of the array.
In the case of individual genomes, the highly repetitive collection can be considered a sample from a larger collection. This collection consists of a reference sequence and a set of possible differences from the reference, so that each sequence contains a subset of the differences. The fourth contribution of this thesis is a BWT-based index that extrapolates the larger collection from the sample and indexes it.
This thesis deals with compressed full-text indexes for textual data. Full-text indexes are data structures that make it possible to find the occurrences of arbitrary patterns in a text efficiently. Traditional full-text indexes, such as suffix trees and suffix arrays, take many times the space of the data itself. Recently, however, compressed index structures have been developed that provide equivalent functionality in less space than the original text. This has made it possible to process larger data sets than before.
The compressibility of a text is usually measured relative to its entropy. Although entropy-based estimates are quite accurate for most data, they underestimate the compressibility of highly repetitive data. Examples of such data include collections of genomes of individuals from the same population and different versions of the same document. While the entropy of such a collection relative to its size is comparable to that of a single text of the same type, the collection usually compresses considerably better than the entropy would suggest.
Most compressed full-text indexes are based on the Burrows-Wheeler transform (BWT), which was originally developed for compressing textual data. It was soon observed, however, that because the BWT resembles the suffix tree and the suffix array in structure, it can be used to simulate the searches performed on them. This thesis presents the first BWT-based full-text index that can compress highly repetitive data efficiently.
Using compressed data structures makes it possible to process larger data sets than with ordinary data structures. This advantage is lost, however, if the compressed data structure is built by first creating the corresponding ordinary data structure and then compressing it. This thesis presents algorithms for constructing BWT-based full-text indexes that use less memory than earlier ones.
A collection of individual genomes can be viewed as a sample from a larger collection consisting of the genomes of all individuals in the population and of their hypothetical descendants. Such a collection can be represented as a finite automaton built from a reference genome and the deviations from the reference found in the individual genomes. This thesis presents a generalization of BWT-based full-text indexes that makes it possible to index such automata.
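The array described in the third contribution, which stores the lengths of the longest common prefixes of lexicographically adjacent suffixes, can be illustrated with a small Python sketch. This is a naive construction written for clarity, not the space-efficient algorithm or the compressed representation of the thesis. The function names are assumptions.

```python
def suffix_array(text: str) -> list:
    """Start positions of all suffixes in lexicographic order (naive sort)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def lcp_array(text: str, sa: list) -> list:
    """lcp[k] = length of the longest common prefix of suffixes sa[k-1] and sa[k].

    By convention lcp[0] = 0 because the first suffix has no predecessor.
    """
    def common_prefix(i: int, j: int) -> int:
        k = 0
        while i + k < len(text) and j + k < len(text) and text[i + k] == text[j + k]:
            k += 1
        return k

    return [0] + [common_prefix(sa[k - 1], sa[k]) for k in range(1, len(sa))]


if __name__ == "__main__":
    text = "banana"
    sa = suffix_array(text)
    for pos, lcp in zip(sa, lcp_array(text, sa)):
        print(pos, lcp, text[pos:])
```

Together with the suffix array, the LCP values describe the internal nodes of the suffix tree, which is why adding this array extends suffix-array functionality toward suffix-tree functionality.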
Storage and retrieval of individual genomes
Volume: 5541
A repetitive sequence collection is one where portions of a base sequence of length n are repeated many times with small variations, forming a collection of total length N. Examples of such collections are version control data and genome sequences of individuals, where the differences can be expressed by lists of basic edit operations. Flexible and efficient data analysis on such a typically huge collection is plausible using suffix trees. However, a suffix tree occupies O(N log N) bits, which very soon inhibits in-memory analyses. Recent advances in full-text self-indexing reduce the space of the suffix tree to O(N log σ) bits, where σ is the alphabet size. In practice, the space reduction is more than 10-fold, for example on the suffix tree of the Human Genome. However, this reduction factor remains constant when more sequences are added to the collection. We develop a new family of self-indexes suited for the repetitive sequence collection setting. Their expected space requirement depends only on the length n of the base sequence and the number s of variations in its repeated copies. That is, the space reduction factor is no longer constant, but depends on N / n. We believe the structures developed in this work will provide a fundamental basis for storage and retrieval of individual genomes as they become available due to rapid progress in the sequencing technologies.
Peer reviewed
Fast, Small and Exact: Infinite-order Language Modelling with Compressed Suffix Trees
Efficient methods for storing and querying are critical for scaling
high-order n-gram language models to large corpora. We propose a language model
based on compressed suffix trees, a representation that is highly compact and
can be easily held in memory, while supporting queries needed in computing
language model probabilities on-the-fly. We present several optimisations which
improve query runtimes up to 2500x, despite only incurring a modest increase in
construction time and memory usage. For large corpora and high Markov orders,
our method is highly competitive with the state-of-the-art KenLM package. It
imposes much lower memory requirements, often by orders of magnitude, and has
runtimes that are either similar (for training) or comparable (for querying).
Comment: 14 pages in Transactions of the Association for Computational
Linguistics (TACL) 201
Space-efficient Representation of Semi-structured Document Formats Using Succinct Data Structures
Thesis (Ph.D.) -- Seoul National University Graduate School: College of Engineering, Department of Electrical and Computer Engineering, 2021. 2. Srinivasa Rao Satti.
Large volumes of data are generated from a plethora of sources. Most of the data stored as files have a non-fixed schema, so the files are well suited to being maintained in semi-structured document formats. A number of those formats, such as XML (eXtensible Markup Language), JSON (JavaScript Object Notation), and YAML (YAML Ain't Markup Language), have been proposed to preserve the hierarchy in the original corpora of data. Several data models that structure the gathered data, including RDF (Resource Description Framework), depend on the semi-structured document formats to be serialized and transferred for future processing.
Since the semi-structured document formats focus on readability and verbosity, redundant space is required to organize and maintain the document. Even though general-purpose compression schemes are widely used to compact the documents, applying those algorithms hinders future handling of the corpora, owing to the loss of internal structures.
The area of succinct data structures is widely investigated in theory, with the aim of answering queries while the encoded data occupy space close to the information-theoretic lower bound. Bit vectors and trees are the notable succinct data structures. Nevertheless, there have been few attempts to apply the idea of succinct data structures to represent semi-structured documents in a space-efficient manner.
In this dissertation we propose a unified, space-efficient representation of various semi-structured document formats. The core features of this representation are its compactness and queryability, derived from the enriched functions of succinct data structures. The compact representation combines (a) bit indexed arrays, (b) succinct ordinal trees, and (c) compression techniques. We implement this representation in practice, and show by experiments that construction of this representation decreases the disk usage by up to 60% while occupying 90% less RAM. We also allow processing a document in a partial manner, so that a larger corpus of big data can be processed even in a constrained environment.
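One common way to represent an ordinal tree succinctly is the balanced-parentheses encoding, which uses 2 bits per node. The Python sketch below illustrates that general idea only. The abstract does not say which tree encoding the dissertation uses, so the encoding choice, the (label, children) tuple format, and the function names are all assumptions.

```python
def encode_bp(tree) -> list:
    """Encode a tree given as (label, [children]) into balanced parentheses.

    A 1 bit opens a node and a 0 bit closes it, so n nodes take 2n bits.
    """
    bits = []

    def visit(node):
        bits.append(1)
        for child in node[1]:
            visit(child)
        bits.append(0)

    visit(tree)
    return bits


def find_close(bits: list, i: int) -> int:
    """Position of the closing bit matching the opening bit at position i."""
    depth = 0
    for j in range(i, len(bits)):
        depth += 1 if bits[j] else -1
        if depth == 0:
            return j
    raise ValueError("unbalanced sequence")


def subtree_size(bits: list, i: int) -> int:
    """Number of nodes in the subtree whose opening bit is at position i."""
    return (find_close(bits, i) - i + 1) // 2


if __name__ == "__main__":
    doc = ("root", [("a", [("c", [])]), ("b", [])])
    bits = encode_bp(doc)
    print(bits, subtree_size(bits, 0))
```

The linear scan in find_close keeps the sketch short. Succinct tree structures answer the same query in constant time by adding o(n) bits of auxiliary index on top of the 2n-bit sequence.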
In parallel to establishing the aforementioned compact semi-structured document representation, we improve some of the existing compression schemes in this dissertation. We first suggest an idea to encode an array of integers that is not necessarily sorted. This compaction scheme improves upon the existing universal code systems with the assistance of a succinct bit vector structure. We show that our suggested algorithm reduces space usage by up to 44% while consuming 15% less time than the original code system, and that it additionally supports random access to elements of the encoded array.
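As background for the universal code systems mentioned above, the sketch below implements Elias gamma coding, a standard universal code for positive integers. It is shown only as an example of that family: it is not the scheme proposed in the dissertation, and the function names and the use of '0'/'1' strings instead of packed bits are assumptions made for readability.

```python
def gamma_encode(n: int) -> str:
    """Elias gamma code of a positive integer as a string of '0'/'1'.

    The code is (len - 1) zeros followed by the binary form of n.
    """
    if n < 1:
        raise ValueError("Elias gamma encodes positive integers only")
    binary = bin(n)[2:]
    return "0" * (len(binary) - 1) + binary


def gamma_decode_stream(bits: str) -> list:
    """Decode a concatenation of gamma codes back into a list of integers."""
    values, i = [], 0
    while i < len(bits):
        zeros = 0
        while bits[i] == "0":  # the unary prefix gives the payload length
            zeros += 1
            i += 1
        values.append(int(bits[i:i + zeros + 1], 2))
        i += zeros + 1
    return values


if __name__ == "__main__":
    data = [5, 1, 9, 2]  # need not be sorted
    stream = "".join(gamma_encode(x) for x in data)
    print(stream, gamma_decode_stream(stream))
```

Because the codewords have variable length, reaching the k-th element of the stream requires decoding everything before it. Supporting random access therefore needs extra information about codeword boundaries, for example a bit vector that marks where each codeword starts.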
We also improve the SBH bitmap index compression algorithm. The main strength of this scheme is the use of an intermediate super-bucket during operations, giving better query performance over a combination of compressed bitmap indexes. Inspired by the splits done during the intermediate process of the SBH algorithm, we give an improved compression mechanism supporting parallelism that can be utilized on both CPUs and GPUs. We show by experiments that this CPU parallel processing optimization reduces compression and decompression times by up to 38% on a 4-core machine without modifying the compressed bitmap form. For GPUs, the new algorithm gives 48% faster query processing time in the experiments, compared to existing bitmap index compression schemes.
Countless big data are being generated from a variety of sources. Most of these data are stored as files containing a non-fixed type of schema, which makes semi-structured document formats suitable for maintaining the files. Semi-structured document formats such as XML, JSON, and YAML were proposed to preserve the structure inherent in the data. Several data models that structure the collected data, such as RDF, depend on semi-structured document formats for storage and transfer for later processing.
Because semi-structured document formats focus on readability and verbosity, they require additional space to structure and maintain documents. General-purpose compression techniques are widely used to compress the documents, but applying them makes later processing of the data difficult owing to the loss of the documents' internal structure.
Succinct data structures, which store data in space close to the information-theoretic lower bound while still answering queries, are a widely studied area in theory. Bit vectors and trees are the well-known succinct data structures. However, little research has applied the idea of succinct data structures to storing semi-structured documents.
In this dissertation we present a space-efficient representation that expresses various kinds of semi-structured document formats in a unified way. The main features of this technique are compactness and queryability, based on the properties that are the strengths of succinct data structures. We devised the representation by combining bit-string indexed arrays, succinct ordinal trees, and various compression techniques. The technique has been implemented in practice, and experiments show that semi-structured documents to which it is applied can be represented using up to 60% less disk space and 90% less memory. We also show that semi-structured documents can be represented in partitioned form, so that documents representing big data can be processed even in a constrained environment.
Alongside building the space-efficient semi-structured document representation described above, this dissertation further improves some existing compression techniques. First, we propose an idea for encoding an integer array regardless of whether it is sorted. This technique improves on existing universal code systems and uses a succinct bit vector structure. The proposed algorithm uses up to 44% less space than existing universal code systems and requires 15% less encoding time, and it supports random access on the encoded array, which the existing systems do not provide.
We also improve the SBH algorithm used for bitmap index compression. The main strength of this technique is that it uses a super-bucket as an intermediary during encoding and decoding, which improves query performance over multiple compressed bitmap indexes. Inspired by the splits performed during the intermediate process of this compression algorithm, we propose an improved parallel compression mechanism applicable to both CPUs and GPUs. Experiments show that the CPU-parallel algorithm reduces compression and decompression times by up to 38% on a 4-core machine without changing the compressed form. The GPU-parallel optimization is confirmed to give 48% faster query processing than existing GPU bitmap compression techniques.
Chapter 1 Introduction
1.1 Contribution
1.2 Organization
Chapter 2 Background
2.1 Model of Computation
2.2 Succinct Data Structures
Chapter 3 Space-efficient Representation of Integer Arrays
3.1 Introduction
3.2 Preliminaries
3.2.1 Universal Code System
3.2.2 Bit Vector
3.3 Algorithm Description
3.3.1 Main Principle
3.3.2 Optimization in the Implementation
3.4 Experimental Results
Chapter 4 Space-efficient Parallel Compressed Bitmap Index Processing
4.1 Introduction
4.2 Related Work
4.2.1 Byte-aligned Bitmap Code (BBC)
4.2.2 Word-Aligned Hybrid (WAH)
4.2.3 WAH-derived Algorithms
4.2.4 GPU-based WAH Algorithms
4.2.5 Super Byte-aligned Hybrid (SBH)
4.3 Parallelizing SBH
4.3.1 CPU Parallelism
4.3.2 GPU Parallelism
4.4 Experimental Results
4.4.1 Plain Version
4.4.2 Parallelized Version
4.4.3 Summary
Chapter 5 Space-efficient Representation of Semi-structured Document Formats
5.1 Preliminaries
5.1.1 Semi-structured Document Formats
5.1.2 Resource Description Framework
5.1.3 Succinct Ordinal Tree Representations
5.1.4 String Compression Schemes
5.2 Representation
5.2.1 Bit String Indexed Array
5.2.2 Main Structure
5.2.3 Single Document as a Collection of Chunks
5.2.4 Supporting Queries
5.3 Experimental Results
5.3.1 Datasets
5.3.2 Construction Time
5.3.3 RAM Usage during Construction
5.3.4 Disk Usage and Serialization Time
5.3.5 Chunk Division
5.3.6 String Compression
5.3.7 Query Time
Chapter 6 Conclusion
Bibliography
Abstract (in Korean)
Acknowledgements
Docto