245 research outputs found
bdbms -- A Database Management System for Biological Data
Biologists are increasingly using databases for storing and managing their
data. Biological databases typically consist of a mixture of raw data,
metadata, sequences, annotations, and related data obtained from various
sources. Current database technology lacks several functionalities that are
needed by biological databases. In this paper, we introduce bdbms, an
extensible prototype database management system for supporting biological data.
bdbms extends the functionalities of current DBMSs to include: (1) Annotation
and provenance management including storage, indexing, manipulation, and
querying of annotation and provenance as first class objects in bdbms, (2)
Local dependency tracking to track the dependencies and derivations among data
items, (3) Update authorization to support data curation via content-based
authorization, in contrast to identity-based authorization, and (4) New access
methods and their supporting operators for pattern matching on various
types of compressed biological data. This paper presents the design of
bdbms along with the techniques proposed to support these functionalities
including an extension to SQL. We also outline some open issues in building
bdbms.
Comment: This article is published under a Creative Commons License Agreement
(http://creativecommons.org/licenses/by/2.5/). You may copy, distribute,
display, and perform the work, make derivative works and make commercial use
of the work, but you must attribute the work to the author and CIDR 2007.
3rd Biennial Conference on Innovative Data Systems Research (CIDR), January
7-10, 2007, Asilomar, California, US
Finding Top-k Dominance on Incomplete Big Data Using Map-Reduce Framework
Incomplete data is a major kind of multi-dimensional dataset in which missing values are randomly distributed across the dimensions. Retrieving information from such a dataset becomes very difficult as it grows large, and finding top-k dominant values in it is particularly challenging. Some algorithms exist to speed up this process, but most are efficient only on small incomplete datasets. One algorithm that makes top-k dominance (TKD) queries feasible is the Bitmap Index Guided (BIG) algorithm. It strongly improves performance on incomplete data, but it was not designed for, and is not capable of, finding top-k dominant values in incomplete big data. Several other algorithms have been proposed to answer TKD queries, such as the Skyband Based and Upper Bound Based algorithms, but their performance is also questionable. These earlier algorithms were among the first attempts to apply TKD queries to incomplete data; however, all of them performed poorly or were not compatible with incomplete data. This thesis proposes the MapReduced Enhanced Bitmap Index Guided Algorithm (MRBIG) to address these issues. MRBIG uses the MapReduce framework to improve the performance of top-k dominance queries on huge incomplete datasets. The framework divides the work among multiple computing nodes that operate independently and simultaneously to find the result. This method achieves up to two times faster processing in answering TKD queries than previously presented algorithms.
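To make the query concrete, here is a minimal brute-force sketch of a top-k dominance (TKD) query on incomplete data. It is not the BIG or MRBIG algorithm; it only illustrates what such algorithms compute. The dominance rule used here (compare only the dimensions observed in both objects, smaller is better), the `None` marker for missing values, and the names `dominates` and `top_k_dominating` are assumptions made for illustration and are not taken from the thesis.

```python
from typing import Optional, Sequence

Row = Sequence[Optional[float]]  # None marks a missing value (assumed convention)


def dominates(a: Row, b: Row) -> bool:
    """True if a dominates b on the dimensions observed in both (smaller is better)."""
    strictly_better = False
    for x, y in zip(a, b):
        if x is None or y is None:
            continue  # skip dimensions missing in either object
        if x > y:
            return False
        if x < y:
            strictly_better = True
    return strictly_better


def top_k_dominating(rows: Sequence[Row], k: int) -> list[tuple[int, int]]:
    """Return (row index, score) for the k rows that dominate the most other rows."""
    scores = [
        sum(dominates(a, b) for j, b in enumerate(rows) if j != i)
        for i, a in enumerate(rows)
    ]
    # Highest score first; ties are broken by row index.
    ranked = sorted(range(len(rows)), key=lambda i: (-scores[i], i))
    return [(i, scores[i]) for i in ranked[:k]]


data = [
    (1, None, 2),
    (2, 3, None),
    (None, 1, 3),
    (3, 4, 4),
]
print(top_k_dominating(data, 2))  # [(0, 3), (2, 2)]
```

The pairwise scan is quadratic in the number of objects, which is the cost that index-guided and parallel approaches try to reduce. In a MapReduce setting, one plausible split is to distribute the outer loop over objects across nodes and merge the per-node scores.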
Multi-level bitmap indexes for flash memory storage
Due to their low access latency, high read speed, and power-efficient operation, flash memory storage devices are rapidly emerging as an attractive alternative to traditional magnetic storage devices. However, tests show that the most efficient indexing methods are not able to take advantage of flash memory storage devices. In this paper, we present a set of multi-level bitmap indexes that can effectively take advantage of flash storage devices. These indexing methods use coarsely binned indexes to answer queries approximately, and then use finely binned indexes to refine the answers. Our new methods read significantly smaller volumes of data at the expense of an increased disk access count, thus taking full advantage of the improved read speed and low access latency of flash devices. To demonstrate the advantage of these new indexes, we measure their performance on a number of storage systems using a standard data warehousing benchmark called the Set Query Benchmark. We observe that multi-level strategies on flash drives are up to 3 times faster than traditional indexing strategies on magnetic disk drives.
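The coarse-then-fine idea can be sketched in memory as follows. This is an illustrative two-level binned bitmap index under stated assumptions, not the paper's implementation: values are integers, bitmaps are Python integers (bit i stands for row i), and the class name `TwoLevelBitmapIndex` and its parameters are hypothetical. It does not model storage I/O, which is where the paper's trade-off between data volume and access count arises.

```python
from collections import defaultdict


class TwoLevelBitmapIndex:
    """Coarse bins answer most of a range query; fine bins refine the edges."""

    def __init__(self, values, coarse_width, fine_width):
        assert coarse_width % fine_width == 0
        self.values = values
        self.cw, self.fw = coarse_width, fine_width
        self.coarse = defaultdict(int)  # bin number -> bitmap (bit i = row i)
        self.fine = defaultdict(int)
        for row, v in enumerate(values):
            self.coarse[v // coarse_width] |= 1 << row
            self.fine[v // fine_width] |= 1 << row

    def query(self, lo, hi):
        """Row ids with lo <= value < hi, in ascending order."""
        if hi <= lo:
            return []
        hits = 0
        for c in range(lo // self.cw, (hi - 1) // self.cw + 1):
            c_lo, c_hi = c * self.cw, (c + 1) * self.cw
            if lo <= c_lo and c_hi <= hi:
                hits |= self.coarse.get(c, 0)  # coarse bin fully inside the range
                continue
            # Edge coarse bin: refine with the fine bins it contains.
            f_first = max(lo, c_lo) // self.fw
            f_last = (min(hi, c_hi) - 1) // self.fw
            for f in range(f_first, f_last + 1):
                f_lo, f_hi = f * self.fw, (f + 1) * self.fw
                bitmap = self.fine.get(f, 0)
                if lo <= f_lo and f_hi <= hi:
                    hits |= bitmap
                else:
                    hits |= self._check(bitmap, lo, hi)  # edge fine bin
        return [row for row in range(len(self.values)) if (hits >> row) & 1]

    def _check(self, bitmap, lo, hi):
        """Candidate check: keep only rows whose raw value lies in [lo, hi)."""
        out, row = 0, 0
        while bitmap:
            if bitmap & 1 and lo <= self.values[row] < hi:
                out |= 1 << row
            bitmap >>= 1
            row += 1
        return out


index = TwoLevelBitmapIndex([3, 17, 25, 42, 48, 55, 61, 99], 20, 5)
print(index.query(15, 50))  # [1, 2, 3, 4]
print(index.query(16, 47))  # [1, 2, 3]
```

Only the fine bins at the two ends of the range ever need a check against the raw values; everything else is answered by OR-ing whole bitmaps.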
Column Imprints: A Secondary Index Structure
Large scale data warehouses rely heavily on secondary indexes,
such as bitmaps and B-trees, to limit access to slow I/O devices.
However, with the advent of large main memory systems, cache-conscious
secondary indexes are needed to also improve the transfer
bandwidth between memory and CPU. In this paper, we introduce
the column imprint, a simple but efficient cache-conscious secondary
index. A column imprint is a collection of many small bit
vectors, each indexing the data points of a single cacheline. An
imprint is used during query evaluation to limit data access and
thus minimize memory traffic. The compression for imprints is
CPU-friendly and exploits the empirical observation that data often
exhibits local clustering or partial ordering as a side effect of the
construction process. Most importantly, column imprint compression
remains effective and robust even in the case of unclustered
data, while other state-of-the-art solutions fail. We conducted an
extensive experimental evaluation to assess the applicability and
the performance impact of the column imprints. The storage overhead,
when experimenting with real-world datasets, is just a few
percent over the size of the columns being indexed. The evaluation
time for over 40,000 range queries of varying selectivity revealed
the efficiency of the proposed index compar
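As a rough illustration of the imprint idea described in this abstract, the sketch below builds one small bit vector per group of consecutive values (standing in for a cacheline) and uses it to skip groups during a range query. The histogram bin boundaries, the group size `values_per_line`, and the function names are assumptions for illustration; the compression of imprints mentioned in the abstract is not modeled.

```python
from bisect import bisect_right


def build_imprints(column, bin_bounds, values_per_line):
    """One bit vector per cacheline; bit b is set if some value falls in bin b."""
    imprints = []
    for start in range(0, len(column), values_per_line):
        bits = 0
        for v in column[start:start + values_per_line]:
            bits |= 1 << bisect_right(bin_bounds, v)  # bin number of v
        imprints.append(bits)
    return imprints


def range_query(column, imprints, bin_bounds, values_per_line, lo, hi):
    """Positions i with lo <= column[i] <= hi, plus the number of lines read."""
    first, last = bisect_right(bin_bounds, lo), bisect_right(bin_bounds, hi)
    mask = ((1 << (last - first + 1)) - 1) << first  # bins the range can touch
    result, lines_read = [], 0
    for line, bits in enumerate(imprints):
        if (bits & mask) == 0:
            continue  # no value of this line can match: skip it entirely
        lines_read += 1
        start = line * values_per_line
        for i in range(start, min(start + values_per_line, len(column))):
            if lo <= column[i] <= hi:
                result.append(i)
    return result, lines_read


column = [1, 2, 3, 4, 11, 12, 13, 14, 21, 35, 22, 36, 5, 6, 7, 8]
bounds = [10, 20, 30]  # four bins: <10, [10, 20), [20, 30), >=30
imprints = build_imprints(column, bounds, values_per_line=4)
print(range_query(column, imprints, bounds, 4, 12, 13))  # ([5, 6], 1)
```

In this example only one of the four lines is scanned, because the other three imprints have no bit set in the bin that the query range covers.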
Rapid Sampling for Visualizations with Ordering Guarantees
Visualizations are frequently used as a means to understand trends and gather
insights from datasets, but often take a long time to generate. In this paper,
we focus on the problem of rapidly generating approximate visualizations while
preserving crucial visual properties of interest to analysts. Our primary
focus will be on sampling algorithms that preserve the visual property of
ordering; our techniques will also apply to some other visual properties. For
instance, our algorithms can be used to generate an approximate visualization
of a bar chart very rapidly, where the comparisons between any two bars are
correct. We formally show that our sampling algorithms are generally applicable
and provably optimal in theory, in that they do not take more samples than
necessary to generate the visualizations with ordering guarantees. They also
work well in practice, correctly ordering output groups while taking orders of
magnitude fewer samples and much less time than conventional sampling schemes.
Comment: Tech Report. 17 pages. Condensed version to appear in VLDB Vol. 8 No.
Space-efficient Representation of Semi-structured Document Formats Using Succinct Data Structures
Thesis (Ph.D.) -- Seoul National University Graduate School: College of Engineering, Department of Electrical and Computer Engineering, 2021. 2. Srinivasa Rao Satti.
Large volumes of big data are generated from a plethora of sources. Most of this data is stored in files without a fixed schema, so the files are best maintained in semi-structured document formats. Several such formats, including XML (eXtensible Markup Language), JSON (JavaScript Object Notation), and YAML (YAML Ain't Markup Language), have been proposed to preserve the hierarchy in the original corpora. Several data models that structure the gathered data, including RDF (Resource Description Framework), depend on semi-structured document formats for serialization and transfer for later processing.
Because semi-structured document formats favor readability and verbosity, they need redundant space to organize and maintain a document. Although general-purpose compression schemes are widely used to compact such documents, applying them hinders later handling of the corpora, because the internal structure is lost.
Succinct data structures are widely studied in theory: they answer queries while the encoded data occupies space close to the information-theoretic lower bound. Bit vectors and trees are the notable succinct data structures. Nevertheless, there have been few attempts to apply succinct data structures to represent semi-structured documents in a space-efficient manner.
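The basic queries a succinct bit vector supports are rank and select. The sketch below shows their semantics with a simple block directory; it is an assumption-laden illustration (the class name `RankSelectBitVector` and the block size are hypothetical) and is not itself succinct, since a Python list of bits is far from the space bounds that real succinct structures achieve.

```python
from bisect import bisect_left


class RankSelectBitVector:
    """Bit vector with per-block prefix counts to speed up rank and select."""

    BLOCK = 64  # bits per block (illustrative choice)

    def __init__(self, bits):
        self.bits = list(bits)
        # block_ranks[b] = number of 1 bits before block b
        self.block_ranks = [0]
        for start in range(0, len(self.bits), self.BLOCK):
            ones = sum(self.bits[start:start + self.BLOCK])
            self.block_ranks.append(self.block_ranks[-1] + ones)

    def rank1(self, i):
        """Number of 1 bits in bits[0:i]."""
        block = i // self.BLOCK
        return self.block_ranks[block] + sum(self.bits[block * self.BLOCK:i])

    def select1(self, k):
        """Position of the k-th 1 bit, counting from k = 1."""
        if not 1 <= k <= self.block_ranks[-1]:
            raise ValueError("k out of range")
        # Find the block holding the k-th 1, then scan inside that block.
        block = bisect_left(self.block_ranks, k) - 1
        remaining = k - self.block_ranks[block]
        for pos in range(block * self.BLOCK, len(self.bits)):
            remaining -= self.bits[pos]
            if remaining == 0:
                return pos


bv = RankSelectBitVector([1, 0, 1, 1, 0, 0, 1, 0] * 20)
print(bv.rank1(70))    # 35
print(bv.select1(36))  # 70
```

The two queries are inverses in the sense that `rank1(select1(k)) == k - 1`, which is what lets tree and array representations navigate a bit string without decoding it.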
In this dissertation we propose a unified, space-efficient representation of various semi-structured document formats. The core strengths of this representation are its compactness and queryability, derived from the rich functionality of succinct data structures. The compact representation combines (a) bit indexed arrays, (b) succinct ordinal trees, and (c) compression techniques. We implement this representation in practice and show by experiments that constructing it decreases disk usage by up to 60% while occupying 90% less RAM. We also allow a document to be processed in parts, so that larger corpora of big data can be processed even in a constrained environment.
In parallel with establishing the compact semi-structured document representation, we also improve some existing compression schemes in this dissertation. We first suggest a way to encode an array of integers that is not necessarily sorted. This scheme improves upon existing universal code systems with the help of a succinct bit vector structure. We show that our algorithm reduces space usage by up to 44% and takes 15% less time than the original code system, while additionally supporting random access to elements of the encoded array.
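For readers unfamiliar with universal codes, the sketch below encodes an unsorted array of positive integers with the Elias gamma code and keeps the start offset of each code so that a single element can be decoded without scanning the stream. This is a generic illustration, not the scheme proposed in the dissertation: the choice of Elias gamma, the plain offset list (which stands in for a select query on a marker bit vector), and all function names are assumptions.

```python
def gamma_encode(n: int) -> str:
    """Elias gamma code of a positive integer, as a string of '0' and '1'."""
    if n < 1:
        raise ValueError("Elias gamma encodes positive integers only")
    binary = bin(n)[2:]
    return "0" * (len(binary) - 1) + binary


def encode_array(values):
    """Concatenate the gamma codes and record where each code starts."""
    parts, starts, pos = [], [], 0
    for v in values:
        code = gamma_encode(v)
        starts.append(pos)
        parts.append(code)
        pos += len(code)
    return "".join(parts), starts


def decode_at(stream: str, start: int) -> int:
    """Decode the single gamma code that begins at bit offset `start`."""
    zeros = 0
    while stream[start + zeros] == "0":
        zeros += 1
    return int(stream[start + zeros:start + 2 * zeros + 1], 2)


stream, starts = encode_array([9, 1, 5, 300])
print(len(stream), starts)           # 30 [0, 7, 8, 13]
print(decode_at(stream, starts[2]))  # 5
```

Storing the offsets as a bit vector with select support, rather than as a list of integers, is one way such random access can be kept compact.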
We also improve the SBH bitmap index compression algorithm. The main strength of this scheme is its use of an intermediate super-bucket during operations, which gives better query performance over combinations of compressed bitmap indexes. Inspired by the splits performed in the intermediate step of the SBH algorithm, we give an improved compression mechanism that supports parallelism on both CPUs and GPUs. We show by experiments that the CPU parallel optimization reduces compression and decompression times by up to 38% on a 4-core machine without modifying the compressed bitmap form. For GPUs, the new algorithm gives 48% faster query processing in the experiments than previous bitmap index compression schemes.
Chapter 1 Introduction
1.1 Contribution
1.2 Organization
Chapter 2 Background
2.1 Model of Computation
2.2 Succinct Data Structures
Chapter 3 Space-efficient Representation of Integer Arrays
3.1 Introduction
3.2 Preliminaries
3.2.1 Universal Code System
3.2.2 Bit Vector
3.3 Algorithm Description
3.3.1 Main Principle
3.3.2 Optimization in the Implementation
3.4 Experimental Results
Chapter 4 Space-efficient Parallel Compressed Bitmap Index Processing
4.1 Introduction
4.2 Related Work
4.2.1 Byte-aligned Bitmap Code (BBC)
4.2.2 Word-Aligned Hybrid (WAH)
4.2.3 WAH-derived Algorithms
4.2.4 GPU-based WAH Algorithms
4.2.5 Super Byte-aligned Hybrid (SBH)
4.3 Parallelizing SBH
4.3.1 CPU Parallelism
4.3.2 GPU Parallelism
4.4 Experimental Results
4.4.1 Plain Version
4.4.2 Parallelized Version
4.4.3 Summary
Chapter 5 Space-efficient Representation of Semi-structured Document Formats
5.1 Preliminaries
5.1.1 Semi-structured Document Formats
5.1.2 Resource Description Framework
5.1.3 Succinct Ordinal Tree Representations
5.1.4 String Compression Schemes
5.2 Representation
5.2.1 Bit String Indexed Array
5.2.2 Main Structure
5.2.3 Single Document as a Collection of Chunks
5.2.4 Supporting Queries
5.3 Experimental Results
5.3.1 Datasets
5.3.2 Construction Time
5.3.3 RAM Usage during Construction
5.3.4 Disk Usage and Serialization Time
5.3.5 Chunk Division
5.3.6 String Compression
5.3.7 Query Time
Chapter 6 Conclusion
Bibliography
Abstract (in Korean)
Acknowledgements
Docto