13 research outputs found
Designing a parallel suffix sort
Suffix sort plays a critical role in various computational algorithms
including genomics as well as in frequently used day to day software
applications. The sorting algorithm becomes tricky when we have lot of repeated
characters in the string for a given radix. Various innovative implementations
are available in this area e.g., Manber Myers. We present here an analysis that
uses a concept around generalized polynomial factorization to sort these
suffixes. The initial generation of these substring specific polynomial can be
efficiently done using parallel threads and shared memory. The set of distinct
factors and their order are known beforehand, and this helps us to sort the
polynomials (equivalent of strings) accordingly
Time-Space Trade-Offs for Lempel-Ziv Compressed Indexing
Given a string S, the compressed indexing problem is to preprocess S into a compressed representation that supports fast substring queries. The goal is to use little space relative to the compressed size of S while supporting fast queries. We present a compressed index based on the Lempel-Ziv 1977 compression scheme. Let n, and z denote the size of the input string, and the compressed LZ77 string, respectively. We obtain the following time-space trade-offs. Given a pattern string P of length m, we can solve the problem in
(i) O(m + occ lglg n) time using O(z lg(n/z) lglg z) space, or
(ii) O(m(1 + lg^e z / lg(n/z)) + occ(lglg n + lg^e z)) time using O(z lg(n/z)) space, for any 0 < e < 1
In particular, (i) improves the leading term in the query time of the previous best solution from O(m lg m) to O(m) at the cost of increasing the space by a factor lglg z. Alternatively, (ii) matches the previous best space bound, but has a leading term in the query time of O(m(1+lg^e z / lg(n/z))). However, for any polynomial compression ratio, i.e., z = O(n^{1-d}), for constant d > 0, this becomes O(m). Our index also supports extraction of any substring of length l in O(l + lg(n/z)) time. Technically, our results are obtained by novel extensions and combinations of existing data structures of independent interest, including a new batched variant of weak prefix search
Time-space trade-offs for lempel-ziv compressed indexing
Given a string , the \emph{compressed indexing problem} is to preprocess
into a compressed representation that supports fast \emph{substring
queries}. The goal is to use little space relative to the compressed size of
while supporting fast queries. We present a compressed index based on the
Lempel--Ziv 1977 compression scheme. We obtain the following time-space
trade-offs: For constant-sized alphabets; (i) time using
space, or (ii) time using space. For integer
alphabets polynomially bounded by ; (iii) time using space, or (iv) time using
space, where and are the length of
the input string and query string respectively, is the number of phrases in
the LZ77 parse of the input string, is the number of occurrences of the
query in the input and is an arbitrarily small constant. In
particular, (i) improves the leading term in the query time of the previous
best solution from to at the cost of increasing the space by
a factor . Alternatively, (ii) matches the previous best space
bound, but has a leading term in the query time of . However, for any polynomial compression ratio, i.e., , for constant , this becomes . Our index
also supports extraction of any substring of length in time. Technically, our results are obtained by novel extensions and
combinations of existing data structures of independent interest, including a
new batched variant of weak prefix search
Acceleration of k-Nearest Neighbor and SRAD Algorithms Using Intel FPGA SDK for OpenCL
Field Programmable Gate Arrays (FPGAs) have been widely used for accelerating machine learning algorithms. However, the high design cost and time for implementing FPGA-based accelerators using traditional HDL-based design methodologies has discouraged users from designing FPGA-based accelerators. In recent years, a new CAD tool called Intel FPGA SDK for OpenCL (IFSO) allowed fast and efficient design of FPGA-based hardware accelerators from high level specification such as OpenCL. Even software engineers with basic hardware design knowledge could design FPGA-based accelerators. In this thesis, IFSO has been used to explore acceleration of k-Nearest-Neighbour (kNN) algorithm and Speckle Reducing Anisotropic Diffusion (SRAD) simulation using FPGAs. kNN is a popular algorithm used in machine learning. Bitonic sorting and radix sorting algorithms were used in the kNN algorithm to check if these provide any performance improvements. Acceleration of SRAD simulation was also explored. The experimental results obtained for these algorithms from FPGA-based acceleration were compared with the state of the art CPU implementation. The optimized algorithms were implemented on two different FPGAs (Intel Stratix A7 and Intel Arria 10 GX). Experimental results show that the FPGA-based accelerators provided similar or better execution time (up to 80X) and better power efficiency (75% reduction in power consumption) than traditional platforms such as a workstation based on two Intel Xeon processors E5-2620 Series (each with 6 cores and running at 2.4 GHz)
FlexQueue: Simple and Efficient Priority Queue for System Software
Existing studies of priority queue implementations often focus on improving canonical operations such as insert and deleteMin, while sacrificing design simplicity and pre- dictable worst-case latency. Design simplicity is sacrificed as the algorithm becomes more and more optimized, taking into account characteristics of the input workload distribution. Predictable worst-case latency is sacrificed when operations such as memory allocation and structural re-organization are deferred until absolutely necessary. While these techniques often yield performance improvement to some degree, it is possible to take a step back and ask a more basic question: is it possible to achieve similar performance while retaining a simple design? By combining techniques such as hierarchical bit-vector and dynamic horizon resizing, all of which are straight-forward in principle, this thesis presents a new priority queue design called FlexQueue, that answers this question with a definitive âyesâ
Database System Acceleration on FPGAs
Relational database systems provide various services and applications with an efficient means for storing, processing, and retrieving their data. The performance of these systems has a direct impact on the quality of service of the applications that rely on them. Therefore, it is crucial that database systems are able to adapt and grow in tandem with the demands of these applications, ensuring that their performance scales accordingly. In the past, Moore's law and algorithmic advancements have been sufficient to meet these demands. However, with the slowdown of Moore's law, researchers have begun exploring alternative methods, such as application-specific technologies, to satisfy the more challenging performance requirements. One such technology is field-programmable gate arrays (FPGAs), which provide ideal platforms for developing and running custom architectures for accelerating database systems.
The goal of this thesis is to develop a domain-specific architecture that can enhance the performance of in-memory database systems when executing analytical queries. Our research is guided by a combination of academic and industrial requirements that seek to strike a balance between generality and performance. The former ensures that our platform can be used to process a diverse range of workloads, while the latter makes it an attractive solution for high-performance use cases.
Throughout this thesis, we present the development of a system-on-chip for database system acceleration that meets our requirements. The resulting architecture, called CbMSMK, is capable of processing the projection, sort, aggregation, and equi-join database operators and can also run some complex TPC-H queries. CbMSMK employs a shared sort-merge pipeline for executing all these operators, which results in an efficient use of FPGA resources. This approach enables the instantiation of multiple acceleration cores on the FPGA, allowing it to serve multiple clients simultaneously. CbMSMK can process both arbitrarily deep and wide tables efficiently. The former is achieved through the use of the sort-merge algorithm which utilizes the FPGA RAM for buffering intermediate sort results. The latter is achieved through the use of KeRRaS, a novel variant of the forward radix sort algorithm introduced in this thesis. KeRRaS allows CbMSMK to process a table a few columns at a time, incrementally generating the final result through multiple iterations. Given that acceleration is a key objective of our work, CbMSMK benefits from many performance optimizations. For instance, multi-way merging is employed to reduce the number of merge passes required for the execution of the sort-merge algorithm, thus improving the performance of all our pipeline-breaking operators. Another example is our in-depth analysis of early aggregation, which led to the development of a novel cache-based algorithm that significantly enhances aggregation performance. Our experiments demonstrate that CbMSMK performs on average 5 times faster than the state-of-the-art CPU-based database management system MonetDB.:I Database Systems & FPGAs
1 INTRODUCTION
1.1 Databases & the Importance of Performance
1.2 Accelerators & FPGAs
1.3 Requirements
1.4 Outline & Summary of Contributions
2 BACKGROUND ON DATABASE SYSTEMS
2.1 Databases
2.1.1 Storage Model
2.1.2 Storage Medium
2.2 Database Operators
2.2.1 Projection
2.2.2 Filter
2.2.3 Sort
2.2.4 Aggregation
2.2.5 Join
2.2.6 Operator Classification
2.3 Database Queries
2.4 Impact of Acceleration
3 BACKGROUND ON FPGAS
3.1 FPGA
3.1.1 Logic Element
3.1.2 Block RAM (BRAM)
3.1.3 Digital Signal Processor (DSP)
3.1.4 IO Element
3.1.5 Programmable Interconnect
3.2 FPGADesignFlow
3.2.1 Specifications
3.2.2 RTL Description
3.2.3 Verification
3.2.4 Synthesis, Mapping, Placement, and Routing
3.2.5 TimingAnalysis
3.2.6 Bitstream Generation and FPGA Programming
3.3 Implementation Quality Metrics
3.4 FPGA Cards
3.5 Benefits of Using FPGAs
3.6 Challenges of Using FPGAs
4 RELATED WORK
4.1 Summary of Related Work
4.2 Platform Type
4.2.1 Accelerator Card
4.2.2 Coprocessor
4.2.3 Smart Storage
4.2.4 Network Processor
4.3 Implementation
4.3.1 Loop-based implementation
4.3.2 Sort-based Implementation
4.3.3 Hash-based Implementation
4.3.4 Mixed Implementation
4.4 A Note on Quantitative Performance Comparisons
II Cache-Based Morphing Sort-Merge with KeRRaS (CbMSMK)
5 OBJECTIVES AND ARCHITECTURE OVERVIEW
5.1 From Requirements to Objectives
5.2 Architecture Overview
5.3 Outlineof Part II
6 COMPARATIVE ANALYSIS OF OPENCL AND RTL FOR SORT-MERGE PRIMITIVES ON FPGAS
6.1 Programming FPGAs
6.2 RelatedWork
6.3 Architecture
6.3.1 Global Architecture
6.3.2 Sorter Architecture
6.3.3 Merger Architecture
6.3.4 Scalability and Resource Adaptability
6.4 Experiments
6.4.1 OpenCL Sort-Merge Implementation
6.4.2 RTLSorters
6.4.3 RTLMergers
6.4.4 Hybrid OpenCL-RTL Sort-Merge Implementation
6.5 Summary & Discussion
7 RESOURCE-EFFICIENT ACCELERATION OF PIPELINE-BREAKING DATABASE OPERATORS ON FPGAS
7.1 The Case for Resource Efficiency
7.2 Related Work
7.3 Architecture
7.3.1 Sorters
7.3.2 Sort-Network
7.3.3 X:Y Mergers
7.3.4 Merge-Network
7.3.5 Join Materialiser (JoinMat)
7.4 Experiments
7.4.1 Experimental Setup
7.4.2 Implementation Description & Tuning
7.4.3 Sort Benchmarks
7.4.4 Aggregation Benchmarks
7.4.5 Join Benchmarks
7. Summary
8 KERRAS: COLUMN-ORIENTED WIDE TABLE PROCESSING ON FPGAS
8.1 The Scope of Database System Accelerators
8.2 Related Work
8.3 Key-Reduce Radix Sort(KeRRaS)
8.3.1 Time Complexity
8.3.2 Space Complexity (Memory Utilization)
8.3.3 Discussion and Optimizations
8.4 Architecture
8.4.1 MSM
8.4.2 MSMK: Extending MSM with KeRRaS
8.4.3 Payload, Aggregation and Join Processing
8.4.4 Limitations
8.5 Experiments
8.5.1 Experimental Setup
8.5.2 Datasets
8.5.3 MSMK vs. MSM
8.5.4 Payload-Less Benchmarks
8.5.5 Payload-Based Benchmarks
8.5.6 Flexibility
8.6 Summary
9 A STUDY OF EARLY AGGREGATION IN DATABASE QUERY PROCESSING ON FPGAS
9.1 Early Aggregation
9.2 Background & Related Work
9.2.1 Sort-Based Early Aggregation
9.2.2 Cache-Based Early Aggregation
9.3 Simulations
9.3.1 Datasets
9.3.2 Metrics
9.3.3 Sort-Based Versus Cache-Based Early Aggregation
9.3.4 Comparison of Set-Associative Caches
9.3.5 Comparison of Cache Structures
9.3.6 Comparison of Replacement Policies
9.3.7 Cache Selection Methodology
9.4 Cache System Architecture
9.4.1 Window Aggregator
9.4.2 Compressor & Hasher
9.4.3 Collision Detector
9.4.4 Collision Resolver
9.4.5 Cache
9.5 Experiments
9.5.1 Experimental Setup
9.5.2 Resource Utilization and Parameter Tuning
9.5.3 Datasets
9.5.4 Benchmarks on Synthetic Data
9.5.5 Benchmarks on Real Data
9.6 Summary
10 THE FULL PICTURE
10.1 System Architecture
10.2 Benchmarks
10.3 Meeting the Objectives
III Conclusion
11 SUMMARY AND OUTLOOK ON FUTURE RESEARCH
11.1 Summary
11.2 Future Work
BIBLIOGRAPHY
LIST OF FIGURES
LIST OF TABLE
Recommended from our members
Higher Compression from the Burrows-Wheeler Transform with New Algorithms for the List Update Problem
Burrows-Wheeler compression is a three stage process in which the data is transformed with the Burrows-Wheeler Transform, then transformed with Move-To-Front, and finally encoded with an entropy coder. Move-To-Front, Transpose, and Frequency Count are some of the many algorithms used on the List Update problem. In 1985, Competitive Analysis first showed the superiority of Move-To-Front over Transpose and Frequency Count for the List Update problem with arbitrary data. Earlier studies due to Bitner assumed independent identically distributed data, and showed that while Move-To-Front adapts to a distribution faster, incurring less overwork, the asymptotic costs of Frequency Count and Transpose are less. The improvements to Burrows-Wheeler compression this work covers are increases in the amount, not speed, of compression. Best x of 2x-1 is a new family of algorithms created to improve on Move-To-Front's processing of the output of the Burrows-Wheeler Transform which is like piecewise independent identically distributed data. Other algorithms for both the middle stage of Burrows-Wheeler compression and the List Update problem for which overwork, asymptotic cost, and competitive ratios are also analyzed are several variations of Move One From Front and part of the randomized algorithm Timestamp. The Best x of 2x - 1 family includes Move-To-Front, the part of Timestamp of interest, and Frequency Count. Lastly, a greedy choosing scheme, Snake, switches back and forth as the amount of compression that two List Update algorithms achieves fluctuates, to increase overall compression. The Burrows-Wheeler Transform is based on sorting of contexts. The other improvements are better sorting orders, such as âaeioubcdf...â instead of standard alphabetical âabcdefghi...â on English text data, and an algorithm for computing orders for any data, and Gray code sorting instead of standard sorting. Both techniques lessen the overwork incurred by whatever List Update algorithms are used by reducing the difference between adjacent sorted contexts
Scalable String and Suffix Sorting: Algorithms, Techniques, and Tools
This dissertation focuses on two fundamental sorting problems: string sorting
and suffix sorting. The first part considers parallel string sorting on
shared-memory multi-core machines, the second part external memory suffix
sorting using the induced sorting principle, and the third part distributed
external memory suffix sorting with a new distributed algorithmic big data
framework named Thrill.Comment: 396 pages, dissertation, Karlsruher Instituts f\"ur Technologie
(2018). arXiv admin note: text overlap with arXiv:1101.3448 by other author