25 research outputs found
FLiMS: a fast lightweight 2-way merger for sorting
In this paper, we present FLiMS, a highly-efficient and simple parallel algorithm for merging two sorted lists residing in banked and/or wide memory. On FPGAs, its implementation uses fewer hardware resources than the state-of-the-art alternatives, due to the reduced number of comparators and elimination of redundant logic found on prior attempts. In combination with the distributed nature of the selector stage, a higher performance is achieved for the same amount of parallelism or higher. This is useful in many applications such as in parallel merge trees to achieve high-throughput sorting, where the resource utilisation of the merger is critical for building larger trees and internalising the workload for faster computation. Also presented are efficient variations of FLiMS for optimizing throughput for skewed datasets, achieving stable sorting or using fewer dequeue signals. FLiMS is also shown to perform well as conventional software on modern CPUs supporting single-instruction multiple-data (SIMD) instructions, surpassing the performance of some standard libraries for sorting
Database System Acceleration on FPGAs
Relational database systems provide various services and applications with an efficient means for storing, processing, and retrieving their data. The performance of these systems has a direct impact on the quality of service of the applications that rely on them. Therefore, it is crucial that database systems are able to adapt and grow in tandem with the demands of these applications, ensuring that their performance scales accordingly. In the past, Moore's law and algorithmic advancements have been sufficient to meet these demands. However, with the slowdown of Moore's law, researchers have begun exploring alternative methods, such as application-specific technologies, to satisfy the more challenging performance requirements. One such technology is field-programmable gate arrays (FPGAs), which provide ideal platforms for developing and running custom architectures for accelerating database systems.
The goal of this thesis is to develop a domain-specific architecture that can enhance the performance of in-memory database systems when executing analytical queries. Our research is guided by a combination of academic and industrial requirements that seek to strike a balance between generality and performance. The former ensures that our platform can be used to process a diverse range of workloads, while the latter makes it an attractive solution for high-performance use cases.
Throughout this thesis, we present the development of a system-on-chip for database system acceleration that meets our requirements. The resulting architecture, called CbMSMK, is capable of processing the projection, sort, aggregation, and equi-join database operators and can also run some complex TPC-H queries. CbMSMK employs a shared sort-merge pipeline for executing all these operators, which results in an efficient use of FPGA resources. This approach enables the instantiation of multiple acceleration cores on the FPGA, allowing it to serve multiple clients simultaneously. CbMSMK can process both arbitrarily deep and wide tables efficiently. The former is achieved through the use of the sort-merge algorithm which utilizes the FPGA RAM for buffering intermediate sort results. The latter is achieved through the use of KeRRaS, a novel variant of the forward radix sort algorithm introduced in this thesis. KeRRaS allows CbMSMK to process a table a few columns at a time, incrementally generating the final result through multiple iterations. Given that acceleration is a key objective of our work, CbMSMK benefits from many performance optimizations. For instance, multi-way merging is employed to reduce the number of merge passes required for the execution of the sort-merge algorithm, thus improving the performance of all our pipeline-breaking operators. Another example is our in-depth analysis of early aggregation, which led to the development of a novel cache-based algorithm that significantly enhances aggregation performance. Our experiments demonstrate that CbMSMK performs on average 5 times faster than the state-of-the-art CPU-based database management system MonetDB.:I Database Systems & FPGAs
1 INTRODUCTION
1.1 Databases & the Importance of Performance
1.2 Accelerators & FPGAs
1.3 Requirements
1.4 Outline & Summary of Contributions
2 BACKGROUND ON DATABASE SYSTEMS
2.1 Databases
2.1.1 Storage Model
2.1.2 Storage Medium
2.2 Database Operators
2.2.1 Projection
2.2.2 Filter
2.2.3 Sort
2.2.4 Aggregation
2.2.5 Join
2.2.6 Operator Classification
2.3 Database Queries
2.4 Impact of Acceleration
3 BACKGROUND ON FPGAS
3.1 FPGA
3.1.1 Logic Element
3.1.2 Block RAM (BRAM)
3.1.3 Digital Signal Processor (DSP)
3.1.4 IO Element
3.1.5 Programmable Interconnect
3.2 FPGADesignFlow
3.2.1 Specifications
3.2.2 RTL Description
3.2.3 Verification
3.2.4 Synthesis, Mapping, Placement, and Routing
3.2.5 TimingAnalysis
3.2.6 Bitstream Generation and FPGA Programming
3.3 Implementation Quality Metrics
3.4 FPGA Cards
3.5 Benefits of Using FPGAs
3.6 Challenges of Using FPGAs
4 RELATED WORK
4.1 Summary of Related Work
4.2 Platform Type
4.2.1 Accelerator Card
4.2.2 Coprocessor
4.2.3 Smart Storage
4.2.4 Network Processor
4.3 Implementation
4.3.1 Loop-based implementation
4.3.2 Sort-based Implementation
4.3.3 Hash-based Implementation
4.3.4 Mixed Implementation
4.4 A Note on Quantitative Performance Comparisons
II Cache-Based Morphing Sort-Merge with KeRRaS (CbMSMK)
5 OBJECTIVES AND ARCHITECTURE OVERVIEW
5.1 From Requirements to Objectives
5.2 Architecture Overview
5.3 Outlineof Part II
6 COMPARATIVE ANALYSIS OF OPENCL AND RTL FOR SORT-MERGE PRIMITIVES ON FPGAS
6.1 Programming FPGAs
6.2 RelatedWork
6.3 Architecture
6.3.1 Global Architecture
6.3.2 Sorter Architecture
6.3.3 Merger Architecture
6.3.4 Scalability and Resource Adaptability
6.4 Experiments
6.4.1 OpenCL Sort-Merge Implementation
6.4.2 RTLSorters
6.4.3 RTLMergers
6.4.4 Hybrid OpenCL-RTL Sort-Merge Implementation
6.5 Summary & Discussion
7 RESOURCE-EFFICIENT ACCELERATION OF PIPELINE-BREAKING DATABASE OPERATORS ON FPGAS
7.1 The Case for Resource Efficiency
7.2 Related Work
7.3 Architecture
7.3.1 Sorters
7.3.2 Sort-Network
7.3.3 X:Y Mergers
7.3.4 Merge-Network
7.3.5 Join Materialiser (JoinMat)
7.4 Experiments
7.4.1 Experimental Setup
7.4.2 Implementation Description & Tuning
7.4.3 Sort Benchmarks
7.4.4 Aggregation Benchmarks
7.4.5 Join Benchmarks
7. Summary
8 KERRAS: COLUMN-ORIENTED WIDE TABLE PROCESSING ON FPGAS
8.1 The Scope of Database System Accelerators
8.2 Related Work
8.3 Key-Reduce Radix Sort(KeRRaS)
8.3.1 Time Complexity
8.3.2 Space Complexity (Memory Utilization)
8.3.3 Discussion and Optimizations
8.4 Architecture
8.4.1 MSM
8.4.2 MSMK: Extending MSM with KeRRaS
8.4.3 Payload, Aggregation and Join Processing
8.4.4 Limitations
8.5 Experiments
8.5.1 Experimental Setup
8.5.2 Datasets
8.5.3 MSMK vs. MSM
8.5.4 Payload-Less Benchmarks
8.5.5 Payload-Based Benchmarks
8.5.6 Flexibility
8.6 Summary
9 A STUDY OF EARLY AGGREGATION IN DATABASE QUERY PROCESSING ON FPGAS
9.1 Early Aggregation
9.2 Background & Related Work
9.2.1 Sort-Based Early Aggregation
9.2.2 Cache-Based Early Aggregation
9.3 Simulations
9.3.1 Datasets
9.3.2 Metrics
9.3.3 Sort-Based Versus Cache-Based Early Aggregation
9.3.4 Comparison of Set-Associative Caches
9.3.5 Comparison of Cache Structures
9.3.6 Comparison of Replacement Policies
9.3.7 Cache Selection Methodology
9.4 Cache System Architecture
9.4.1 Window Aggregator
9.4.2 Compressor & Hasher
9.4.3 Collision Detector
9.4.4 Collision Resolver
9.4.5 Cache
9.5 Experiments
9.5.1 Experimental Setup
9.5.2 Resource Utilization and Parameter Tuning
9.5.3 Datasets
9.5.4 Benchmarks on Synthetic Data
9.5.5 Benchmarks on Real Data
9.6 Summary
10 THE FULL PICTURE
10.1 System Architecture
10.2 Benchmarks
10.3 Meeting the Objectives
III Conclusion
11 SUMMARY AND OUTLOOK ON FUTURE RESEARCH
11.1 Summary
11.2 Future Work
BIBLIOGRAPHY
LIST OF FIGURES
LIST OF TABLE
Compact GF(2) systemizer and optimized constant-time hardware sorters for Key Generation in Classic McEliece
Classic McEliece is a code-based quantum-resistant public-key scheme characterized with relative high encapsulation/decapsulation speed and small cipher- texts, with an in-depth analysis on its security. However, slow key generation with large public key size make it hard for wider applications. Based on this observation, a high-throughput key generator in hardware, is proposed to accelerate the key generation in Classic McEliece based on algorithm-hardware co-design. Meanwhile the storage overhead caused by large-size keys is also minimized. First, compact large-size GF(2) Gauss elimination is presented by adopting naive processing array, singular matrix detection-based early abort, and memory-friendly scheduling strategy. Second, an optimized constant-time hardware sorter is proposed to support regular memory accesses with less comparators and storage. Third, algorithm-level pipeline is enabled for high-throughput processing, allowing for concurrent key generation based on decoupling between data access and computation
AxleDB: A novel programmable query processing platform on FPGA
With the rise of Big Data, providing high-performance query processing capabilities through the acceleration of the database analytic has gained significant attention. Leveraging Field Programmable Gate Array (FPGA) technology, this approach can lead to clear benefits. In this work, we present the design and implementation of AxleDB: An FPGA-based platform that enables fast query processing for database systems by melding novel database-specific accelerators with commercial-off-the-shelf (COTS) storage using modern interfaces, in a novel, unified, and a programmable environment. AxleDB can perform a large subset of SQL queries through its set of instructions that can map compute-intensive database operations, such as filter, arithmetic, aggregate, group by, table join, or sort, on to the specialized high-throughput accelerators. To minimize the amount of SSD I/O operations required, AxleDB also supports hardware MinMax indexing for databases. We evaluated AxleDB with five decision support queries from the TPC-H benchmark suite and achieved a speedup from 1.8X to 34.2X and energy efficiency from 2.8X to 62.1X, in comparison to the state-of-the-art DBMS, i.e., PostgreSQL and MonetDB.The research leading to these results has received funding from the European Union Seventh Framework Program (FP7) (under the AXLE project GA number 318633), the Ministry of Economy and Competitiveness
of Spain (under contract number TIN2015-65316-p), Turkish Ministry of Development TAM Project (number 2007K120610), and Bogazici University Scientific Projects (number 7060).Peer ReviewedPostprint (author's final draft
Recommended from our members
Accelerating Similarly Structured Data
The failure of Dennard scaling [Bohr, 2007] and the rapid growth of data produced and consumed daily [NetApp, 2012] have made mitigating the dark silicon phenomena [Esmaeilzadeh et al., 2011] and providing fast computation for processing large volumes and expansive variety of data while consuming minimal energy the utmost important challenges for modern computer architecture. This thesis introduces the concept that grouping data structures that are previously defined in software and processing them with an accelerator can significantly improve the application performance and energy efficiency.
To measure the potential performance benefits of this hypothesis, this research starts out by examining the cache impacts on accelerating commonly used data structures and its applicability to popular benchmarks. We found that accelerating similarly structured data can provide substantial benefits, however, most popular benchmark suites do not contain shared acceleration targets and therefore cannot obtain significant performance or energy improvements via a handful of accelerators. To further examine this hypothesis in an environment where the common data structures are widely used, we choose to target database application domain, using tables and columns as the similarly structured data, accelerating the processing of such data, and evaluate the performance and energy efficiency. Given that data partitioning is widely used for database applications to improve cache locality, we architect and design a streaming data partitioning accelerator to assess the feasibility of big data acceleration. The results show that we are able to achieve an order of magnitude improvement in partitioning performance and energy. To improve upon the present ad-hoc communications between accelerators and general-purpose processors [Vo et al., 2013], we also architect and evaluate a streaming framework that can be used for the data parti- tioner and other streaming accelerators alike. The streaming framework can provide at least 5 GB/s per stream per thread using software control, and is able to elegantly handle interrupts and context switches using a simple save/restore. As a final evaluation of this hypothesis, we architect a class of domain-specific database processors, or Database Processing Units (DPUs), to further improve the performance and energy efficiency of database applications. As a case study, we design and implement one DPU, called Q100, to execute industry standard analytic database queries. Despite Q100's sensitivity to communication bandwidth on-chip and off-chip, we find that the low-power configuration of Q100 is able to provide three orders of magnitude in energy efficiency over a state of the art software Database Management System (DBMS), while the high-performance configuration is able to outperform the same DBMS by 70X.
Based on these experiments, we conclude that grouping similarly structured data and processing it with accelerators vastly improve application performance and energy efficiency for a given application domain. This is primarily due to the fact that creating specialized encapsulated instruction and data accesses and datapaths allows us to mitigate unnecessary data movement, take advantage of data and pipeline parallelism, and consequently provide substantial energy savings while obtaining significant performance gains
Preliminary multicore architecture for Introspective Computing
Thesis (S.M.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2007.This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections.Includes bibliographical references (p. 243-245).This thesis creates a framework for Introspective Computing. Introspective Computing is a computing paradigm characterized by self-aware software. Self-aware software systems use hardware mechanisms to observe an application's execution so that they may adapt execution to improve performance, reduce power consumption, or balance user-defined fitness criteria over time-varying conditions in a system environment. We dub our framework Partner Cores. The Partner Cores framework builds upon tiled multicore architectures [11, 10, 25, 9], closely coupling cores such that one may be used to observe and optimize execution in another. Partner cores incrementally collect and analyze execution traces from code cores then exploit knowledge of the hardware to optimize execution. This thesis develops a tiled architecture for the Partner Cores framework that we dub Evolve. Evolve provides a versatile substrate upon which software may coordinate core partnerships and various forms of parallelism. To do so, Evolve augments a basic tiled architecture with introspection hardware and programmable functional units. Partner Cores software systems on the Evolve hardware may follow the style of helper threading [13, 12, 6] or utilize the programmable functional units in each core to evolve application-specific coprocessor engines. This thesis work develops two Partner Cores software systems: the Dynamic Partner-Assisted Branch Predictor and the Introspective L2 Memory System (IL2). The branch predictor employs a partner core as a coprocessor engine for general dynamic branch prediction in a corresponding code core. The IL2 retasks the silicon resources of partner cores as banks of an on-chip, distributed, software L2 cache for code cores.(cont.) The IL2 employs aggressive, application-specific prefetchers for minimizing cache miss penalties and DRAM power consumption. Our results and future work show that the branch predictor is able to sustain prediction for code core branch frequencies as high as one every 7 instructions with no degradation in accuracy; updated prediction directions are available in a low minimum of 20-21 instructions. For the IL2, we develop a pixel block prefetcher for the image data structure used in a JPEG encoder benchmark and show that a 50% improvement in absolute performance is attainable.by Jonathan M. Eastep.S.M
Synthesis Techniques for Semi-Custom Dynamically Reconfigurable Superscalar Processors
The accelerated adoption of reconfigurable computing foreshadows a computational paradigm shift, aimed at fulfilling the need of customizable yet high-performance flexible hardware. Reconfigurable computing fulfills this need by allowing the physical resources of a chip to be adapted to the computational requirements of a specific program, thus achieving higher levels of computing performance. This dissertation evaluates the area requirements for reconfigurable processing, an important yet often disregarded assessment for partial reconfiguration. Common reconfigurable computing approaches today attempt to create custom circuitry in static co-processor accelerators. We instead focused on a new approach that synthesized semi-custom general-purpose processor cores. Each superscalar processor core's execution units can be customized for a particular application, yet the processor retains its standard microprocessor interface. We analyzed the area consumption for these computational components by studying the synthesis requirements of different processor configurations. This area/performance assessment aids designers when constraining processing elements in a fixed-size area slot, a requirement for modern partial reconfiguration approaches. Our results provide a more deterministic evaluation of performance density, hence making the area cost analysis less ambiguous when optimizing dynamic systems for coarse-grained parallelism. The results obtained showed that even though performance density decreases with processor complexity, the additional area still provides a positive contribution to the aggregate parallel processing performance. This evaluation of parallel execution density contributes to ongoing efforts in the field of reconfigurable computing by providing a baseline for area/performance trade-offs for partial reconfiguration and multi-processor systems