38 research outputs found
Database System Acceleration on FPGAs
Relational database systems provide various services and applications with an efficient means for storing, processing, and retrieving their data. The performance of these systems has a direct impact on the quality of service of the applications that rely on them. Therefore, it is crucial that database systems are able to adapt and grow in tandem with the demands of these applications, ensuring that their performance scales accordingly. In the past, Moore's law and algorithmic advancements have been sufficient to meet these demands. However, with the slowdown of Moore's law, researchers have begun exploring alternative methods, such as application-specific technologies, to satisfy the more challenging performance requirements. One such technology is field-programmable gate arrays (FPGAs), which provide ideal platforms for developing and running custom architectures for accelerating database systems.
The goal of this thesis is to develop a domain-specific architecture that can enhance the performance of in-memory database systems when executing analytical queries. Our research is guided by a combination of academic and industrial requirements that seek to strike a balance between generality and performance. The former ensures that our platform can be used to process a diverse range of workloads, while the latter makes it an attractive solution for high-performance use cases.
Throughout this thesis, we present the development of a system-on-chip for database system acceleration that meets our requirements. The resulting architecture, called CbMSMK, is capable of processing the projection, sort, aggregation, and equi-join database operators and can also run some complex TPC-H queries. CbMSMK employs a shared sort-merge pipeline for executing all these operators, which results in an efficient use of FPGA resources. This approach enables the instantiation of multiple acceleration cores on the FPGA, allowing it to serve multiple clients simultaneously. CbMSMK can process both arbitrarily deep and wide tables efficiently. The former is achieved through the use of the sort-merge algorithm which utilizes the FPGA RAM for buffering intermediate sort results. The latter is achieved through the use of KeRRaS, a novel variant of the forward radix sort algorithm introduced in this thesis. KeRRaS allows CbMSMK to process a table a few columns at a time, incrementally generating the final result through multiple iterations. Given that acceleration is a key objective of our work, CbMSMK benefits from many performance optimizations. For instance, multi-way merging is employed to reduce the number of merge passes required for the execution of the sort-merge algorithm, thus improving the performance of all our pipeline-breaking operators. Another example is our in-depth analysis of early aggregation, which led to the development of a novel cache-based algorithm that significantly enhances aggregation performance. Our experiments demonstrate that CbMSMK performs on average 5 times faster than the state-of-the-art CPU-based database management system MonetDB.:I Database Systems & FPGAs
1 INTRODUCTION
1.1 Databases & the Importance of Performance
1.2 Accelerators & FPGAs
1.3 Requirements
1.4 Outline & Summary of Contributions
2 BACKGROUND ON DATABASE SYSTEMS
2.1 Databases
2.1.1 Storage Model
2.1.2 Storage Medium
2.2 Database Operators
2.2.1 Projection
2.2.2 Filter
2.2.3 Sort
2.2.4 Aggregation
2.2.5 Join
2.2.6 Operator Classification
2.3 Database Queries
2.4 Impact of Acceleration
3 BACKGROUND ON FPGAS
3.1 FPGA
3.1.1 Logic Element
3.1.2 Block RAM (BRAM)
3.1.3 Digital Signal Processor (DSP)
3.1.4 IO Element
3.1.5 Programmable Interconnect
3.2 FPGADesignFlow
3.2.1 Specifications
3.2.2 RTL Description
3.2.3 Verification
3.2.4 Synthesis, Mapping, Placement, and Routing
3.2.5 TimingAnalysis
3.2.6 Bitstream Generation and FPGA Programming
3.3 Implementation Quality Metrics
3.4 FPGA Cards
3.5 Benefits of Using FPGAs
3.6 Challenges of Using FPGAs
4 RELATED WORK
4.1 Summary of Related Work
4.2 Platform Type
4.2.1 Accelerator Card
4.2.2 Coprocessor
4.2.3 Smart Storage
4.2.4 Network Processor
4.3 Implementation
4.3.1 Loop-based implementation
4.3.2 Sort-based Implementation
4.3.3 Hash-based Implementation
4.3.4 Mixed Implementation
4.4 A Note on Quantitative Performance Comparisons
II Cache-Based Morphing Sort-Merge with KeRRaS (CbMSMK)
5 OBJECTIVES AND ARCHITECTURE OVERVIEW
5.1 From Requirements to Objectives
5.2 Architecture Overview
5.3 Outlineof Part II
6 COMPARATIVE ANALYSIS OF OPENCL AND RTL FOR SORT-MERGE PRIMITIVES ON FPGAS
6.1 Programming FPGAs
6.2 RelatedWork
6.3 Architecture
6.3.1 Global Architecture
6.3.2 Sorter Architecture
6.3.3 Merger Architecture
6.3.4 Scalability and Resource Adaptability
6.4 Experiments
6.4.1 OpenCL Sort-Merge Implementation
6.4.2 RTLSorters
6.4.3 RTLMergers
6.4.4 Hybrid OpenCL-RTL Sort-Merge Implementation
6.5 Summary & Discussion
7 RESOURCE-EFFICIENT ACCELERATION OF PIPELINE-BREAKING DATABASE OPERATORS ON FPGAS
7.1 The Case for Resource Efficiency
7.2 Related Work
7.3 Architecture
7.3.1 Sorters
7.3.2 Sort-Network
7.3.3 X:Y Mergers
7.3.4 Merge-Network
7.3.5 Join Materialiser (JoinMat)
7.4 Experiments
7.4.1 Experimental Setup
7.4.2 Implementation Description & Tuning
7.4.3 Sort Benchmarks
7.4.4 Aggregation Benchmarks
7.4.5 Join Benchmarks
7. Summary
8 KERRAS: COLUMN-ORIENTED WIDE TABLE PROCESSING ON FPGAS
8.1 The Scope of Database System Accelerators
8.2 Related Work
8.3 Key-Reduce Radix Sort(KeRRaS)
8.3.1 Time Complexity
8.3.2 Space Complexity (Memory Utilization)
8.3.3 Discussion and Optimizations
8.4 Architecture
8.4.1 MSM
8.4.2 MSMK: Extending MSM with KeRRaS
8.4.3 Payload, Aggregation and Join Processing
8.4.4 Limitations
8.5 Experiments
8.5.1 Experimental Setup
8.5.2 Datasets
8.5.3 MSMK vs. MSM
8.5.4 Payload-Less Benchmarks
8.5.5 Payload-Based Benchmarks
8.5.6 Flexibility
8.6 Summary
9 A STUDY OF EARLY AGGREGATION IN DATABASE QUERY PROCESSING ON FPGAS
9.1 Early Aggregation
9.2 Background & Related Work
9.2.1 Sort-Based Early Aggregation
9.2.2 Cache-Based Early Aggregation
9.3 Simulations
9.3.1 Datasets
9.3.2 Metrics
9.3.3 Sort-Based Versus Cache-Based Early Aggregation
9.3.4 Comparison of Set-Associative Caches
9.3.5 Comparison of Cache Structures
9.3.6 Comparison of Replacement Policies
9.3.7 Cache Selection Methodology
9.4 Cache System Architecture
9.4.1 Window Aggregator
9.4.2 Compressor & Hasher
9.4.3 Collision Detector
9.4.4 Collision Resolver
9.4.5 Cache
9.5 Experiments
9.5.1 Experimental Setup
9.5.2 Resource Utilization and Parameter Tuning
9.5.3 Datasets
9.5.4 Benchmarks on Synthetic Data
9.5.5 Benchmarks on Real Data
9.6 Summary
10 THE FULL PICTURE
10.1 System Architecture
10.2 Benchmarks
10.3 Meeting the Objectives
III Conclusion
11 SUMMARY AND OUTLOOK ON FUTURE RESEARCH
11.1 Summary
11.2 Future Work
BIBLIOGRAPHY
LIST OF FIGURES
LIST OF TABLE
Taming Numbers and Durations in the Model Checking Integrated Planning System
The Model Checking Integrated Planning System (MIPS) is a temporal least
commitment heuristic search planner based on a flexible object-oriented
workbench architecture. Its design clearly separates explicit and symbolic
directed exploration algorithms from the set of on-line and off-line computed
estimates and associated data structures. MIPS has shown distinguished
performance in the last two international planning competitions. In the last
event the description language was extended from pure propositional planning to
include numerical state variables, action durations, and plan quality objective
functions. Plans were no longer sequences of actions but time-stamped
schedules. As a participant of the fully automated track of the competition,
MIPS has proven to be a general system; in each track and every benchmark
domain it efficiently computed plans of remarkable quality. This article
introduces and analyzes the most important algorithmic novelties that were
necessary to tackle the new layers of expressiveness in the benchmark problems
and to achieve a high level of performance. The extensions include critical
path analysis of sequentially generated plans to generate corresponding optimal
parallel plans. The linear time algorithm to compute the parallel plan bypasses
known NP hardness results for partial ordering by scheduling plans with respect
to the set of actions and the imposed precedence relations. The efficiency of
this algorithm also allows us to improve the exploration guidance: for each
encountered planning state the corresponding approximate sequential plan is
scheduled. One major strength of MIPS is its static analysis phase that grounds
and simplifies parameterized predicates, functions and operators, that infers
knowledge to minimize the state description length, and that detects domain
object symmetries. The latter aspect is analyzed in detail. MIPS has been
developed to serve as a complete and optimal state space planner, with
admissible estimates, exploration engines and branching cuts. In the
competition version, however, certain performance compromises had to be made,
including floating point arithmetic, weighted heuristic search exploration
according to an inadmissible estimate and parameterized optimization
A Method for Image Classification Using Low-Precision Analog Computing Arrays
Computing with analog micro electronics can offer several advantages over standard digital technology, most notably: Low space and power consumption and massive parallelization. On the other hand, analog computation lacks the exactness of digital calculations due to inevitable device variations introduced during the chip production, but also due to electric noise in the analog signals. Artificial neural networks are well suited for parallel analog implementations, first, because of their inherent parallelity and second, because they can adapt to device imperfections by training. This thesis evaluates the feasibility of implementing a convolutional neural network for image classification on a massively parallel low-power hardware system. A particular, mixed analogdigital, hardware model is considered, featuring simple threshold neurons. Appropriate, gradient-free, training algorithms, combining self-organization and supervised learning are developed and tested with two benchmark problems (MNIST hand-written digits and traffic signs). Software simulations evaluate the methods under various defined computation faults. A model-free closed-loop technique is shown to compensate for rather serious computation errors without the need for explicit error quantification. Last but not least, the developed networks and the training techniques are verified on a real prototype chip
Topics in Compressed Sensing
Compressed sensing has a wide range of applications that include error correction, imaging, radar and many more. Given a sparse signal in a high dimensional space, one wishes to reconstruct that signal accurately and efficiently from a number of linear measurements much less than its actual dimension. Although in theory it is clear that this is possible, the difficulty lies in the construction of algorithms that perform the recovery efficiently, as well as determining which kind of linear measurements allow for the reconstruction. There have been two distinct major approaches to sparse recovery that each present different benefits and shortcomings. The first, L1-minimization methods such as Basis Pursuit, use a linear optimization problem to recover the signal. This method provides strong guarantees and stability, but relies on Linear Programming, whose methods do not yet have strong polynomially bounded runtimes. The second approach uses greedy methods that compute the support of the signal iteratively. These methods are usually much faster than Basis Pursuit, but until recently had not been able to provide the same guarantees. This gap between the two approaches was bridged when we developed and analyzed the greedy algorithm Regularized Orthogonal Matching Pursuit (ROMP). ROMP provides similar guarantees to Basis Pursuit as well as the speed of a greedy algorithm. Our more recent algorithm Compressive Sampling Matching Pursuit (CoSaMP) improves upon these guarantees, and is optimal in every important aspect
Just-in-time Hardware generation for abstracted reconfigurable computing
This thesis addresses the use of reconfigurable hardware in computing platforms, in order to harness the performance benefits of dedicated hardware whilst maintaining the flexibility associated with software. Although the reconfigurable computing concept is not new, the low level nature of the supporting tools normally used, together with the consequent limited level of abstraction and resultant lack of backwards compatibility, has prevented the widespread adoption of this technology. In addition, bandwidth and architectural limitations, have seriously constrained the potential improvements in performance. A review of existing approaches and tools flows is conducted to highlight the current problems being faced in this field. The objective of the work presented in this thesis is to introduce a radically new approach to reconfigurable computing tool flows. The runtime based tool flow introduces complete abstraction between the application developer and the underlying hardware. This new technique eliminates the ease of use and backwards compatibility issues that have plagued the reconfigurable computing concept, and could pave the way for viable mainstream reconfigurable computing platforms. An easy to use, cycle accurate behavioural modelling system is also presented, which was used extensively during the early exploration of new concepts and architectures. Some performance improvements produced by the new reconfigurable computing tool flow, when applied to both a MIPS based embedded platform, and the Cray XDl, are also presented. These results are then analyzed and the hardware and software factors affecting the performance increases that were obtained are discussed, together with potential techniques that could be used to further increase the performance of the system. Lastly a heterogenous computing concept is proposed, in which, a computer system, containing multiple types of computational resource is envisaged, each having their own strengths and weaknesses (e.g. DSPs, CPUs, FPGAs). A revolutionary new method of fully exploiting the potential of such a system, whilst maintaining scalability, backwards compatibility, and ease of use is also presented
Deep neural mobile networking
The next generation of mobile networks is set to become increasingly complex, as these struggle to accommodate tremendous data traffic demands generated by ever-more connected devices that have diverse performance requirements in terms of throughput, latency, and reliability. This makes monitoring and managing the multitude of network elements intractable with existing tools and impractical for traditional machine learning algorithms that rely on hand-crafted feature engineering. In this context, embedding machine intelligence into mobile networks becomes necessary, as this enables systematic mining of valuable information from mobile big data and automatically uncovering correlations that would otherwise have been too difficult to extract by human experts. In particular, deep learning based solutions can automatically extract features from raw data, without human expertise. The performance of artificial intelligence (AI) has achieved in other domains draws unprecedented interest from both academia and industry in employing deep learning approaches to address technical challenges in mobile networks.
This thesis attacks important problems in the mobile networking area from various perspectives by harnessing recent advances in deep neural networks. As a preamble, we bridge the gap between deep learning and mobile networking by presenting a survey on the crossovers between the two areas. Secondly, we design dedicated deep learning architectures to forecast mobile traffic consumption at city scale. In particular, we tailor our deep neural network models to different mobile traffic data structures (i.e. data originating from urban grids and geospatial point-cloud antenna deployments) to deliver precise prediction. Next, we propose a mobile traffic super resolution (MTSR) technique to achieve coarse-to-fine grain transformations on mobile traffic measurements using generative adversarial network architectures. This can provide insightful knowledge to mobile operators about mobile traffic distribution, while effectively reducing the data post-processing overhead. Subsequently, the mobile traffic decomposition (MTD) technique is proposed to break the aggregated mobile traffic measurements into service-level time series, by using a deep learning based framework. With MTD, mobile operators can perform more efficient resource allocation for network slicing (i.e, the logical partitioning of physical infrastructure) and alleviate the privacy concerns that come with the extensive use of deep packet inspection. Finally, we study the robustness of network specific deep anomaly detectors with a realistic black-box threat model and propose reliable solutions for defending against attacks that seek to subvert existing network deep learning based intrusion detection systems (NIDS).
Lastly, based on the results obtained, we identify important research directions that are worth pursuing in the future, including (i) serving deep learning with massive high-quality data (ii) deep learning for spatio-temporal mobile data mining (iii) deep learning for geometric mobile data mining (iv) deep unsupervised learning in mobile networks, and (v) deep reinforcement learning for mobile network control. Overall, this thesis demonstrates that deep learning can underpin powerful tools that address data-driven problems in the mobile networking domain. With such intelligence, future mobile networks can be monitored and managed more effectively and thus higher user quality of experience can be guaranteed
Trustworthy Federated Learning: A Survey
Federated Learning (FL) has emerged as a significant advancement in the field
of Artificial Intelligence (AI), enabling collaborative model training across
distributed devices while maintaining data privacy. As the importance of FL
increases, addressing trustworthiness issues in its various aspects becomes
crucial. In this survey, we provide an extensive overview of the current state
of Trustworthy FL, exploring existing solutions and well-defined pillars
relevant to Trustworthy . Despite the growth in literature on trustworthy
centralized Machine Learning (ML)/Deep Learning (DL), further efforts are
necessary to identify trustworthiness pillars and evaluation metrics specific
to FL models, as well as to develop solutions for computing trustworthiness
levels. We propose a taxonomy that encompasses three main pillars:
Interpretability, Fairness, and Security & Privacy. Each pillar represents a
dimension of trust, further broken down into different notions. Our survey
covers trustworthiness challenges at every level in FL settings. We present a
comprehensive architecture of Trustworthy FL, addressing the fundamental
principles underlying the concept, and offer an in-depth analysis of trust
assessment mechanisms. In conclusion, we identify key research challenges
related to every aspect of Trustworthy FL and suggest future research
directions. This comprehensive survey serves as a valuable resource for
researchers and practitioners working on the development and implementation of
Trustworthy FL systems, contributing to a more secure and reliable AI
landscape.Comment: 45 Pages, 8 Figures, 9 Table
Language and compiler support for dyanmic code generation
Thesis (Ph.D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 1999.Includes bibliographical references (p. 131-135).by Massimiliano A. Poletto.Ph.D