38 research outputs found

    Database System Acceleration on FPGAs

    Get PDF
    Relational database systems provide various services and applications with an efficient means for storing, processing, and retrieving their data. The performance of these systems has a direct impact on the quality of service of the applications that rely on them. Therefore, it is crucial that database systems are able to adapt and grow in tandem with the demands of these applications, ensuring that their performance scales accordingly. In the past, Moore's law and algorithmic advancements have been sufficient to meet these demands. However, with the slowdown of Moore's law, researchers have begun exploring alternative methods, such as application-specific technologies, to satisfy the more challenging performance requirements. One such technology is field-programmable gate arrays (FPGAs), which provide ideal platforms for developing and running custom architectures for accelerating database systems. The goal of this thesis is to develop a domain-specific architecture that can enhance the performance of in-memory database systems when executing analytical queries. Our research is guided by a combination of academic and industrial requirements that seek to strike a balance between generality and performance. The former ensures that our platform can be used to process a diverse range of workloads, while the latter makes it an attractive solution for high-performance use cases. Throughout this thesis, we present the development of a system-on-chip for database system acceleration that meets our requirements. The resulting architecture, called CbMSMK, is capable of processing the projection, sort, aggregation, and equi-join database operators and can also run some complex TPC-H queries. CbMSMK employs a shared sort-merge pipeline for executing all these operators, which results in an efficient use of FPGA resources. This approach enables the instantiation of multiple acceleration cores on the FPGA, allowing it to serve multiple clients simultaneously. CbMSMK can process both arbitrarily deep and wide tables efficiently. The former is achieved through the use of the sort-merge algorithm which utilizes the FPGA RAM for buffering intermediate sort results. The latter is achieved through the use of KeRRaS, a novel variant of the forward radix sort algorithm introduced in this thesis. KeRRaS allows CbMSMK to process a table a few columns at a time, incrementally generating the final result through multiple iterations. Given that acceleration is a key objective of our work, CbMSMK benefits from many performance optimizations. For instance, multi-way merging is employed to reduce the number of merge passes required for the execution of the sort-merge algorithm, thus improving the performance of all our pipeline-breaking operators. Another example is our in-depth analysis of early aggregation, which led to the development of a novel cache-based algorithm that significantly enhances aggregation performance. Our experiments demonstrate that CbMSMK performs on average 5 times faster than the state-of-the-art CPU-based database management system MonetDB.:I Database Systems & FPGAs 1 INTRODUCTION 1.1 Databases & the Importance of Performance 1.2 Accelerators & FPGAs 1.3 Requirements 1.4 Outline & Summary of Contributions 2 BACKGROUND ON DATABASE SYSTEMS 2.1 Databases 2.1.1 Storage Model 2.1.2 Storage Medium 2.2 Database Operators 2.2.1 Projection 2.2.2 Filter 2.2.3 Sort 2.2.4 Aggregation 2.2.5 Join 2.2.6 Operator Classification 2.3 Database Queries 2.4 Impact of Acceleration 3 BACKGROUND ON FPGAS 3.1 FPGA 3.1.1 Logic Element 3.1.2 Block RAM (BRAM) 3.1.3 Digital Signal Processor (DSP) 3.1.4 IO Element 3.1.5 Programmable Interconnect 3.2 FPGADesignFlow 3.2.1 Specifications 3.2.2 RTL Description 3.2.3 Verification 3.2.4 Synthesis, Mapping, Placement, and Routing 3.2.5 TimingAnalysis 3.2.6 Bitstream Generation and FPGA Programming 3.3 Implementation Quality Metrics 3.4 FPGA Cards 3.5 Benefits of Using FPGAs 3.6 Challenges of Using FPGAs 4 RELATED WORK 4.1 Summary of Related Work 4.2 Platform Type 4.2.1 Accelerator Card 4.2.2 Coprocessor 4.2.3 Smart Storage 4.2.4 Network Processor 4.3 Implementation 4.3.1 Loop-based implementation 4.3.2 Sort-based Implementation 4.3.3 Hash-based Implementation 4.3.4 Mixed Implementation 4.4 A Note on Quantitative Performance Comparisons II Cache-Based Morphing Sort-Merge with KeRRaS (CbMSMK) 5 OBJECTIVES AND ARCHITECTURE OVERVIEW 5.1 From Requirements to Objectives 5.2 Architecture Overview 5.3 Outlineof Part II 6 COMPARATIVE ANALYSIS OF OPENCL AND RTL FOR SORT-MERGE PRIMITIVES ON FPGAS 6.1 Programming FPGAs 6.2 RelatedWork 6.3 Architecture 6.3.1 Global Architecture 6.3.2 Sorter Architecture 6.3.3 Merger Architecture 6.3.4 Scalability and Resource Adaptability 6.4 Experiments 6.4.1 OpenCL Sort-Merge Implementation 6.4.2 RTLSorters 6.4.3 RTLMergers 6.4.4 Hybrid OpenCL-RTL Sort-Merge Implementation 6.5 Summary & Discussion 7 RESOURCE-EFFICIENT ACCELERATION OF PIPELINE-BREAKING DATABASE OPERATORS ON FPGAS 7.1 The Case for Resource Efficiency 7.2 Related Work 7.3 Architecture 7.3.1 Sorters 7.3.2 Sort-Network 7.3.3 X:Y Mergers 7.3.4 Merge-Network 7.3.5 Join Materialiser (JoinMat) 7.4 Experiments 7.4.1 Experimental Setup 7.4.2 Implementation Description & Tuning 7.4.3 Sort Benchmarks 7.4.4 Aggregation Benchmarks 7.4.5 Join Benchmarks 7. Summary 8 KERRAS: COLUMN-ORIENTED WIDE TABLE PROCESSING ON FPGAS 8.1 The Scope of Database System Accelerators 8.2 Related Work 8.3 Key-Reduce Radix Sort(KeRRaS) 8.3.1 Time Complexity 8.3.2 Space Complexity (Memory Utilization) 8.3.3 Discussion and Optimizations 8.4 Architecture 8.4.1 MSM 8.4.2 MSMK: Extending MSM with KeRRaS 8.4.3 Payload, Aggregation and Join Processing 8.4.4 Limitations 8.5 Experiments 8.5.1 Experimental Setup 8.5.2 Datasets 8.5.3 MSMK vs. MSM 8.5.4 Payload-Less Benchmarks 8.5.5 Payload-Based Benchmarks 8.5.6 Flexibility 8.6 Summary 9 A STUDY OF EARLY AGGREGATION IN DATABASE QUERY PROCESSING ON FPGAS 9.1 Early Aggregation 9.2 Background & Related Work 9.2.1 Sort-Based Early Aggregation 9.2.2 Cache-Based Early Aggregation 9.3 Simulations 9.3.1 Datasets 9.3.2 Metrics 9.3.3 Sort-Based Versus Cache-Based Early Aggregation 9.3.4 Comparison of Set-Associative Caches 9.3.5 Comparison of Cache Structures 9.3.6 Comparison of Replacement Policies 9.3.7 Cache Selection Methodology 9.4 Cache System Architecture 9.4.1 Window Aggregator 9.4.2 Compressor & Hasher 9.4.3 Collision Detector 9.4.4 Collision Resolver 9.4.5 Cache 9.5 Experiments 9.5.1 Experimental Setup 9.5.2 Resource Utilization and Parameter Tuning 9.5.3 Datasets 9.5.4 Benchmarks on Synthetic Data 9.5.5 Benchmarks on Real Data 9.6 Summary 10 THE FULL PICTURE 10.1 System Architecture 10.2 Benchmarks 10.3 Meeting the Objectives III Conclusion 11 SUMMARY AND OUTLOOK ON FUTURE RESEARCH 11.1 Summary 11.2 Future Work BIBLIOGRAPHY LIST OF FIGURES LIST OF TABLE

    Taming Numbers and Durations in the Model Checking Integrated Planning System

    Full text link
    The Model Checking Integrated Planning System (MIPS) is a temporal least commitment heuristic search planner based on a flexible object-oriented workbench architecture. Its design clearly separates explicit and symbolic directed exploration algorithms from the set of on-line and off-line computed estimates and associated data structures. MIPS has shown distinguished performance in the last two international planning competitions. In the last event the description language was extended from pure propositional planning to include numerical state variables, action durations, and plan quality objective functions. Plans were no longer sequences of actions but time-stamped schedules. As a participant of the fully automated track of the competition, MIPS has proven to be a general system; in each track and every benchmark domain it efficiently computed plans of remarkable quality. This article introduces and analyzes the most important algorithmic novelties that were necessary to tackle the new layers of expressiveness in the benchmark problems and to achieve a high level of performance. The extensions include critical path analysis of sequentially generated plans to generate corresponding optimal parallel plans. The linear time algorithm to compute the parallel plan bypasses known NP hardness results for partial ordering by scheduling plans with respect to the set of actions and the imposed precedence relations. The efficiency of this algorithm also allows us to improve the exploration guidance: for each encountered planning state the corresponding approximate sequential plan is scheduled. One major strength of MIPS is its static analysis phase that grounds and simplifies parameterized predicates, functions and operators, that infers knowledge to minimize the state description length, and that detects domain object symmetries. The latter aspect is analyzed in detail. MIPS has been developed to serve as a complete and optimal state space planner, with admissible estimates, exploration engines and branching cuts. In the competition version, however, certain performance compromises had to be made, including floating point arithmetic, weighted heuristic search exploration according to an inadmissible estimate and parameterized optimization

    A Method for Image Classification Using Low-Precision Analog Computing Arrays

    Get PDF
    Computing with analog micro electronics can offer several advantages over standard digital technology, most notably: Low space and power consumption and massive parallelization. On the other hand, analog computation lacks the exactness of digital calculations due to inevitable device variations introduced during the chip production, but also due to electric noise in the analog signals. Artificial neural networks are well suited for parallel analog implementations, first, because of their inherent parallelity and second, because they can adapt to device imperfections by training. This thesis evaluates the feasibility of implementing a convolutional neural network for image classification on a massively parallel low-power hardware system. A particular, mixed analogdigital, hardware model is considered, featuring simple threshold neurons. Appropriate, gradient-free, training algorithms, combining self-organization and supervised learning are developed and tested with two benchmark problems (MNIST hand-written digits and traffic signs). Software simulations evaluate the methods under various defined computation faults. A model-free closed-loop technique is shown to compensate for rather serious computation errors without the need for explicit error quantification. Last but not least, the developed networks and the training techniques are verified on a real prototype chip

    Topics in Compressed Sensing

    Get PDF
    Compressed sensing has a wide range of applications that include error correction, imaging, radar and many more. Given a sparse signal in a high dimensional space, one wishes to reconstruct that signal accurately and efficiently from a number of linear measurements much less than its actual dimension. Although in theory it is clear that this is possible, the difficulty lies in the construction of algorithms that perform the recovery efficiently, as well as determining which kind of linear measurements allow for the reconstruction. There have been two distinct major approaches to sparse recovery that each present different benefits and shortcomings. The first, L1-minimization methods such as Basis Pursuit, use a linear optimization problem to recover the signal. This method provides strong guarantees and stability, but relies on Linear Programming, whose methods do not yet have strong polynomially bounded runtimes. The second approach uses greedy methods that compute the support of the signal iteratively. These methods are usually much faster than Basis Pursuit, but until recently had not been able to provide the same guarantees. This gap between the two approaches was bridged when we developed and analyzed the greedy algorithm Regularized Orthogonal Matching Pursuit (ROMP). ROMP provides similar guarantees to Basis Pursuit as well as the speed of a greedy algorithm. Our more recent algorithm Compressive Sampling Matching Pursuit (CoSaMP) improves upon these guarantees, and is optimal in every important aspect

    Just-in-time Hardware generation for abstracted reconfigurable computing

    Get PDF
    This thesis addresses the use of reconfigurable hardware in computing platforms, in order to harness the performance benefits of dedicated hardware whilst maintaining the flexibility associated with software. Although the reconfigurable computing concept is not new, the low level nature of the supporting tools normally used, together with the consequent limited level of abstraction and resultant lack of backwards compatibility, has prevented the widespread adoption of this technology. In addition, bandwidth and architectural limitations, have seriously constrained the potential improvements in performance. A review of existing approaches and tools flows is conducted to highlight the current problems being faced in this field. The objective of the work presented in this thesis is to introduce a radically new approach to reconfigurable computing tool flows. The runtime based tool flow introduces complete abstraction between the application developer and the underlying hardware. This new technique eliminates the ease of use and backwards compatibility issues that have plagued the reconfigurable computing concept, and could pave the way for viable mainstream reconfigurable computing platforms. An easy to use, cycle accurate behavioural modelling system is also presented, which was used extensively during the early exploration of new concepts and architectures. Some performance improvements produced by the new reconfigurable computing tool flow, when applied to both a MIPS based embedded platform, and the Cray XDl, are also presented. These results are then analyzed and the hardware and software factors affecting the performance increases that were obtained are discussed, together with potential techniques that could be used to further increase the performance of the system. Lastly a heterogenous computing concept is proposed, in which, a computer system, containing multiple types of computational resource is envisaged, each having their own strengths and weaknesses (e.g. DSPs, CPUs, FPGAs). A revolutionary new method of fully exploiting the potential of such a system, whilst maintaining scalability, backwards compatibility, and ease of use is also presented

    Deep neural mobile networking

    Get PDF
    The next generation of mobile networks is set to become increasingly complex, as these struggle to accommodate tremendous data traffic demands generated by ever-more connected devices that have diverse performance requirements in terms of throughput, latency, and reliability. This makes monitoring and managing the multitude of network elements intractable with existing tools and impractical for traditional machine learning algorithms that rely on hand-crafted feature engineering. In this context, embedding machine intelligence into mobile networks becomes necessary, as this enables systematic mining of valuable information from mobile big data and automatically uncovering correlations that would otherwise have been too difficult to extract by human experts. In particular, deep learning based solutions can automatically extract features from raw data, without human expertise. The performance of artificial intelligence (AI) has achieved in other domains draws unprecedented interest from both academia and industry in employing deep learning approaches to address technical challenges in mobile networks. This thesis attacks important problems in the mobile networking area from various perspectives by harnessing recent advances in deep neural networks. As a preamble, we bridge the gap between deep learning and mobile networking by presenting a survey on the crossovers between the two areas. Secondly, we design dedicated deep learning architectures to forecast mobile traffic consumption at city scale. In particular, we tailor our deep neural network models to different mobile traffic data structures (i.e. data originating from urban grids and geospatial point-cloud antenna deployments) to deliver precise prediction. Next, we propose a mobile traffic super resolution (MTSR) technique to achieve coarse-to-fine grain transformations on mobile traffic measurements using generative adversarial network architectures. This can provide insightful knowledge to mobile operators about mobile traffic distribution, while effectively reducing the data post-processing overhead. Subsequently, the mobile traffic decomposition (MTD) technique is proposed to break the aggregated mobile traffic measurements into service-level time series, by using a deep learning based framework. With MTD, mobile operators can perform more efficient resource allocation for network slicing (i.e, the logical partitioning of physical infrastructure) and alleviate the privacy concerns that come with the extensive use of deep packet inspection. Finally, we study the robustness of network specific deep anomaly detectors with a realistic black-box threat model and propose reliable solutions for defending against attacks that seek to subvert existing network deep learning based intrusion detection systems (NIDS). Lastly, based on the results obtained, we identify important research directions that are worth pursuing in the future, including (i) serving deep learning with massive high-quality data (ii) deep learning for spatio-temporal mobile data mining (iii) deep learning for geometric mobile data mining (iv) deep unsupervised learning in mobile networks, and (v) deep reinforcement learning for mobile network control. Overall, this thesis demonstrates that deep learning can underpin powerful tools that address data-driven problems in the mobile networking domain. With such intelligence, future mobile networks can be monitored and managed more effectively and thus higher user quality of experience can be guaranteed

    Trustworthy Federated Learning: A Survey

    Full text link
    Federated Learning (FL) has emerged as a significant advancement in the field of Artificial Intelligence (AI), enabling collaborative model training across distributed devices while maintaining data privacy. As the importance of FL increases, addressing trustworthiness issues in its various aspects becomes crucial. In this survey, we provide an extensive overview of the current state of Trustworthy FL, exploring existing solutions and well-defined pillars relevant to Trustworthy . Despite the growth in literature on trustworthy centralized Machine Learning (ML)/Deep Learning (DL), further efforts are necessary to identify trustworthiness pillars and evaluation metrics specific to FL models, as well as to develop solutions for computing trustworthiness levels. We propose a taxonomy that encompasses three main pillars: Interpretability, Fairness, and Security & Privacy. Each pillar represents a dimension of trust, further broken down into different notions. Our survey covers trustworthiness challenges at every level in FL settings. We present a comprehensive architecture of Trustworthy FL, addressing the fundamental principles underlying the concept, and offer an in-depth analysis of trust assessment mechanisms. In conclusion, we identify key research challenges related to every aspect of Trustworthy FL and suggest future research directions. This comprehensive survey serves as a valuable resource for researchers and practitioners working on the development and implementation of Trustworthy FL systems, contributing to a more secure and reliable AI landscape.Comment: 45 Pages, 8 Figures, 9 Table

    Language and compiler support for dyanmic code generation

    Get PDF
    Thesis (Ph.D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 1999.Includes bibliographical references (p. 131-135).by Massimiliano A. Poletto.Ph.D

    15th SC@RUG 2018 proceedings 2017-2018

    Get PDF
    corecore