35 research outputs found

    Implementation of Ultra-Low Latency and High-Speed Communication Channels for an FPGA-Based HPC Cluster

    An FPGA-based cluster benefits from the flexibility and performance potential of FPGA technology. Because price and power consumption are increasingly important in the High-Performance Computing market, multi-FPGA exploration is becoming more popular each year. Network latency has failed to keep up with other improvements in computer performance. Complex communication stacks have sacrificed latency and added overhead to achieve other goals, and are in most cases inefficient and unnecessary. Existing commercial off-the-shelf communication protocols and Network Interface Controllers for FPGA-to-FPGA interconnection lack the performance to support time-critical applications and tightly coupled System-on-Chip partitioning. Custom communication approaches are preferred instead. This work presents ultra-low latency, high-speed communication channels for an FPGA-based cluster. The target platform consists of two BEE3 systems, each containing four Virtex-5 FPGAs interconnected in a ring topology.
Our approach exploits Multi-Gigabit Transceiver technology to achieve a reliable 8 Gbps channel bandwidth. The proposed communication IP supports data transfer between thousands of coprocessors over the network by means of a direct network implementation with hop-by-hop packet routing. Experimental results showed a latency of only 34 clock cycles between two neighboring nodes, one of the lowest reported in the literature. In addition, an architecture suitable for High-Performance Computing is proposed, which supports scalable, parallel, and distributed processing. For an 8-FPGA platform, the architecture provides 35.6 GB/s of off-chip memory throughput, 128 Gbps of aggregate network bandwidth, and 8.9 GFLOPS of computing power. A large, dense matrix-vector solver is partitioned and implemented across the cluster. We achieved competitive performance and computational efficiency as a result of the low communication overhead among the distributed processing elements. This work supports further research in intensive parallel computing and enables large System-on-Chip partitioning and scaling on FPGA-based clusters.
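As a rough illustration of hop-by-hop routing on a ring, the sketch below picks the shorter direction around an 8-node ring and estimates latency from the hop count. It assumes a bidirectional ring and assumes that latency scales linearly with hops at the reported 34 cycles per neighbor-to-neighbor transfer; neither assumption is confirmed by the abstract, and the function names are hypothetical.

```python
HOP_LATENCY_CYCLES = 34  # neighbor-to-neighbor latency reported in the abstract


def ring_route(src, dst, n_nodes=8):
    """Return the list of nodes visited from src to dst along the shorter ring direction."""
    if not (0 <= src < n_nodes and 0 <= dst < n_nodes):
        raise ValueError("node index out of range")
    clockwise = (dst - src) % n_nodes   # hops when going in increasing order
    counter = (src - dst) % n_nodes     # hops when going in decreasing order
    step = 1 if clockwise <= counter else -1
    path = [src]
    while path[-1] != dst:
        path.append((path[-1] + step) % n_nodes)
    return path


def estimated_latency_cycles(src, dst, n_nodes=8):
    """Assumed model: total latency = number of hops * per-hop latency."""
    hops = len(ring_route(src, dst, n_nodes)) - 1
    return hops * HOP_LATENCY_CYCLES


print(ring_route(0, 6))                # [0, 7, 6]
print(estimated_latency_cycles(0, 6))  # 68
```

Under this model the worst case on an 8-node ring is 4 hops, so the estimate stays within a few hundred clock cycles.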

    Solid State Disk drive synthetic performances Analysis of 4th Gen. NVMe Protocol support

    This paper presents a synthetic performance analysis of a solid-state drive that supports the NVMe 4.0 protocol. Results were obtained with the disk benchmarking tools CrystalDiskMark and ATTO Disk Benchmark on a reference test system. Synthetic tests also measured sequential and random read/write performance at different queue depths and with data block sizes of 4K, 32K, 256K and 8 MB. All results were compared with the older NVMe 3.0 protocol standard and with the SATA III standard.

    Database System Acceleration on FPGAs

    Relational database systems provide various services and applications with an efficient means for storing, processing, and retrieving their data. The performance of these systems has a direct impact on the quality of service of the applications that rely on them. Therefore, it is crucial that database systems are able to adapt and grow in tandem with the demands of these applications, ensuring that their performance scales accordingly. In the past, Moore's law and algorithmic advancements have been sufficient to meet these demands. However, with the slowdown of Moore's law, researchers have begun exploring alternative methods, such as application-specific technologies, to satisfy the more challenging performance requirements. One such technology is field-programmable gate arrays (FPGAs), which provide ideal platforms for developing and running custom architectures for accelerating database systems. The goal of this thesis is to develop a domain-specific architecture that can enhance the performance of in-memory database systems when executing analytical queries. Our research is guided by a combination of academic and industrial requirements that seek to strike a balance between generality and performance. The former ensures that our platform can be used to process a diverse range of workloads, while the latter makes it an attractive solution for high-performance use cases. Throughout this thesis, we present the development of a system-on-chip for database system acceleration that meets our requirements. The resulting architecture, called CbMSMK, is capable of processing the projection, sort, aggregation, and equi-join database operators and can also run some complex TPC-H queries. CbMSMK employs a shared sort-merge pipeline for executing all these operators, which results in an efficient use of FPGA resources. This approach enables the instantiation of multiple acceleration cores on the FPGA, allowing it to serve multiple clients simultaneously.
CbMSMK can process both arbitrarily deep and wide tables efficiently. The former is achieved through the use of the sort-merge algorithm which utilizes the FPGA RAM for buffering intermediate sort results. The latter is achieved through the use of KeRRaS, a novel variant of the forward radix sort algorithm introduced in this thesis. KeRRaS allows CbMSMK to process a table a few columns at a time, incrementally generating the final result through multiple iterations. Given that acceleration is a key objective of our work, CbMSMK benefits from many performance optimizations. For instance, multi-way merging is employed to reduce the number of merge passes required for the execution of the sort-merge algorithm, thus improving the performance of all our pipeline-breaking operators. Another example is our in-depth analysis of early aggregation, which led to the development of a novel cache-based algorithm that significantly enhances aggregation performance. Our experiments demonstrate that CbMSMK performs on average 5 times faster than the state-of-the-art CPU-based database management system MonetDB.
    Contents:
    I Database Systems & FPGAs
        1 INTRODUCTION
            1.1 Databases & the Importance of Performance
            1.2 Accelerators & FPGAs
            1.3 Requirements
            1.4 Outline & Summary of Contributions
        2 BACKGROUND ON DATABASE SYSTEMS
            2.1 Databases: 2.1.1 Storage Model; 2.1.2 Storage Medium
            2.2 Database Operators: 2.2.1 Projection; 2.2.2 Filter; 2.2.3 Sort; 2.2.4 Aggregation; 2.2.5 Join; 2.2.6 Operator Classification
            2.3 Database Queries
            2.4 Impact of Acceleration
        3 BACKGROUND ON FPGAS
            3.1 FPGA: 3.1.1 Logic Element; 3.1.2 Block RAM (BRAM); 3.1.3 Digital Signal Processor (DSP); 3.1.4 IO Element; 3.1.5 Programmable Interconnect
            3.2 FPGA Design Flow: 3.2.1 Specifications; 3.2.2 RTL Description; 3.2.3 Verification; 3.2.4 Synthesis, Mapping, Placement, and Routing; 3.2.5 Timing Analysis; 3.2.6 Bitstream Generation and FPGA Programming
            3.3 Implementation Quality Metrics
            3.4 FPGA Cards
            3.5 Benefits of Using FPGAs
            3.6 Challenges of Using FPGAs
        4 RELATED WORK
            4.1 Summary of Related Work
            4.2 Platform Type: 4.2.1 Accelerator Card; 4.2.2 Coprocessor; 4.2.3 Smart Storage; 4.2.4 Network Processor
            4.3 Implementation: 4.3.1 Loop-based implementation; 4.3.2 Sort-based Implementation; 4.3.3 Hash-based Implementation; 4.3.4 Mixed Implementation
            4.4 A Note on Quantitative Performance Comparisons
    II Cache-Based Morphing Sort-Merge with KeRRaS (CbMSMK)
        5 OBJECTIVES AND ARCHITECTURE OVERVIEW
            5.1 From Requirements to Objectives
            5.2 Architecture Overview
            5.3 Outline of Part II
        6 COMPARATIVE ANALYSIS OF OPENCL AND RTL FOR SORT-MERGE PRIMITIVES ON FPGAS
            6.1 Programming FPGAs
            6.2 Related Work
            6.3 Architecture: 6.3.1 Global Architecture; 6.3.2 Sorter Architecture; 6.3.3 Merger Architecture; 6.3.4 Scalability and Resource Adaptability
            6.4 Experiments: 6.4.1 OpenCL Sort-Merge Implementation; 6.4.2 RTL Sorters; 6.4.3 RTL Mergers; 6.4.4 Hybrid OpenCL-RTL Sort-Merge Implementation
            6.5 Summary & Discussion
        7 RESOURCE-EFFICIENT ACCELERATION OF PIPELINE-BREAKING DATABASE OPERATORS ON FPGAS
            7.1 The Case for Resource Efficiency
            7.2 Related Work
            7.3 Architecture: 7.3.1 Sorters; 7.3.2 Sort-Network; 7.3.3 X:Y Mergers; 7.3.4 Merge-Network; 7.3.5 Join Materialiser (JoinMat)
            7.4 Experiments: 7.4.1 Experimental Setup; 7.4.2 Implementation Description & Tuning; 7.4.3 Sort Benchmarks; 7.4.4 Aggregation Benchmarks; 7.4.5 Join Benchmarks
            7. Summary
        8 KERRAS: COLUMN-ORIENTED WIDE TABLE PROCESSING ON FPGAS
            8.1 The Scope of Database System Accelerators
            8.2 Related Work
            8.3 Key-Reduce Radix Sort (KeRRaS): 8.3.1 Time Complexity; 8.3.2 Space Complexity (Memory Utilization); 8.3.3 Discussion and Optimizations
            8.4 Architecture: 8.4.1 MSM; 8.4.2 MSMK: Extending MSM with KeRRaS; 8.4.3 Payload, Aggregation and Join Processing; 8.4.4 Limitations
            8.5 Experiments: 8.5.1 Experimental Setup; 8.5.2 Datasets; 8.5.3 MSMK vs. MSM; 8.5.4 Payload-Less Benchmarks; 8.5.5 Payload-Based Benchmarks; 8.5.6 Flexibility
            8.6 Summary
        9 A STUDY OF EARLY AGGREGATION IN DATABASE QUERY PROCESSING ON FPGAS
            9.1 Early Aggregation
            9.2 Background & Related Work: 9.2.1 Sort-Based Early Aggregation; 9.2.2 Cache-Based Early Aggregation
            9.3 Simulations: 9.3.1 Datasets; 9.3.2 Metrics; 9.3.3 Sort-Based Versus Cache-Based Early Aggregation; 9.3.4 Comparison of Set-Associative Caches; 9.3.5 Comparison of Cache Structures; 9.3.6 Comparison of Replacement Policies; 9.3.7 Cache Selection Methodology
            9.4 Cache System Architecture: 9.4.1 Window Aggregator; 9.4.2 Compressor & Hasher; 9.4.3 Collision Detector; 9.4.4 Collision Resolver; 9.4.5 Cache
            9.5 Experiments: 9.5.1 Experimental Setup; 9.5.2 Resource Utilization and Parameter Tuning; 9.5.3 Datasets; 9.5.4 Benchmarks on Synthetic Data; 9.5.5 Benchmarks on Real Data
            9.6 Summary
        10 THE FULL PICTURE
            10.1 System Architecture
            10.2 Benchmarks
            10.3 Meeting the Objectives
    III Conclusion
        11 SUMMARY AND OUTLOOK ON FUTURE RESEARCH
            11.1 Summary
            11.2 Future Work
    BIBLIOGRAPHY
    LIST OF FIGURES
    LIST OF TABLE
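The abstract's point that multi-way merging reduces the number of merge passes can be illustrated in software. The sketch below is not the CbMSMK hardware pipeline; it is a plain Python sort-merge in which `run_length` and `fan_in` are hypothetical parameters standing in for the initial sorter size and the merger width.

```python
import heapq
import math


def merge_passes(n_runs, fan_in):
    """Number of passes needed to merge n_runs sorted runs when each pass merges fan_in runs."""
    if fan_in < 2:
        raise ValueError("fan_in must be at least 2")
    passes = 0
    while n_runs > 1:
        n_runs = math.ceil(n_runs / fan_in)
        passes += 1
    return passes


def multiway_merge_sort(values, run_length=4, fan_in=4):
    """Sort by building short sorted runs, then merging fan_in runs per pass.

    Returns the sorted list and the number of merge passes that were executed.
    """
    if fan_in < 2:
        raise ValueError("fan_in must be at least 2")
    runs = [sorted(values[i:i + run_length]) for i in range(0, len(values), run_length)]
    passes = 0
    while len(runs) > 1:
        runs = [list(heapq.merge(*runs[i:i + fan_in])) for i in range(0, len(runs), fan_in)]
        passes += 1
    return (runs[0] if runs else []), passes


data = list(range(63, -1, -1))  # 64 values, i.e. 16 initial runs of length 4
for k in (2, 4, 16):
    _, n_passes = multiway_merge_sort(data, run_length=4, fan_in=k)
    print(k, n_passes)
```

With 16 initial runs, a 2-way merger needs four passes over the data, a 4-way merger two, and a 16-way merger one, which is why a wider merger benefits every operator that depends on a full sort.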

    Fixed-latency system for high-speed serial transmission between FPGA devices with Forward Error Correction

    This paper presents the design of a compact protocol for fixed-latency, high-speed, reliable serial transmission between simple field-programmable gate array (FPGA) devices. The implementation aims to delineate word boundaries, provide randomness to the electromagnetic interference (EMI) generated by the electrical transitions, allow for clock recovery, and maintain direct current (DC) balance. An orthogonal concatenated coding scheme is used for correcting transmission errors, using a modified Bose–Chaudhuri–Hocquenghem (BCH) code capable of correcting all single-bit errors and most double-adjacent errors. As a result, all burst errors of a length up to 31 bits, and some of the longer group errors, are corrected within a 256-bit packet. The efficiency of the proposed solution equals 46.48%, as 119 out of 256 bits are fully available to the user. The design has been implemented and tested on the Xilinx Kintex UltraScale+ KCU116 Evaluation Kit with a data rate of 28.2 Gbps. A sample latency analysis has also been performed so that the user can easily carry out calculations for different transmission speeds. The main advancement of the work is the use of a modified BCH(15, 11) code that leads to high error correction capabilities for burst errors and a user-friendly packet length.
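For orientation, here is a minimal sketch of the standard, unmodified BCH(15, 11) code with generator polynomial x^4 + x + 1, which corrects any single-bit error in a 15-bit codeword. It is not the paper's modified code: it does not correct double-adjacent errors and includes none of the concatenation, scrambling, or framing the abstract describes. The last line reproduces the efficiency figure from the abstract.

```python
GEN = 0b10011  # g(x) = x^4 + x + 1, generator of the standard BCH(15, 11) code
N, K = 15, 11  # codeword length and message length in bits


def poly_mod(value, gen=GEN):
    """Remainder of value(x) divided by gen(x), with coefficients in GF(2)."""
    gen_degree = gen.bit_length() - 1
    while value.bit_length() - 1 >= gen_degree:
        value ^= gen << (value.bit_length() - 1 - gen_degree)
    return value


def encode(message):
    """Systematic encoding: 11 message bits followed by 4 parity bits."""
    if not 0 <= message < (1 << K):
        raise ValueError("message must fit in 11 bits")
    shifted = message << (N - K)
    return shifted | poly_mod(shifted)


# A single-bit error at position i produces the syndrome x^i mod g(x).
# Because g(x) is primitive, the 15 syndromes are distinct and nonzero.
SYNDROME_TO_BIT = {poly_mod(1 << i): i for i in range(N)}


def decode(received):
    """Correct at most one flipped bit, then return the 11 message bits."""
    syndrome = poly_mod(received)
    if syndrome:
        received ^= 1 << SYNDROME_TO_BIT[syndrome]
    return received >> (N - K)


codeword = encode(0b10110011101)
corrupted = codeword ^ (1 << 7)  # flip one bit in transit
print(decode(corrupted) == 0b10110011101)

# Efficiency as stated in the abstract: 119 user bits per 256-bit packet.
print(round(100 * 119 / 256, 2))
```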

    A Framework for the Design and Analysis of High-Performance Applications on FPGAs using Partial Reconfiguration

    The field-programmable gate array (FPGA) is a dynamically reconfigurable digital logic chip used to implement custom hardware. The large densities of modern FPGAs and the capability for on-the-fly reconfiguration have made the FPGA a viable alternative to fixed-logic hardware chips such as the ASIC. In high-performance computing, FPGAs are used as co-processors to speed up computationally intensive processes or as autonomous systems that realize a complete hardware application. However, due to the limited capacity of FPGA logic resources, denser FPGAs must be purchased if more logic resources are required to realize all the functions of a complex application. Alternatively, partial reconfiguration (PR) can be used to swap, on demand, idle components of the application with active components. This research uses PR to swap components to improve the performance of the application given the limited logic resources available with smaller but economical FPGAs. The swap is called “resource sharing PR”. In a pipelined design of multiple hardware modules (pipeline stages), resource sharing PR is a technique that uses PR to improve the performance of pipeline bottlenecks. This is done by reconfiguring other pipeline stages, typically those that are idle waiting for data from a bottleneck, into an additional parallel bottleneck module. The target pipeline of this research is a two-stage “slow-to-fast” pipeline where the flow of data traversing the pipeline transitions from a relatively slow bottleneck stage to a fast stage. A two-stage pipeline that combines FPGA-based hardware implementations of well-known bioinformatics search algorithms, the X! Tandem algorithm and the Smith-Waterman algorithm, is implemented for this research; the implemented pipeline demonstrates the characteristics of these algorithms.
The experimental results show that, in a database of unknown peptide spectra, when matching spectra with 388 peaks or more, performing resource sharing PR to instantiate a parallel X! Tandem module is worth the cost of PR. In addition, from timings gathered during experiments, a general formula was derived for determining the value of performing PR upon a fast module.
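The abstract does not state the derived formula, so the following is only an assumed break-even model of the idea: PR pays off once the time saved by running the bottleneck stage in parallel exceeds the reconfiguration time. All names and parameters (`t_slow`, `t_reconfig`, `n_parallel`) are hypothetical, and the model ignores the fast stage entirely.

```python
import math


def time_without_pr(n_items, t_slow):
    """Total bottleneck time when a single slow module processes every item."""
    return n_items * t_slow


def time_with_pr(n_items, t_slow, t_reconfig, n_parallel=2):
    """Assumed model: pay the reconfiguration cost once, then split work evenly."""
    return t_reconfig + n_items * t_slow / n_parallel


def break_even_items(t_slow, t_reconfig, n_parallel=2):
    """Smallest item count for which reconfiguring is strictly faster under this model."""
    saved_per_item = t_slow * (1 - 1 / n_parallel)
    return math.floor(t_reconfig / saved_per_item) + 1


# Example with made-up timings: 2 time units per item, 100 units to reconfigure.
n = break_even_items(t_slow=2.0, t_reconfig=100.0)
print(n, time_without_pr(n, 2.0), time_with_pr(n, 2.0, 100.0))
```

A threshold of this kind is consistent in spirit with the reported result that PR is worthwhile only for spectra above a certain number of peaks, but the actual formula in the thesis may differ.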

    Efficient Smart CMOS Camera Based on FPGAs Oriented to Embedded Image Processing

    This article describes an image processing system based on an intelligent ad-hoc camera, whose two principal elements are a high-speed 1.2 megapixel Complementary Metal Oxide Semiconductor (CMOS) sensor and a Field Programmable Gate Array (FPGA). The latter is used to control the various sensor parameter configurations and, where desired, to receive and process the images captured by the CMOS sensor. The flexibility and versatility offered by the new FPGA families make it possible to incorporate microprocessors into these reconfigurable devices, and these are normally used for highly sequential tasks unsuitable for parallelization in hardware. For the present study, we used a Xilinx XC4VFX12 FPGA, which contains an internal PowerPC (PPC) microprocessor. In turn, this contains a standalone system which manages the FPGA image processing hardware and endows the system with multiple software options for processing the images captured by the CMOS sensor. The system also incorporates an Ethernet channel for sending processed and unprocessed images from the FPGA to a remote node. Consequently, it is possible to visualize and configure system operation and captured and/or processed images remotely.

    A Study on Hardware Systems for Data Analytics Based on Bitmap Index

    Recent years have witnessed a massive growth of global data generated from web services, social media networks, and science experiments, as well as the “tsunami” of Internet-of-Things devices. According to a Cisco forecast, total data center traffic is projected to hit 15.3 zettabytes (ZB) by the end of 2020. Gaining insight into a vast amount of data is highly important because valuable data are the driving force for business decisions and processes, as well as scientists' exploration and discovery.
To facilitate analytics, data are usually indexed in advance. Depending on the workload, such as online transaction processing (OLTP) or online analytical processing (OLAP), several indexing frameworks have been proposed. Specifically, B+-tree and hash are two common indexing methods in OLTP, where the numbers of querying and updating processes are nearly the same. Unlike OLTP, OLAP concentrates on querying a huge historical store, where updating processes are irregular. Most queries in OLAP are also highly complex and involve aggregations, while the execution time is often limited. To address these challenges, the bitmap index (BI) was proposed and has proven to be a promising candidate for OLAP-like workloads.
A BI is a bit-level matrix whose numbers of rows and columns are the length and cardinality of the dataset, respectively. With a BI, answering multi-dimensional queries becomes a series of bitwise operations, e.g. AND, OR, XOR, and NOT, on bit columns. As a result, a BI has proven profitable for solving complex queries in large enterprise databases and scientific databases. More significantly, because it relies on logical operators that are cheap in hardware, a BI appears to be suitable for advanced parallel-processing platforms, such as multi-core CPUs, graphics processing units (GPUs), field-programmable gate arrays (FPGAs), and application-specific integrated circuits (ASICs).
Modern FPGAs and ASICs have become increasingly important in data analytics because they can handle both data-intensive and computing-intensive tasks effectively. Furthermore, FPGAs and ASICs can provide higher energy efficiency than CPUs and GPUs. As a result, since 2010, Microsoft has been working on the so-called Catapult project, where FPGAs were integrated into datacenter servers to accelerate their search engine as well as AI applications. In 2016, Oracle for the first time introduced SPARC S7 and M7 processors that are used for accelerating OLTP databases. Nonetheless, a study on the feasibility of BI-based analytics systems using FPGAs and ASICs has not yet been carried out.
This dissertation, therefore, focuses on implementing data analytics systems, in both FPGAs and ASICs, using BI. The advantages of the proposed systems include scalability, low data input/output cost, high processing throughput, and high energy efficiency. Three main modules are proposed: (1) a BI creator that indexes the given records by a list of keys and outputs the BI vectors to the external memory; (2) a BI-based query processor that employs the given BI vectors to answer users' queries and outputs the results to the external memory; and (3) a BI encoder that returns the positions of the one-bits of bitmap results to the external memory. Six hardware systems based on those three modules are first implemented in an FPGA for functional verification and then partially in two ASICs (180-nm bulk complementary metal-oxide-semiconductor (CMOS) and 65-nm Silicon-On-Thin-Buried-Oxide (SOTB) CMOS technology) for physical design verification. Based on the experimental results, these proposed systems outperform other CPU-based and GPU-based designs, especially in terms of energy efficiency.
The University of Electro-Communications, 201
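The three modules described above (BI creator, query processor, and encoder) map onto a few lines of software. The sketch below is a software analogy only, not the dissertation's hardware design: it stores each bit column as a Python integer, and the record fields and function names are hypothetical.

```python
def build_bitmap_index(records, attribute):
    """BI creator: map each distinct value of `attribute` to a bit vector (an int).

    Bit i of a vector is 1 when record i holds that value.
    """
    index = {}
    for row, record in enumerate(records):
        value = record[attribute]
        index[value] = index.get(value, 0) | (1 << row)
    return index


def one_bit_positions(bitmap):
    """BI encoder: return the row numbers whose bit is set in `bitmap`."""
    positions = []
    row = 0
    while bitmap:
        if bitmap & 1:
            positions.append(row)
        bitmap >>= 1
        row += 1
    return positions


records = [
    {"city": "Tokyo", "status": "active"},
    {"city": "Osaka", "status": "inactive"},
    {"city": "Tokyo", "status": "inactive"},
    {"city": "Kyoto", "status": "active"},
]
city = build_bitmap_index(records, "city")
status = build_bitmap_index(records, "status")
all_rows = (1 << len(records)) - 1  # mask that keeps NOT within the table length

# Query processor: each predicate combination is a single bitwise operation.
tokyo_and_active = city["Tokyo"] & status["active"]
tokyo_or_kyoto = city["Tokyo"] | city["Kyoto"]
not_active = ~status["active"] & all_rows

print(one_bit_positions(tokyo_and_active))  # [0]
print(one_bit_positions(tokyo_or_kyoto))    # [0, 2, 3]
print(one_bit_positions(not_active))        # [1, 2]
```

Because every query reduces to word-wide AND, OR, XOR, and NOT, the same operations parallelize naturally across wide datapaths, which is the property the abstract relies on for FPGA and ASIC implementations.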

    Real-Time High-Resolution Multiple-Camera Depth Map Estimation Hardware and Its Applications

    Depth information is used in a variety of 3D-based signal processing applications such as autonomous navigation of robots and driving systems, object detection and tracking, computer games, 3D television, and free viewpoint synthesis. These applications require high accuracy and high speed in depth estimation. Depth maps can be generated using disparity estimation methods, in which disparities are obtained from stereo matching between multiple images. The computational complexity of disparity estimation algorithms and the need for large, high-bandwidth external and internal memory make real-time disparity estimation challenging, especially for high-resolution images. This thesis proposes a high-resolution, high-quality multiple-camera depth map estimation hardware. The proposed hardware is verified in real time with a complete system from the initial image capture to the display and applications. The details of the complete system are presented. The proposed binocular and trinocular adaptive window size disparity estimation algorithms are carefully designed to be suitable for real-time hardware implementation by allowing efficient parallel and local processing while providing high-quality results. The proposed binocular and trinocular disparity estimation hardware implementations can process 55 frames per second on a Virtex-7 FPGA at a 1024 x 768 XGA video resolution for a 128-pixel disparity range. The proposed binocular disparity estimation hardware provides the best quality compared to existing real-time high-resolution disparity estimation hardware implementations. A novel compressed look-up table based rectification algorithm and its real-time hardware implementation are presented. The low-complexity decompression process of the rectification hardware utilizes a negligible amount of the LUT and DFF resources of the FPGA and does not require external memory.
The first real-time high-resolution free viewpoint synthesis hardware utilizing three-camera disparity estimation is presented. The proposed hardware generates high-quality free viewpoint video in real time for any horizontally aligned arbitrary camera positioned between the leftmost and rightmost physical cameras. The full embedded system of the depth estimation is explained. The presented embedded system transfers disparity results together with synchronized RGB pixels to the PC for application development. Several real-time applications are developed on a PC using the obtained RGB+D results. The implemented depth estimation based real-time software applications are: depth-based image thresholding, speed and distance measurement, head-hands-shoulders tracking, virtual mouse using hand tracking, and face tracking integrated with free viewpoint synthesis. The proposed binocular disparity estimation hardware is implemented in an ASIC. The ASIC implementation of disparity estimation imposes additional constraints with respect to the FPGA implementation. These restrictions, the efficient solutions implemented for them, and the ASIC implementation results are presented. In addition, a very high-resolution (82.3 MP) 360°x90° omnidirectional multiple-camera system is proposed. The hemispherical camera system is able to view target locations close to the horizontal plane with more than two cameras. Therefore, it can be used in high-resolution 360° depth map estimation and its applications in the future.
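To make the idea of window-based disparity estimation concrete, the sketch below runs a fixed-window sum of absolute differences (SAD) search along a single rectified scanline. It is a generic textbook baseline, not the thesis's adaptive window size algorithm, and the window size, disparity range, and sample pixel values are invented for illustration.

```python
def sad_disparity(left, right, window=1, max_disparity=4):
    """Per-pixel disparity for one rectified scanline using fixed-window SAD.

    For each pixel x in `left`, try every shift d and keep the one whose
    window in `right` (centered at x - d) differs least from the left window.
    """
    width = len(left)
    disparities = []
    for x in range(width):
        best_d, best_cost = 0, float("inf")
        for d in range(min(max_disparity, x) + 1):
            cost = 0
            for dx in range(-window, window + 1):
                xl = min(max(x + dx, 0), width - 1)      # clamp at the image border
                xr = min(max(x + dx - d, 0), width - 1)
                cost += abs(left[xl] - right[xr])
            if cost < best_cost:
                best_d, best_cost = d, cost
        disparities.append(best_d)
    return disparities


# The left scanline is the right scanline shifted by two pixels.
right = [10, 20, 80, 30, 90, 40, 15, 60, 25, 70]
left = [10, 10, 10, 20, 80, 30, 90, 40, 15, 60]
print(sad_disparity(left, right)[3:9])  # interior pixels recover the shift of 2
```

The inner loops are independent per pixel and per candidate shift, which is the kind of parallel, local processing the abstract says the hardware algorithms are designed around.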