96 research outputs found

    マルチノードFPGAによるストリームデータ分散結合処理

    Get PDF
     多彩なWebサービスの普及とセンサ技術の発達にともない,データセンタに収集されるデータの速度と量は増大を続けている.金融情報処理やネットワークトラフィック監視のようなアプリケーションでは途切れなく到着するストリームデータに対して厳しい時間的制約が課せられるため,従来のデータ処理モデルではインターフェイスがボトルネックになるなど十分な処理性能を実現できない. リアルタイム性が求められるアプリケーションを処理する仕組みであるData Stream Management System(DSMS)に対して性能向上を目的としてField Programmable Gate Array(FPGA)を利用した支援HWに関する研究が数多く行われている.その中でもストリームデータ結合処理の並列処理アルゴリズムであるHandshake JoinのHW実装では,CPU処理と比較して高い並列性とスループットを実現している.しかしFPGAで実行できる処理の規模を示すウィンドウサイズの拡大は単一のFPGAがもつリソース量に制約を持つという問題がある. 本研究報告では,ストリームデータの結合処理を行うHandshake JoinのFPGAアクセラレータをマルチノードに拡張する手法を提案し,その性能評価を報告する.マルチノード拡張は,データ通信の2つの工夫によって実現する.FPGA上に複数のJoin Core を実装すると共に,FPGAボード上のDRAMを介してFPGA間でJoin Coreを接続する仕組みを導入する.またデータ通信と結合演算をオーバラップすることで,通信に要するオーバヘッドを隠蔽する.これらのデータ配布方法の工夫により,単一のFPGAでは実装できなかった大きなウィンドウサイズの結合演算が可能となる.最大16ノードでの性能評価の結果,マルチノードに拡張した場合においても,結合演算のスループットが維持されるとともに,並列化効果が得られることが確認された.電気通信大学201

    Database System Acceleration on FPGAs

    Get PDF
    Relational database systems provide various services and applications with an efficient means for storing, processing, and retrieving their data. The performance of these systems has a direct impact on the quality of service of the applications that rely on them. Therefore, it is crucial that database systems are able to adapt and grow in tandem with the demands of these applications, ensuring that their performance scales accordingly. In the past, Moore's law and algorithmic advancements have been sufficient to meet these demands. However, with the slowdown of Moore's law, researchers have begun exploring alternative methods, such as application-specific technologies, to satisfy the more challenging performance requirements. One such technology is field-programmable gate arrays (FPGAs), which provide ideal platforms for developing and running custom architectures for accelerating database systems. The goal of this thesis is to develop a domain-specific architecture that can enhance the performance of in-memory database systems when executing analytical queries. Our research is guided by a combination of academic and industrial requirements that seek to strike a balance between generality and performance. The former ensures that our platform can be used to process a diverse range of workloads, while the latter makes it an attractive solution for high-performance use cases. Throughout this thesis, we present the development of a system-on-chip for database system acceleration that meets our requirements. The resulting architecture, called CbMSMK, is capable of processing the projection, sort, aggregation, and equi-join database operators and can also run some complex TPC-H queries. CbMSMK employs a shared sort-merge pipeline for executing all these operators, which results in an efficient use of FPGA resources. This approach enables the instantiation of multiple acceleration cores on the FPGA, allowing it to serve multiple clients simultaneously. CbMSMK can process both arbitrarily deep and wide tables efficiently. The former is achieved through the use of the sort-merge algorithm which utilizes the FPGA RAM for buffering intermediate sort results. The latter is achieved through the use of KeRRaS, a novel variant of the forward radix sort algorithm introduced in this thesis. KeRRaS allows CbMSMK to process a table a few columns at a time, incrementally generating the final result through multiple iterations. Given that acceleration is a key objective of our work, CbMSMK benefits from many performance optimizations. For instance, multi-way merging is employed to reduce the number of merge passes required for the execution of the sort-merge algorithm, thus improving the performance of all our pipeline-breaking operators. Another example is our in-depth analysis of early aggregation, which led to the development of a novel cache-based algorithm that significantly enhances aggregation performance. Our experiments demonstrate that CbMSMK performs on average 5 times faster than the state-of-the-art CPU-based database management system MonetDB.:I Database Systems & FPGAs 1 INTRODUCTION 1.1 Databases & the Importance of Performance 1.2 Accelerators & FPGAs 1.3 Requirements 1.4 Outline & Summary of Contributions 2 BACKGROUND ON DATABASE SYSTEMS 2.1 Databases 2.1.1 Storage Model 2.1.2 Storage Medium 2.2 Database Operators 2.2.1 Projection 2.2.2 Filter 2.2.3 Sort 2.2.4 Aggregation 2.2.5 Join 2.2.6 Operator Classification 2.3 Database Queries 2.4 Impact of Acceleration 3 BACKGROUND ON FPGAS 3.1 FPGA 3.1.1 Logic Element 3.1.2 Block RAM (BRAM) 3.1.3 Digital Signal Processor (DSP) 3.1.4 IO Element 3.1.5 Programmable Interconnect 3.2 FPGADesignFlow 3.2.1 Specifications 3.2.2 RTL Description 3.2.3 Verification 3.2.4 Synthesis, Mapping, Placement, and Routing 3.2.5 TimingAnalysis 3.2.6 Bitstream Generation and FPGA Programming 3.3 Implementation Quality Metrics 3.4 FPGA Cards 3.5 Benefits of Using FPGAs 3.6 Challenges of Using FPGAs 4 RELATED WORK 4.1 Summary of Related Work 4.2 Platform Type 4.2.1 Accelerator Card 4.2.2 Coprocessor 4.2.3 Smart Storage 4.2.4 Network Processor 4.3 Implementation 4.3.1 Loop-based implementation 4.3.2 Sort-based Implementation 4.3.3 Hash-based Implementation 4.3.4 Mixed Implementation 4.4 A Note on Quantitative Performance Comparisons II Cache-Based Morphing Sort-Merge with KeRRaS (CbMSMK) 5 OBJECTIVES AND ARCHITECTURE OVERVIEW 5.1 From Requirements to Objectives 5.2 Architecture Overview 5.3 Outlineof Part II 6 COMPARATIVE ANALYSIS OF OPENCL AND RTL FOR SORT-MERGE PRIMITIVES ON FPGAS 6.1 Programming FPGAs 6.2 RelatedWork 6.3 Architecture 6.3.1 Global Architecture 6.3.2 Sorter Architecture 6.3.3 Merger Architecture 6.3.4 Scalability and Resource Adaptability 6.4 Experiments 6.4.1 OpenCL Sort-Merge Implementation 6.4.2 RTLSorters 6.4.3 RTLMergers 6.4.4 Hybrid OpenCL-RTL Sort-Merge Implementation 6.5 Summary & Discussion 7 RESOURCE-EFFICIENT ACCELERATION OF PIPELINE-BREAKING DATABASE OPERATORS ON FPGAS 7.1 The Case for Resource Efficiency 7.2 Related Work 7.3 Architecture 7.3.1 Sorters 7.3.2 Sort-Network 7.3.3 X:Y Mergers 7.3.4 Merge-Network 7.3.5 Join Materialiser (JoinMat) 7.4 Experiments 7.4.1 Experimental Setup 7.4.2 Implementation Description & Tuning 7.4.3 Sort Benchmarks 7.4.4 Aggregation Benchmarks 7.4.5 Join Benchmarks 7. Summary 8 KERRAS: COLUMN-ORIENTED WIDE TABLE PROCESSING ON FPGAS 8.1 The Scope of Database System Accelerators 8.2 Related Work 8.3 Key-Reduce Radix Sort(KeRRaS) 8.3.1 Time Complexity 8.3.2 Space Complexity (Memory Utilization) 8.3.3 Discussion and Optimizations 8.4 Architecture 8.4.1 MSM 8.4.2 MSMK: Extending MSM with KeRRaS 8.4.3 Payload, Aggregation and Join Processing 8.4.4 Limitations 8.5 Experiments 8.5.1 Experimental Setup 8.5.2 Datasets 8.5.3 MSMK vs. MSM 8.5.4 Payload-Less Benchmarks 8.5.5 Payload-Based Benchmarks 8.5.6 Flexibility 8.6 Summary 9 A STUDY OF EARLY AGGREGATION IN DATABASE QUERY PROCESSING ON FPGAS 9.1 Early Aggregation 9.2 Background & Related Work 9.2.1 Sort-Based Early Aggregation 9.2.2 Cache-Based Early Aggregation 9.3 Simulations 9.3.1 Datasets 9.3.2 Metrics 9.3.3 Sort-Based Versus Cache-Based Early Aggregation 9.3.4 Comparison of Set-Associative Caches 9.3.5 Comparison of Cache Structures 9.3.6 Comparison of Replacement Policies 9.3.7 Cache Selection Methodology 9.4 Cache System Architecture 9.4.1 Window Aggregator 9.4.2 Compressor & Hasher 9.4.3 Collision Detector 9.4.4 Collision Resolver 9.4.5 Cache 9.5 Experiments 9.5.1 Experimental Setup 9.5.2 Resource Utilization and Parameter Tuning 9.5.3 Datasets 9.5.4 Benchmarks on Synthetic Data 9.5.5 Benchmarks on Real Data 9.6 Summary 10 THE FULL PICTURE 10.1 System Architecture 10.2 Benchmarks 10.3 Meeting the Objectives III Conclusion 11 SUMMARY AND OUTLOOK ON FUTURE RESEARCH 11.1 Summary 11.2 Future Work BIBLIOGRAPHY LIST OF FIGURES LIST OF TABLE

    Self-timed field programmmable gate array architectures

    Get PDF

    An Efficient and Scalable Implementation of Sliding-Window Aggregate Operator on FPGA

    Get PDF
    This paper presents an efficient and scalable implementation of an FPGA-based accelerator for sliding-window aggregates over disordered data streams. With an increasing number of overlapping sliding-windows, the window aggregates have a serious scalability issue, especially when it comes to implementing them in parallel processing hardware (e.g., FPGAs). To address the issue, we propose a resource-ef?cient, scalable, and order-agnostic hardware design and its implementation by examining and integrating two key concepts, called Window-ID and Pane, which are originally proposed for software implementation, respectively. Evaluation results show that the proposed implementation scales well compared to the previous FPGA implementation in terms of both resource consumption and performance. The proposed design is fully pipelined and our implementation can process out-of-order data items, or tuples, at wire speed up to 200 million tuples per second

    Hardware-conscious query processing for the many-core era

    Get PDF
    Die optimale Nutzung von moderner Hardware zur Beschleunigung von Datenbank-Anfragen ist keine triviale Aufgabe. Viele DBMS als auch DSMS der letzten Jahrzehnte basieren auf Sachverhalten, die heute kaum noch Gültigkeit besitzen. Ein Beispiel hierfür sind heutige Server-Systeme, deren Hauptspeichergröße im Bereich mehrerer Terabytes liegen kann und somit den Weg für Hauptspeicherdatenbanken geebnet haben. Einer der größeren letzten Hardware Trends geht hin zu Prozessoren mit einer hohen Anzahl von Kernen, den sogenannten Manycore CPUs. Diese erlauben hohe Parallelitätsgrade für Programme durch Multithreading sowie Vektorisierung (SIMD), was die Anforderungen an die Speicher-Bandbreite allerdings deutlich erhöht. Der sogenannte High-Bandwidth Memory (HBM) versucht diese Lücke zu schließen, kann aber ebenso wie Many-core CPUs jeglichen Performance-Vorteil negieren, wenn dieser leichtfertig eingesetzt wird. Diese Arbeit stellt die Many-core CPU-Architektur zusammen mit HBM vor, um Datenbank sowie Datenstrom-Anfragen zu beschleunigen. Es wird gezeigt, dass ein hardwarenahes Kostenmodell zusammen mit einem Kalibrierungsansatz die Performance verschiedener Anfrageoperatoren verlässlich vorhersagen kann. Dies ermöglicht sowohl eine adaptive Partitionierungs und Merge-Strategie für die Parallelisierung von Datenstrom-Anfragen als auch eine ideale Konfiguration von Join-Operationen auf einem DBMS. Nichtsdestotrotz ist nicht jede Operation und Anwendung für die Nutzung einer Many-core CPU und HBM geeignet. Datenstrom-Anfragen sind oft auch an niedrige Latenz und schnelle Antwortzeiten gebunden, welche von höherer Speicher-Bandbreite kaum profitieren können. Hinzu kommen üblicherweise niedrigere Taktraten durch die hohe Kernzahl der CPUs, sowie Nachteile für geteilte Datenstrukturen, wie das Herstellen von Cache-Kohärenz und das Synchronisieren von parallelen Thread-Zugriffen. Basierend auf den Ergebnissen dieser Arbeit lässt sich ableiten, welche parallelen Datenstrukturen sich für die Verwendung von HBM besonders eignen. Des Weiteren werden verschiedene Techniken zur Parallelisierung und Synchronisierung von Datenstrukturen vorgestellt, deren Effizienz anhand eines Mehrwege-Datenstrom-Joins demonstriert wird.Exploiting the opportunities given by modern hardware for accelerating query processing speed is no trivial task. Many DBMS and also DSMS from past decades are based on fundamentals that have changed over time, e.g., servers of today with terabytes of main memory capacity allow complete avoidance of spilling data to disk, which has prepared the ground some time ago for main memory databases. One of the recent trends in hardware are many-core processors with hundreds of logical cores on a single CPU, providing an intense degree of parallelism through multithreading as well as vectorized instructions (SIMD). Their demand for memory bandwidth has led to the further development of high-bandwidth memory (HBM) to overcome the memory wall. However, many-core CPUs as well as HBM have many pitfalls that can nullify any performance gain with ease. In this work, we explore the many-core architecture along with HBM for database and data stream query processing. We demonstrate that a hardware-conscious cost model with a calibration approach allows reliable performance prediction of various query operations. Based on that information, we can, therefore, come to an adaptive partitioning and merging strategy for stream query parallelization as well as finding an ideal configuration of parameters for one of the most common tasks in the history of DBMS, join processing. However, not all operations and applications can exploit a many-core processor or HBM, though. Stream queries optimized for low latency and quick individual responses usually do not benefit well from more bandwidth and suffer from penalties like low clock frequencies of many-core CPUs as well. Shared data structures between cores also lead to problems with cache coherence as well as high contention. Based on our insights, we give a rule of thumb which data structures are suitable to parallelize with focus on HBM usage. In addition, different parallelization schemas and synchronization techniques are evaluated, based on the example of a multiway stream join operation

    HAL-ASOS - Linux com aceleração em hardware para sistemas operativos dedicados à aplicação

    Get PDF
    Programa doutoral em Engenharia Eletrónica e de Computadores (PDEEC) (especialidade de Informática Industrial e Sistemas Embebidos)O ecossistema de sistemas embebidos de hoje tornou-se enorme, cobrindo vários e diferentes sistemas, exigindo desempenho e mobilidade completa enquanto atingem autonomias de bateria cada vez maiores. Mas a crescente frequência de relógio que resultou em dispositivos cada vez mais rápidos começou a estagnar antes dos transístores pararem de encolher. Plataformas Field Programmable Gate Array (FPGA) são uma solução alternativa para a implementação de sistemas completos e reconfiguráveis. Fornecem desempenho e eficiência computacional para satisfazer requisitos da aplicação e do sistema embebido. Vários Sistemas Operativos (SO) assistidos por FPGA foram propostos, mas ao estreitar seu foco na síntese do datapath do acelerador de hardware, a grande maioria ignora a integração semântica destes no SO. Ambientes de síntese de alto nível (HLS) elevaram a abstração além da linguagem de transferência de registo (RTL), seguindo uma abordagem específica de domínio enquanto misturam software e abstrações de hardware ad hoc, que dificultam as otimizações. Além disso, os modelos de programação para software e hardware reconfigurável carecem de semelhanças, o que com o tempo dificultará a Exploração do Ambiente de Design (DSE) e diminuirá o potencial de reutilização de código. Para responder a estas necessidades, propomos HAL-ASOS, uma ferramenta para implementar sistemas embebidos baseados em Linux que fornece (1) elasticidade no design em conformidade com a natureza evolutiva deste SO, (2) integração semântica profunda de tarefas de hardware nos modelos de programação do Linux, (3) facilidade na gestão de complexidade através de metodologia e ferramentas para apoiar o design, verificação e implementação, (4) orientada por princípios de design híbridos e eficiência no sistema. Para avaliar as funcionalidades da ferramenta, foi implementado um aplicativo criptográfico que demonstra alcance de desempenho enquanto se emprega a metodologia de design. Novos níveis de desempenho são atingidos numa aplicação de Visão por Computador que explora recursos de programação assíncrona-síncrona. Os resultados demonstram uma abordagem flexível na reconfiguração entre hardware e software, e desempenho que aumenta consistentemente com acréscimo de recursos ou frequência de relógio.Today’s embedded systems ecosystem became huge while covering several and different computer-based systems, demanding for performance and complete mobility while experiencing longer battery lives. But the rampant frequency that resulted in faster devices began hitting a wall even before transistors stopped shrinking. Field Programmable Gate Array (FPGA) platforms are an alternative solution towards implementing complete reconfigurable systems. They provide computational power, efficiency, in a lightweight solution to serve the application requirements and increase performance in the overall system. Several FPGA-assisted Operating Systems (OS) have been proposed, but by narrowing their focus on datapath synthesis of the hardware accelerator, they completely ignore the deep semantic integration of these accelerators into the OS. State-of-the-art High-Level Synthesis (HLS) environments have raised the level of abstraction beyond Register Transfer Language (RTL) by following a domain-specific approach while mixing ad hoc software and hardware abstractions, making harder for performance optimizations. Furthermore, the programming models for software and reconfigurable hardware lack commonalities, which in time will hinder the Design Space Exploration (DSE) and lower the potential for code reuse. To overcome these issues, we propose HAL-ASOS, a framework to implement Linux-based Embedded systems which provides (1) elasticity by design to comply with the evolutive nature of Linux, (2) deep semantic integration of the hardware tasks in the Linux programming models, (3) easy complexity management using methodology and tools to fully support design, verification and deployment, (4) hybrid and efficiency-oriented design principles. To evaluate the framework functionalities, a cryptographic application was implemented and demonstrates performance achievements while using the promoted application-driven design methodology. To demonstrate new levels of performance that can be achieved, a Computer Vision application explores several mixed asynchronous-synchronous programming features. Experiments demonstrate a flexible design approach in terms of hardware and software reconfiguration, and significant performance that increases consistently with the rising in processing resources or clock frequencies.Financial support received from Portuguese Foundation for Science and Technology (FCT) with the PhD grant SFRH/BD/82732/2011

    Quality-Driven Disorder Handling for M-way Sliding Window Stream Joins

    Full text link
    Sliding window join is one of the most important operators for stream applications. To produce high quality join results, a stream processing system must deal with the ubiquitous disorder within input streams which is caused by network delay, asynchronous source clocks, etc. Disorder handling involves an inevitable tradeoff between the latency and the quality of produced join results. To meet different requirements of stream applications, it is desirable to provide a user-configurable result-latency vs. result-quality tradeoff. Existing disorder handling approaches either do not provide such configurability, or support only user-specified latency constraints. In this work, we advocate the idea of quality-driven disorder handling, and propose a buffer-based disorder handling approach for sliding window joins, which minimizes sizes of input-sorting buffers, thus the result latency, while respecting user-specified result-quality requirements. The core of our approach is an analytical model which directly captures the relationship between sizes of input buffers and the produced result quality. Our approach is generic. It supports m-way sliding window joins with arbitrary join conditions. Experiments on real-world and synthetic datasets show that, compared to the state of the art, our approach can reduce the result latency incurred by disorder handling by up to 95% while providing the same level of result quality.Comment: 12 pages, 11 figures, IEEE ICDE 201
    corecore