51 research outputs found

    FLiMS: a fast lightweight 2-way merger for sorting

    Get PDF
    In this paper, we present FLiMS, a highly-efficient and simple parallel algorithm for merging two sorted lists residing in banked and/or wide memory. On FPGAs, its implementation uses fewer hardware resources than the state-of-the-art alternatives, due to the reduced number of comparators and elimination of redundant logic found on prior attempts. In combination with the distributed nature of the selector stage, a higher performance is achieved for the same amount of parallelism or higher. This is useful in many applications such as in parallel merge trees to achieve high-throughput sorting, where the resource utilisation of the merger is critical for building larger trees and internalising the workload for faster computation. Also presented are efficient variations of FLiMS for optimizing throughput for skewed datasets, achieving stable sorting or using fewer dequeue signals. FLiMS is also shown to perform well as conventional software on modern CPUs supporting single-instruction multiple-data (SIMD) instructions, surpassing the performance of some standard libraries for sorting

    Database System Acceleration on FPGAs

    Get PDF
    Relational database systems provide various services and applications with an efficient means for storing, processing, and retrieving their data. The performance of these systems has a direct impact on the quality of service of the applications that rely on them. Therefore, it is crucial that database systems are able to adapt and grow in tandem with the demands of these applications, ensuring that their performance scales accordingly. In the past, Moore's law and algorithmic advancements have been sufficient to meet these demands. However, with the slowdown of Moore's law, researchers have begun exploring alternative methods, such as application-specific technologies, to satisfy the more challenging performance requirements. One such technology is field-programmable gate arrays (FPGAs), which provide ideal platforms for developing and running custom architectures for accelerating database systems. The goal of this thesis is to develop a domain-specific architecture that can enhance the performance of in-memory database systems when executing analytical queries. Our research is guided by a combination of academic and industrial requirements that seek to strike a balance between generality and performance. The former ensures that our platform can be used to process a diverse range of workloads, while the latter makes it an attractive solution for high-performance use cases. Throughout this thesis, we present the development of a system-on-chip for database system acceleration that meets our requirements. The resulting architecture, called CbMSMK, is capable of processing the projection, sort, aggregation, and equi-join database operators and can also run some complex TPC-H queries. CbMSMK employs a shared sort-merge pipeline for executing all these operators, which results in an efficient use of FPGA resources. This approach enables the instantiation of multiple acceleration cores on the FPGA, allowing it to serve multiple clients simultaneously. CbMSMK can process both arbitrarily deep and wide tables efficiently. The former is achieved through the use of the sort-merge algorithm which utilizes the FPGA RAM for buffering intermediate sort results. The latter is achieved through the use of KeRRaS, a novel variant of the forward radix sort algorithm introduced in this thesis. KeRRaS allows CbMSMK to process a table a few columns at a time, incrementally generating the final result through multiple iterations. Given that acceleration is a key objective of our work, CbMSMK benefits from many performance optimizations. For instance, multi-way merging is employed to reduce the number of merge passes required for the execution of the sort-merge algorithm, thus improving the performance of all our pipeline-breaking operators. Another example is our in-depth analysis of early aggregation, which led to the development of a novel cache-based algorithm that significantly enhances aggregation performance. Our experiments demonstrate that CbMSMK performs on average 5 times faster than the state-of-the-art CPU-based database management system MonetDB.:I Database Systems & FPGAs 1 INTRODUCTION 1.1 Databases & the Importance of Performance 1.2 Accelerators & FPGAs 1.3 Requirements 1.4 Outline & Summary of Contributions 2 BACKGROUND ON DATABASE SYSTEMS 2.1 Databases 2.1.1 Storage Model 2.1.2 Storage Medium 2.2 Database Operators 2.2.1 Projection 2.2.2 Filter 2.2.3 Sort 2.2.4 Aggregation 2.2.5 Join 2.2.6 Operator Classification 2.3 Database Queries 2.4 Impact of Acceleration 3 BACKGROUND ON FPGAS 3.1 FPGA 3.1.1 Logic Element 3.1.2 Block RAM (BRAM) 3.1.3 Digital Signal Processor (DSP) 3.1.4 IO Element 3.1.5 Programmable Interconnect 3.2 FPGADesignFlow 3.2.1 Specifications 3.2.2 RTL Description 3.2.3 Verification 3.2.4 Synthesis, Mapping, Placement, and Routing 3.2.5 TimingAnalysis 3.2.6 Bitstream Generation and FPGA Programming 3.3 Implementation Quality Metrics 3.4 FPGA Cards 3.5 Benefits of Using FPGAs 3.6 Challenges of Using FPGAs 4 RELATED WORK 4.1 Summary of Related Work 4.2 Platform Type 4.2.1 Accelerator Card 4.2.2 Coprocessor 4.2.3 Smart Storage 4.2.4 Network Processor 4.3 Implementation 4.3.1 Loop-based implementation 4.3.2 Sort-based Implementation 4.3.3 Hash-based Implementation 4.3.4 Mixed Implementation 4.4 A Note on Quantitative Performance Comparisons II Cache-Based Morphing Sort-Merge with KeRRaS (CbMSMK) 5 OBJECTIVES AND ARCHITECTURE OVERVIEW 5.1 From Requirements to Objectives 5.2 Architecture Overview 5.3 Outlineof Part II 6 COMPARATIVE ANALYSIS OF OPENCL AND RTL FOR SORT-MERGE PRIMITIVES ON FPGAS 6.1 Programming FPGAs 6.2 RelatedWork 6.3 Architecture 6.3.1 Global Architecture 6.3.2 Sorter Architecture 6.3.3 Merger Architecture 6.3.4 Scalability and Resource Adaptability 6.4 Experiments 6.4.1 OpenCL Sort-Merge Implementation 6.4.2 RTLSorters 6.4.3 RTLMergers 6.4.4 Hybrid OpenCL-RTL Sort-Merge Implementation 6.5 Summary & Discussion 7 RESOURCE-EFFICIENT ACCELERATION OF PIPELINE-BREAKING DATABASE OPERATORS ON FPGAS 7.1 The Case for Resource Efficiency 7.2 Related Work 7.3 Architecture 7.3.1 Sorters 7.3.2 Sort-Network 7.3.3 X:Y Mergers 7.3.4 Merge-Network 7.3.5 Join Materialiser (JoinMat) 7.4 Experiments 7.4.1 Experimental Setup 7.4.2 Implementation Description & Tuning 7.4.3 Sort Benchmarks 7.4.4 Aggregation Benchmarks 7.4.5 Join Benchmarks 7. Summary 8 KERRAS: COLUMN-ORIENTED WIDE TABLE PROCESSING ON FPGAS 8.1 The Scope of Database System Accelerators 8.2 Related Work 8.3 Key-Reduce Radix Sort(KeRRaS) 8.3.1 Time Complexity 8.3.2 Space Complexity (Memory Utilization) 8.3.3 Discussion and Optimizations 8.4 Architecture 8.4.1 MSM 8.4.2 MSMK: Extending MSM with KeRRaS 8.4.3 Payload, Aggregation and Join Processing 8.4.4 Limitations 8.5 Experiments 8.5.1 Experimental Setup 8.5.2 Datasets 8.5.3 MSMK vs. MSM 8.5.4 Payload-Less Benchmarks 8.5.5 Payload-Based Benchmarks 8.5.6 Flexibility 8.6 Summary 9 A STUDY OF EARLY AGGREGATION IN DATABASE QUERY PROCESSING ON FPGAS 9.1 Early Aggregation 9.2 Background & Related Work 9.2.1 Sort-Based Early Aggregation 9.2.2 Cache-Based Early Aggregation 9.3 Simulations 9.3.1 Datasets 9.3.2 Metrics 9.3.3 Sort-Based Versus Cache-Based Early Aggregation 9.3.4 Comparison of Set-Associative Caches 9.3.5 Comparison of Cache Structures 9.3.6 Comparison of Replacement Policies 9.3.7 Cache Selection Methodology 9.4 Cache System Architecture 9.4.1 Window Aggregator 9.4.2 Compressor & Hasher 9.4.3 Collision Detector 9.4.4 Collision Resolver 9.4.5 Cache 9.5 Experiments 9.5.1 Experimental Setup 9.5.2 Resource Utilization and Parameter Tuning 9.5.3 Datasets 9.5.4 Benchmarks on Synthetic Data 9.5.5 Benchmarks on Real Data 9.6 Summary 10 THE FULL PICTURE 10.1 System Architecture 10.2 Benchmarks 10.3 Meeting the Objectives III Conclusion 11 SUMMARY AND OUTLOOK ON FUTURE RESEARCH 11.1 Summary 11.2 Future Work BIBLIOGRAPHY LIST OF FIGURES LIST OF TABLE

    Co-design Hardware and Algorithm for Vector Search

    Full text link
    Vector search has emerged as the foundation for large-scale information retrieval and machine learning systems, with search engines like Google and Bing processing tens of thousands of queries per second on petabyte-scale document datasets by evaluating vector similarities between encoded query texts and web documents. As performance demands for vector search systems surge, accelerated hardware offers a promising solution in the post-Moore's Law era. We introduce \textit{FANNS}, an end-to-end and scalable vector search framework on FPGAs. Given a user-provided recall requirement on a dataset and a hardware resource budget, \textit{FANNS} automatically co-designs hardware and algorithm, subsequently generating the corresponding accelerator. The framework also supports scale-out by incorporating a hardware TCP/IP stack in the accelerator. \textit{FANNS} attains up to 23.0×\times and 37.2×\times speedup compared to FPGA and CPU baselines, respectively, and demonstrates superior scalability to GPUs, achieving 5.5×\times and 7.6×\times speedup in median and 95\textsuperscript{th} percentile (P95) latency within an eight-accelerator configuration. The remarkable performance of \textit{FANNS} lays a robust groundwork for future FPGA integration in data centers and AI supercomputers.Comment: 11 page

    Processamento de dados em Zynq APSoC

    Get PDF
    Mestrado em Engenharia de Computadores e TelemáticaField-Programmable Gate Arrays (FPGAs) were invented by Xilinx in 1985, i.e. less than 30 years ago. The influence of FPGAs on many directions in engineering is growing continuously and rapidly. There are many reasons for such progress and the most important are the inherent reconfigurability of FPGAs and relatively cheap development cost. Recent field-configurable micro-chips combine the capabilities of software and hardware by incorporating multi-core processors and reconfigurable logic enabling the development of highly optimized computational systems for a vast variety of practical applications, including high-performance computing, data, signal and image processing, embedded systems, and many others. In this context, the main goals of the thesis are to study the new micro-chips, namely the Zynq-7000 family and to apply them to two selected case studies: data sort and Hamming weight calculation for long vectors.Field-Programmable Gate Arrays (FPGAs) foram inventadas pela Xilinx em 1985, ou seja, há menos de 30 anos. A influência das FPGAs está a crescer continua e rapidamente em muitos ramos de engenharia. Há varias razões para esta evolução, as mais importantes são a sua capacidade de reconfiguração inerente e os baixos custos de desenvolvimento. Os micro-chips mais recentes baseados em FPGAs combinam capacidades de software e hardware através da incorporação de processadores multi-core e lógica reconfigurável permitindo o desenvolvimento de sistemas computacionais altamente otimizados para uma grande variedade de aplicações práticas, incluindo computação de alto desempenho, processamento de dados, de sinal e imagem, sistemas embutidos, e muitos outros. Neste contexto, este trabalho tem como o objetivo principal estudar estes novos micro-chips, nomeadamente a família Zynq-7000, para encontrar as melhores formas de potenciar as vantagens deste sistema usando casos de estudo como ordenação de dados e cálculo do peso de Hamming para vetores longos

    Advanced analytics through FPGA based query processing and deep reinforcement learning

    Get PDF
    Today, vast streams of structured and unstructured data have been incorporated in databases, and analytical processes are applied to discover patterns, correlations, trends and other useful relationships that help to take part in a broad range of decision-making processes. The amount of generated data has grown very large over the years, and conventional database processing methods from previous generations have not been sufficient to provide satisfactory results regarding analytics performance and prediction accuracy metrics. Thus, new methods are needed in a wide array of fields from computer architectures, storage systems, network design to statistics and physics. This thesis proposes two methods to address the current challenges and meet the future demands of advanced analytics. First, we present AxleDB, a Field Programmable Gate Array based query processing system which constitutes the frontend of an advanced analytics system. AxleDB melds highly-efficient accelerators with memory, storage and provides a unified programmable environment. AxleDB is capable of offloading complex Structured Query Language queries from host CPU. The experiments have shown that running a set of TPC-H queries, AxleDB can perform full queries between 1.8x and 34.2x faster and 2.8x to 62.1x more energy efficient compared to MonetDB, and PostgreSQL on a single workstation node. Second, we introduce TauRieL, a novel deep reinforcement learning (DRL) based method for combinatorial problems. The design idea behind combining DRL and combinatorial problems is to apply the prediction capabilities of deep reinforcement learning and to use the universality of combinatorial optimization problems to explore general purpose predictive methods. TauRieL utilizes an actor-critic inspired DRL architecture that adopts ordinary feedforward nets. Furthermore, TauRieL performs online training which unifies training and state space exploration. The experiments show that TauRieL can generate solutions two orders of magnitude faster and performs within 3% of accuracy compared to the state-of-the-art DRL on the Traveling Salesman Problem while searching for the shortest tour. Also, we present that TauRieL can be adapted to the Knapsack combinatorial problem. With a very minimal problem specific modification, TauRieL can outperform a Knapsack specific greedy heuristics.Hoy en día, se han incorporado grandes cantidades de datos estructurados y no estructurados en las bases de datos, y se les aplican procesos analíticos para descubrir patrones, correlaciones, tendencias y otras relaciones útiles que se utilizan mayormente para la toma de decisiones. La cantidad de datos generados ha crecido enormemente a lo largo de los años, y los métodos de procesamiento de bases de datos convencionales utilizados en las generaciones anteriores no son suficientes para proporcionar resultados satisfactorios respecto al rendimiento del análisis y respecto de la precisión de las predicciones. Por lo tanto, se necesitan nuevos métodos en una amplia gama de campos, desde arquitecturas de computadoras, sistemas de almacenamiento, diseño de redes hasta estadísticas y física. Esta tesis propone dos métodos para abordar los desafíos actuales y satisfacer las demandas futuras de análisis avanzado. Primero, presentamos AxleDB, un sistema de procesamiento de consultas basado en FPGAs (Field Programmable Gate Array) que constituye la interfaz de un sistema de análisis avanzado. AxleDB combina aceleradores altamente eficientes con memoria, almacenamiento y proporciona un entorno programable unificado. AxleDB es capaz de descargar consultas complejas de lenguaje de consulta estructurado desde la CPU del host. Los experimentos han demostrado que al ejecutar un conjunto de consultas TPC-H, AxleDB puede realizar consultas completas entre 1.8x y 34.2x más rápido y 2.8x a 62.1x más eficiente energéticamente que MonetDB, y PostgreSQL en un solo nodo de una estación de trabajo. En segundo lugar, presentamos TauRieL, un nuevo método basado en Deep Reinforcement Learning (DRL) para problemas combinatorios. La idea central que está detrás de la combinación de DRL y problemas combinatorios, es aplicar las capacidades de predicción del aprendizaje de refuerzo profundo y el uso de la universalidad de los problemas de optimización combinatoria para explorar métodos predictivos de propósito general. TauRieL utiliza una arquitectura DRL inspirada en el actor-crítico que se adapta a redes feedforward. Además, TauRieL realiza el entrenamieton en línea que unifica el entrenamiento y la exploración espacial de los estados. Los experimentos muestran que TauRieL puede generar soluciones dos órdenes de magnitud más rápido y funciona con un 3% de precisión en comparación con el estado del arte en DRL aplicado al problema del viajante mientras busca el recorrido más corto. Además, presentamos que TauRieL puede adaptarse al problema de la Mochila. Con una modificación específica muy mínima del problema, TauRieL puede superar a una heurística codiciosa de Knapsack Problem.Postprint (published version

    Acceleration of k-Nearest Neighbor and SRAD Algorithms Using Intel FPGA SDK for OpenCL

    Get PDF
    Field Programmable Gate Arrays (FPGAs) have been widely used for accelerating machine learning algorithms. However, the high design cost and time for implementing FPGA-based accelerators using traditional HDL-based design methodologies has discouraged users from designing FPGA-based accelerators. In recent years, a new CAD tool called Intel FPGA SDK for OpenCL (IFSO) allowed fast and efficient design of FPGA-based hardware accelerators from high level specification such as OpenCL. Even software engineers with basic hardware design knowledge could design FPGA-based accelerators. In this thesis, IFSO has been used to explore acceleration of k-Nearest-Neighbour (kNN) algorithm and Speckle Reducing Anisotropic Diffusion (SRAD) simulation using FPGAs. kNN is a popular algorithm used in machine learning. Bitonic sorting and radix sorting algorithms were used in the kNN algorithm to check if these provide any performance improvements. Acceleration of SRAD simulation was also explored. The experimental results obtained for these algorithms from FPGA-based acceleration were compared with the state of the art CPU implementation. The optimized algorithms were implemented on two different FPGAs (Intel Stratix A7 and Intel Arria 10 GX). Experimental results show that the FPGA-based accelerators provided similar or better execution time (up to 80X) and better power efficiency (75% reduction in power consumption) than traditional platforms such as a workstation based on two Intel Xeon processors E5-2620 Series (each with 6 cores and running at 2.4 GHz)
    corecore