
    A Framework for the Design and Analysis of High-Performance Applications on FPGAs using Partial Reconfiguration

    The field-programmable gate array (FPGA) is a dynamically reconfigurable digital logic chip used to implement custom hardware. The large densities of modern FPGAs and the capability of on-the-fly reconfiguration have made the FPGA a viable alternative to fixed-logic hardware chips such as the ASIC. In high-performance computing, FPGAs are used as co-processors to speed up computationally intensive processes or as autonomous systems that realize a complete hardware application. However, due to the limited capacity of FPGA logic resources, denser FPGAs must be purchased if more logic resources are required to realize all the functions of a complex application. Alternatively, partial reconfiguration (PR) can be used to swap, on demand, idle components of the application with active components. This research uses PR to improve the performance of an application given the limited logic resources of smaller but more economical FPGAs; the technique is called "resource-sharing PR". In a pipelined design of multiple hardware modules (pipeline stages), resource-sharing PR improves the performance of pipeline bottlenecks by reconfiguring other pipeline stages, typically those sitting idle while waiting for data from a bottleneck, into an additional parallel bottleneck module. The target pipeline of this research is a two-stage "slow-to-fast" pipeline, where the flow of data traversing the pipeline transitions from a relatively slow, bottleneck stage to a fast stage. A two-stage pipeline that combines FPGA-based hardware implementations of two well-known bioinformatics search algorithms, the X! Tandem algorithm and the Smith-Waterman algorithm, is implemented for this research; the implemented pipeline demonstrates the characteristics of these algorithms. The experimental results show that, in a database of unknown peptide spectra, when matching spectra with 388 peaks or more, performing resource-sharing PR to instantiate a parallel X! Tandem module is worth the cost of PR. In addition, from timings gathered during the experiments, a general formula was derived for determining the value of performing PR upon a fast module.
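    The abstract does not reproduce the derived formula, but the break-even reasoning can be sketched with hypothetical symbols: let $t_{PR}$ be the time to partially reconfigure the fast stage into a second bottleneck module, $t_b$ the per-item service time of the bottleneck, and $N$ the number of items left to process. Doubling the bottleneck roughly halves the remaining work, so the swap pays off when

    $$ N t_b - \left( t_{PR} + \frac{N t_b}{2} \right) > 0 \quad\Longleftrightarrow\quad N > \frac{2\, t_{PR}}{t_b}. $$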

    AxleDB: A novel programmable query processing platform on FPGA

    With the rise of Big Data, providing high-performance query processing capabilities through the acceleration of database analytics has gained significant attention. Leveraging Field Programmable Gate Array (FPGA) technology, this approach can lead to clear benefits. In this work, we present the design and implementation of AxleDB: an FPGA-based platform that enables fast query processing for database systems by melding novel database-specific accelerators with commercial off-the-shelf (COTS) storage using modern interfaces, in a novel, unified, and programmable environment. AxleDB can perform a large subset of SQL queries through its instruction set, which maps compute-intensive database operations, such as filter, arithmetic, aggregate, group by, table join, and sort, onto the specialized high-throughput accelerators. To minimize the number of SSD I/O operations required, AxleDB also supports hardware MinMax indexing for databases. We evaluated AxleDB with five decision-support queries from the TPC-H benchmark suite and achieved speedups of 1.8X to 34.2X and energy-efficiency gains of 2.8X to 62.1X compared to the state-of-the-art DBMSs PostgreSQL and MonetDB. The research leading to these results has received funding from the European Union Seventh Framework Programme (FP7) (under the AXLE project, GA number 318633), the Ministry of Economy and Competitiveness of Spain (under contract number TIN2015-65316-p), the Turkish Ministry of Development TAM Project (number 2007K120610), and Bogazici University Scientific Projects (number 7060).
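    The MinMax index is what lets AxleDB skip SSD reads: if a block's stored minimum and maximum prove it cannot satisfy a range predicate, that block is never fetched. A minimal software sketch of this block-pruning idea (the names and layout here are illustrative, not AxleDB's actual interface):

    def build_minmax_index(column, block_size):
        """Store (min, max) per fixed-size block of a column."""
        index = []
        for start in range(0, len(column), block_size):
            block = column[start:start + block_size]
            index.append((min(block), max(block)))
        return index

    def blocks_to_scan(index, lo, hi):
        """Return ids of blocks that may hold values in [lo, hi];
        all other blocks, and their SSD reads, are skipped."""
        return [i for i, (bmin, bmax) in enumerate(index)
                if bmax >= lo and bmin <= hi]

    prices = [3, 7, 2, 9, 15, 14, 11, 18, 25, 22, 30, 27]
    idx = build_minmax_index(prices, block_size=4)
    print(blocks_to_scan(idx, lo=20, hi=29))  # only the last block qualifies: [2]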

    It's all about data movement: Optimising FPGA data access to boost performance

    The use of reconfigurable computing, and FPGAs in particular, to accelerate computational kernels has the potential to be of great benefit to scientific codes and the HPC community in general. However, whilst recent advances in FPGA tooling have made the physical act of programming reconfigurable architectures much more accessible, in order to gain good performance the entire algorithm must be rethought and recast in a dataflow style. Reducing the cost of data movement is critically important for all computing devices, and in this paper we explore the most appropriate techniques for FPGAs. We do this by describing the optimisation of an existing FPGA implementation of an atmospheric model's advection scheme. Starting from an FPGA code that was over four times slower than running on the CPU, mainly due to data-movement overhead, we describe the profiling and optimisation strategies adopted to significantly reduce the runtime and bring the performance of our FPGA kernels to a much more practical level for real-world use. The result of this work is a set of techniques, steps, and lessons learnt that we have found to significantly improve the performance of FPGA-based HPC codes, and that others can adopt in their own codes to achieve similar results. (Preprint of an article in the 2019 IEEE/ACM International Workshop on Heterogeneous High-performance Reconfigurable Computing, H2RC.)
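    The heart of the dataflow recast is to stop re-reading neighbouring elements from external memory and instead stream the input once through a small on-chip window. A conceptual sketch of that contrast, using a toy 3-point stencil rather than the paper's advection kernel:

    def stencil_naive(x):
        # Each output re-reads three inputs: roughly 3N external accesses.
        return [x[i - 1] + x[i] + x[i + 1] for i in range(1, len(x) - 1)]

    def stencil_streamed(x):
        # One pass over the input (N external accesses); the 3-element
        # window models shift registers held on chip.
        out, window = [], [x[0], x[1]]
        for v in x[2:]:
            window.append(v)           # shift in the newest element
            out.append(sum(window))    # compute from on-chip state only
            window.pop(0)              # shift out the oldest element
        return out

    data = list(range(10))
    assert stencil_naive(data) == stencil_streamed(data)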

    Optimizing the FPGA memory design for a Sobel edge detector

    This research explored different memory systems on FPGA chips to show the various trade-offs involved in choosing one memory system over another. We explored the different memory components found on FPGA chips using the example of a Sobel edge detector, and demonstrated how the different on-chip memories affected I/O performance and area. By exploiting the trade-offs between these, a designer should be able to find an optimal on-chip memory system for a given application. Given further study, we believe we can develop application-specific memory templates that can be used with a hardware compiler to generate optimal on-chip memory systems.
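    For context on why the memory choice matters here: Sobel computes each output pixel from a 3x3 neighbourhood, so buffering two full image rows in on-chip block RAM lets every pixel be fetched from external memory exactly once. A minimal reference sketch of the computation (plain software; the hardware mapping is only noted in comments):

    def sobel_magnitude(img):
        """img: list of rows of grayscale ints; returns |Gx| + |Gy| per pixel."""
        h, w = len(img), len(img[0])
        out = [[0] * w for _ in range(h)]
        for y in range(1, h - 1):
            # In hardware, rows y-1 and y would sit in two line buffers
            # (block RAM) and the 3x3 window in registers; here we index
            # the image directly.
            for x in range(1, w - 1):
                gx = (img[y-1][x+1] + 2*img[y][x+1] + img[y+1][x+1]
                      - img[y-1][x-1] - 2*img[y][x-1] - img[y+1][x-1])
                gy = (img[y+1][x-1] + 2*img[y+1][x] + img[y+1][x+1]
                      - img[y-1][x-1] - 2*img[y-1][x] - img[y-1][x+1])
                out[y][x] = abs(gx) + abs(gy)
        return out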

    Advanced analytics through FPGA based query processing and deep reinforcement learning

    Today, vast streams of structured and unstructured data are being incorporated into databases, and analytical processes are applied to discover patterns, correlations, trends, and other useful relationships that inform a broad range of decision-making processes. The amount of generated data has grown very large over the years, and conventional database processing methods from previous generations are no longer sufficient to provide satisfactory results in terms of analytics performance and prediction accuracy. Thus, new methods are needed in a wide array of fields, from computer architecture, storage systems, and network design to statistics and physics. This thesis proposes two methods to address the current challenges and meet the future demands of advanced analytics. First, we present AxleDB, a Field Programmable Gate Array based query processing system which constitutes the frontend of an advanced analytics system. AxleDB melds highly efficient accelerators with memory and storage, and provides a unified programmable environment. AxleDB is capable of offloading complex Structured Query Language queries from the host CPU. Experiments have shown that, running a set of TPC-H queries, AxleDB can perform full queries between 1.8x and 34.2x faster, and 2.8x to 62.1x more energy efficiently, than MonetDB and PostgreSQL on a single workstation node. Second, we introduce TauRieL, a novel deep reinforcement learning (DRL) based method for combinatorial problems. The idea behind combining DRL and combinatorial problems is to apply the prediction capabilities of deep reinforcement learning and to use the universality of combinatorial optimization problems to explore general-purpose predictive methods. TauRieL utilizes an actor-critic inspired DRL architecture that adopts ordinary feedforward nets. Furthermore, TauRieL performs online training, which unifies training and state-space exploration. The experiments show that TauRieL can generate solutions two orders of magnitude faster than, and within 3% of the accuracy of, the state-of-the-art DRL approach to the Traveling Salesman Problem while searching for the shortest tour. We also show that TauRieL can be adapted to the Knapsack combinatorial problem: with a minimal problem-specific modification, it can outperform a Knapsack-specific greedy heuristic.
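    As a rough illustration of the training loop described above (not TauRieL's actual architecture, which uses feedforward actor and critic nets), the following toy REINFORCE-with-baseline sketch samples tours, scores them by negative length, and nudges a tabular edge-preference policy toward shorter tours while a running-average baseline stands in for the critic:

    import math, random

    random.seed(0)
    cities = [(random.random(), random.random()) for _ in range(8)]
    prefs = [[0.0] * len(cities) for _ in cities]   # tabular "actor": edge preferences

    def sample_tour():
        tour, current = [0], 0
        remaining = set(range(1, len(cities)))
        decisions = []                              # (state, options, probs, choice)
        while remaining:
            opts = sorted(remaining)
            ws = [math.exp(prefs[current][j]) for j in opts]
            total = sum(ws)
            probs = [w / total for w in ws]
            nxt = random.choices(opts, ws)[0]
            decisions.append((current, opts, probs, nxt))
            tour.append(nxt)
            remaining.remove(nxt)
            current = nxt
        return tour, decisions

    def tour_length(tour):
        return sum(math.dist(cities[a], cities[b])
                   for a, b in zip(tour, tour[1:] + tour[:1]))

    baseline, lr = None, 0.1
    for step in range(2000):                        # online: explore and train together
        tour, decisions = sample_tour()
        reward = -tour_length(tour)
        baseline = reward if baseline is None else 0.9 * baseline + 0.1 * reward
        advantage = reward - baseline
        for cur, opts, probs, choice in decisions:  # softmax policy gradient
            for j, p in zip(opts, probs):
                prefs[cur][j] += lr * advantage * ((j == choice) - p)

    print("tour length after training:", tour_length(sample_tour()[0]))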

    FPGA acceleration of sequence analysis tools in bioinformatics

    Thesis (Ph.D.)--Boston University. With advances in biotechnology and computing power, biological data are being produced at an exceptional rate. The purpose of this study is to analyze the application of FPGAs to accelerate high-impact production biosequence analysis tools. Compared with the alternatives, FPGAs offer huge compute power, lower power consumption, and reasonable flexibility. BLAST has become the de facto standard in bioinformatics approximate string matching, so its acceleration is of fundamental importance. It is a complex, highly optimized system, consisting of tens of thousands of lines of code and a large number of heuristics. Our idea is to emulate the main phases of its algorithm on the FPGA. Using our FPGA engine, we quickly reduce the size of the database to a small fraction, and then use the original code to process the query. On a standard FPGA-based system, we achieved a 12x speedup over a highly optimized multithreaded reference code. Multiple Sequence Alignment (MSA)--the extension of pairwise sequence alignment to multiple sequences--is critical to solving many biological problems. Previous attempts to accelerate Clustal-W, the most commonly used MSA code, have directly mapped a portion of the code to the FPGA. We use a new approach: we apply prefiltering of the kind commonly used in BLAST to perform the initial all-pairs alignments. This results in a speedup of 80x to 190x over the CPU code (8 cores). The quality is comparable to the original according to a commonly used benchmark suite evaluated with respect to multiple distance metrics. The challenge in FPGA-based acceleration is finding a suitable application mapping. Unfortunately, many software heuristics do not map well to hardware, so other methods must be applied. One is restructuring: an entirely new algorithm is applied. Another is to analyze application utilization and develop accuracy/performance trade-offs. Using our prefiltering approach and novel FPGA programming models, we have achieved significant speedups over the reference programs. We have applied approximation, seeding, and filtering to this end. The bulk of this study introduces the pros and cons of these acceleration models for biosequence analysis tools.
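    The prefiltering idea is simple to state: index the database by short k-mer seeds, keep only the sequences that share a seed with the query, and hand that small candidate set to the unmodified software. A minimal software sketch of such a seed-based filter (parameters and names are illustrative; in the actual system the matching runs on the FPGA engine):

    from collections import defaultdict

    def kmers(seq, k):
        return {seq[i:i + k] for i in range(len(seq) - k + 1)}

    def prefilter(database, query, k=3):
        # Build a seed index over the database, then collect every
        # sequence sharing at least one k-mer with the query.
        index = defaultdict(list)
        for name, seq in database.items():
            for mer in kmers(seq, k):
                index[mer].append(name)
        hits = set()
        for mer in kmers(query, k):
            hits.update(index[mer])
        return hits  # candidates passed to the full alignment stage

    db = {"s1": "ACGTACGGA", "s2": "TTTTTTTT", "s3": "GGACGTAA"}
    print(prefilter(db, "ACGTA"))  # s2 shares no 3-mer with the query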

    A Garbled Circuit Accelerator for Arbitrary, Fast Privacy-Preserving Computation

    Privacy and security have rapidly emerged as priorities in system design. One powerful solution for providing both is privacy-preserving computation (PPC), where functions are computed directly on encrypted data and control can be provided over how data is used. Garbled circuits (GCs) are a PPC technology that provides both confidential computing and control over how data is used. The challenge is that they incur significant performance overheads compared to plaintext computation. This paper proposes a novel garbled circuit accelerator and compiler, named HAAC, to mitigate these overheads and make privacy-preserving computation more practical. HAAC is a hardware-software co-design. GCs are exemplars of co-design, as programs are completely known at compile time, i.e., all dependences, memory accesses, and control flow are fixed. The design philosophy of HAAC is to keep the hardware simple and efficient, maximizing the area devoted to our proposed custom execution units and other circuits essential for high performance (e.g., on-chip storage). The compiler can leverage its program understanding to realize the hardware's performance potential by generating effective instruction schedules and data layouts and by orchestrating off-chip events. In taking this approach we can achieve ASIC performance/efficiency without sacrificing generality. Insights of our approach include how co-design enables expressing arbitrary GC programs as streams, which simplifies the hardware and enables complete memory-compute decoupling, and the development of a scratchpad that captures data reuse by tracking program execution, eliminating the need for costly hardware-managed caches and tagging logic. We evaluate HAAC with VIP-Bench and achieve a speedup of 608x in 4.3 mm² of area.
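    To ground what a GC program's "gates" actually are: each wire carries one of two random labels (one per bit value), and a gate is a small encrypted truth table mapping input labels to the correct output label. A minimal classroom-style sketch of garbling a single AND gate (omitting point-and-permute, free-XOR, and the other optimizations real GC systems use):

    import os, hashlib, itertools

    def H(a, b):
        return hashlib.sha256(a + b).digest()

    def xor(x, y):
        return bytes(p ^ q for p, q in zip(x, y))

    def garble_and_gate():
        # Two random 32-byte labels per wire: index 0 encodes bit 0, index 1 bit 1.
        labels = {w: (os.urandom(32), os.urandom(32)) for w in "abc"}
        table = []
        for x, y in itertools.product((0, 1), repeat=2):
            key = H(labels["a"][x], labels["b"][y])
            table.append(xor(key, labels["c"][x & y]))  # encrypt the output label
        return labels, table

    def evaluate(table, label_a, label_b):
        # The evaluator holds one label per input wire and decrypts every row;
        # real schemes use permute bits to select the right row directly.
        key = H(label_a, label_b)
        return [xor(key, row) for row in table]

    labels, table = garble_and_gate()
    candidates = evaluate(table, labels["a"][1], labels["b"][1])
    assert labels["c"][1] in candidates                 # AND(1, 1) -> label for 1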