
    High level synthesis of RDF queries for graph analytics

    In this paper we present a set of techniques that enable the synthesis of efficient custom accelerators for memory-intensive, irregular applications. To address the challenges of irregular applications (large memory footprint, unpredictable fine-grained data accesses, and high synchronization intensity) and to exploit their opportunities (thread-level parallelism, memory-level parallelism), we propose a novel accelerator design that employs an adaptive, Distributed Controller (DC) architecture and a Memory Interface Controller (MIC) that supports concurrent and atomic memory operations on a multi-ported/multi-banked shared memory. Among the many algorithms that may benefit from our solution, we focus on the acceleration of graph analytics applications and, in particular, on the synthesis of SPARQL queries on Resource Description Framework (RDF) databases. We achieve this objective by incorporating the synthesis techniques into Bambu, an open-source high-level synthesis tool, and interfacing it with GEMS, the Graph database Engine for Multithreaded Systems. The GEMS front-end generates optimized C implementations of the input queries, modeled as graph pattern matching algorithms, which are then automatically synthesized by Bambu. We validate our approach by synthesizing several SPARQL queries from the Lehigh University Benchmark (LUBM).
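To make the phrase "queries modeled as graph pattern matching" concrete, the following is a minimal sketch of matching a basic graph pattern against a small set of RDF-style triples. It is not the code GEMS generates (the abstract says GEMS emits C, which Bambu then synthesizes); the function name `match_pattern`, the `?`-prefixed variable convention, and the sample triples are all assumptions made for illustration.

```python
def match_pattern(triples, patterns):
    """Return every variable binding that satisfies all triple patterns.

    Illustrative only: terms starting with "?" are treated as variables,
    all other terms must match the triple exactly.
    """
    solutions = [{}]  # start with one empty binding
    for pattern in patterns:
        next_solutions = []
        for binding in solutions:
            for triple in triples:
                extended = dict(binding)
                matched = True
                for term, value in zip(pattern, triple):
                    if term.startswith("?"):
                        # Bind the variable, or check an earlier binding.
                        if extended.setdefault(term, value) != value:
                            matched = False
                            break
                    elif term != value:
                        matched = False
                        break
                if matched:
                    next_solutions.append(extended)
        solutions = next_solutions
    return solutions


# Hypothetical data loosely in the spirit of a university benchmark.
TRIPLES = [
    ("alice", "type", "GraduateStudent"),
    ("alice", "takesCourse", "course1"),
    ("bob", "type", "GraduateStudent"),
    ("bob", "takesCourse", "course2"),
    ("carol", "type", "Professor"),
    ("carol", "takesCourse", "course1"),
]

# "Which graduate students take course1?"
QUERY = [
    ("?x", "type", "GraduateStudent"),
    ("?x", "takesCourse", "course1"),
]

result = match_pattern(TRIPLES, QUERY)
```

Each candidate binding is checked against the triples independently of the others, and every check reads the triple store at data-dependent locations. That is the kind of thread-level parallelism and irregular memory access the abstract refers to.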

    A performance, energy consumption and reliability evaluation of workload distribution on heterogeneous devices

    The constant need for higher performance and reduced power consumption has led vendors to design heterogeneous devices that embed a traditional Central Processing Unit (CPU) and an accelerator, such as a Graphics Processing Unit (GPU) or Field-Programmable Gate Array (FPGA). When the CPU and the accelerator are used collaboratively, the device's computational performance reaches its peak. However, the larger amount of resources employed for computation has, potentially, the side effect of increasing the soft error rate. This thesis evaluates the reliability behaviour of AMD Kaveri Accelerated Processing Units (APUs) executing four heterogeneous applications, each one representing an algorithm class. The workload is gradually distributed from the CPU to the GPU, and both the energy consumption and the execution time are measured. Then, an accelerated neutron beam is used to measure the realistic error rates of the different workload distributions. Finally, we evaluate which configuration provides the lowest error rate or allows the computation of the largest amount of data before experiencing a failure. As is shown in this thesis, energy consumption and execution time follow the same trend, while error rates depend heavily on algorithm class and workload distribution. Additionally, we show that, in most cases, the most reliable workload distribution is the one that delivers the highest performance. As experimentally proven, by choosing the correct workload distribution the device reliability can increase by up to 90x.

    Object-oriented domain specific compilers for programming FPGAs


    Parallelization of dynamic programming recurrences in computational biology

    The rapid growth of biosequence databases over the last decade has led to a performance bottleneck in the applications analyzing them. In particular, over the last five years the DNA sequencing capacity of next-generation sequencers has been doubling every six months as costs have plummeted. The data produced by these sequencers is overwhelming traditional compute systems. We believe that in the future compute performance, not sequencing, will become the bottleneck in advancing genome science. In this work, we investigate novel computing platforms to accelerate dynamic programming algorithms, which are popular in bioinformatics workloads. We study algorithm-specific hardware architectures that exploit fine-grained parallelism in dynamic programming kernels using field-programmable gate arrays (FPGAs). We advocate a high-level synthesis approach, using the recurrence equation abstraction to represent dynamic programming and polyhedral analysis to exploit parallelism. We suggest a novel technique within the polyhedral model to optimize for throughput by pipelining independent computations on an array. This design technique improves on the state of the art, which builds latency-optimal arrays. We also suggest a method to dynamically switch between a family of designs using FPGA reconfiguration to achieve a significant performance boost. We have used polyhedral methods to parallelize the Nussinov RNA folding algorithm to build a family of accelerators that can trade resources for parallelism and are between 15x and 130x faster than a modern dual-core CPU implementation. A Zuker RNA folding accelerator we built on a single workstation with four Xilinx Virtex 4 FPGAs outperforms 198 Intel Core 2 Duo processors running at 3 GHz. Furthermore, our design running on a single FPGA is an order of magnitude faster than competing implementations on similar-generation FPGAs and graphics processors. Our work is a step toward the goal of automated synthesis of hardware accelerators for dynamic programming algorithms.
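As an illustration of the kind of dynamic programming recurrence the abstract refers to, here is a minimal software sketch of the textbook Nussinov base-pair maximization. It is not the authors' accelerator design. The set of allowed pairs (Watson-Crick plus G-U wobble), the absence of a minimum loop length, and the function name `nussinov` are assumptions made for this example.

```python
# Assumed pairing rule: Watson-Crick pairs plus the G-U wobble pair.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def nussinov(seq):
    """Return the maximum number of nested base pairs in an RNA sequence."""
    n = len(seq)
    if n == 0:
        return 0
    # table[i][j] holds the best score for the subsequence seq[i..j].
    table = [[0] * n for _ in range(n)]
    # Fill by increasing span: a cell of span d only reads cells of
    # smaller span, so all cells with the same span are independent.
    for span in range(1, n):
        for i in range(n - span):
            j = i + span
            best = max(table[i + 1][j], table[i][j - 1])  # i or j unpaired
            if (seq[i], seq[j]) in PAIRS:
                inner = table[i + 1][j - 1] if i + 1 <= j - 1 else 0
                best = max(best, inner + 1)  # i pairs with j
            for k in range(i + 1, j):
                # Split into two independent substructures.
                best = max(best, table[i][k] + table[k + 1][j])
            table[i][j] = best
    return table[0][n - 1]
```

Because every cell on a given anti-diagonal depends only on cells of shorter span, those cells can be computed concurrently. This dependence structure is what makes the recurrence a natural target for parallel array designs such as those described above.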