8 research outputs found

    A unified approach for managing heterogeneous processing elements on FPGAs

    FPGA designs do not typically include all available processing elements, e.g., LUTs, DSPs and embedded cores. Additional work is required to manage their different implementations and behaviour, which can unbalance parallel pipelines and complicate development. In this paper, we introduce a novel management architecture that unifies heterogeneous processing elements into compute pools. A pool formed of E processing elements, each implementing the same function, serves D parallel function calls. A call-and-response approach to computation allows for different processing element implementations, connections, latencies and non-deterministic behaviour. Our rotating scheduler automatically arbitrates access to processing elements, uses greatly simplified routing, and scales linearly with D parallel accesses to the compute pool. Processing elements can easily be added to improve performance, or removed to reduce resource use and routing, facilitating higher operating frequencies. Migrating to larger or smaller FPGAs thus comes at a known performance cost. We assess our framework with a range of neural network activation functions (ReLU, LReLU, ELU, GELU, sigmoid, swish, softplus and tanh) on the Xilinx Alveo U280.
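The abstract describes a rotating scheduler that arbitrates D parallel calls across a pool of E equivalent processing elements. The paper's RTL is not given here; as a minimal illustrative sketch (in Python, purely behavioural), a rotating assignment can be modelled as round-robin allocation of calls to pool slots, where `rotate_schedule` and its arguments are hypothetical names, not the paper's API:

```python
def rotate_schedule(num_pes, calls):
    """Behavioural model of a rotating scheduler: each incoming call
    is assigned to the next processing element in rotation, so access
    to the pool is arbitrated without per-call routing decisions.
    Returns a list of (call, pe_index) pairs."""
    assignments = []
    pe = 0
    for call in calls:
        assignments.append((call, pe))
        pe = (pe + 1) % num_pes  # rotate to the next PE in the pool
    return assignments
```

Because the assignment pattern is fixed and periodic, adding or removing processing elements only changes the rotation period, which is consistent with the abstract's claim that resizing the pool has a known performance cost.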

    Hardware acceleration in embedded systems for machine learning using KNN on FPGA

    Machine learning has become an essential tool for any decision-making system. Due to the performance limitations imposed by traditional architectures based on Central Processing Units (CPUs), acceleration methods using Graphical Processing Units (GPUs) and Application-Specific Integrated Circuits (ASICs) have been employed for more critical applications. However, when applied to embedded systems, these approaches face limitations related to physical size and complexity. To address these problems, Field Programmable Gate Array (FPGA) technology has shown promise owing to its high efficiency, true parallelism, reconfigurability and flexibility. Accordingly, this study aims, in addition to providing an in-depth review of the literature, to present architectures designed on FPGA that seek to minimize such limitations, maximizing efficiency without significant performance loss, so as to enable their use in embedded systems. Results show performance gains above 95% when using specialized hardware developed on FPGA running the K-Nearest Neighbor (KNN) machine learning algorithm.
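For reference, the KNN algorithm that the abstract's FPGA architecture accelerates is straightforward: classify a query point by majority vote among its k nearest training points. A minimal software sketch (Python, Euclidean distance; the function name and data layout are illustrative, not taken from the paper):

```python
import math
from collections import Counter

def knn_predict(train, labels, query, k=3):
    """Classify `query` by majority vote among its k nearest
    training points, using Euclidean distance."""
    # Sort all training points by distance to the query.
    dists = sorted(
        (math.dist(x, query), y) for x, y in zip(train, labels)
    )
    # Majority vote over the k closest labels.
    top = [y for _, y in dists[:k]]
    return Counter(top).most_common(1)[0][0]
```

The distance computations are independent per training point, which is exactly the data parallelism an FPGA implementation can exploit with dedicated distance units.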

    GME: GPU-based Microarchitectural Extensions to Accelerate Homomorphic Encryption

    Fully Homomorphic Encryption (FHE) enables the processing of encrypted data without decrypting it. FHE has garnered significant attention over the past decade as it supports secure outsourcing of data processing to remote cloud services. Despite its promise of strong data privacy and security guarantees, FHE introduces a slowdown of up to five orders of magnitude compared to the same computation on plaintext data. This overhead is presently a major barrier to the commercial adoption of FHE. While prior efforts recommend moving to custom accelerators to accelerate FHE computing, these solutions lack cost-effectiveness and scalability. In this work, we leverage GPUs to accelerate FHE, capitalizing on a well-established GPU ecosystem that is available in the cloud. We propose GME, which combines three key microarchitectural extensions along with a compile-time optimization to the current AMD CDNA GPU architecture. First, GME integrates a lightweight on-chip compute unit (CU)-side hierarchical interconnect to retain ciphertext in cache across FHE kernels, thus eliminating redundant memory transactions and improving performance. Second, to tackle compute bottlenecks, GME introduces special MOD-units that provide native custom hardware support for modular reduction operations, one of the most commonly executed sets of operations in FHE. Third, by integrating the MOD-unit with our novel pipelined 64-bit integer arithmetic cores (WMAC-units), GME further accelerates FHE workloads by 19%. Finally, we propose a Locality-Aware Block Scheduler (LABS) that improves FHE workload performance by exploiting the temporal locality available in FHE primitive blocks. Incorporating these microarchitectural features and compiler optimizations, we create a synergistic approach achieving average speedups of 796×, 14.2×, and 2.3× over Intel Xeon CPU, NVIDIA V100 GPU, and Xilinx FPGA implementations, respectively.
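The abstract's MOD-units provide hardware support for modular reduction, the operation at the heart of FHE arithmetic. The abstract does not specify which reduction algorithm the MOD-units implement; one common division-free technique that hardware reduction units frequently build on is Barrett reduction, sketched here in Python for illustration only:

```python
def barrett_setup(n):
    """Precompute the Barrett constant m = floor(2^(2k) / n),
    where k is the bit length of the modulus n."""
    k = n.bit_length()
    return k, (1 << (2 * k)) // n

def barrett_reduce(x, n, k, m):
    """Compute x mod n for 0 <= x < n*n using only multiplies,
    shifts and subtractions (no division), as hardware prefers."""
    q = (x * m) >> (2 * k)   # approximate quotient floor(x / n)
    r = x - q * n
    while r >= n:            # at most two corrective subtractions
        r -= n
    return r
```

Since the quotient estimate `q` undershoots `floor(x / n)` by at most two, the correction loop is bounded, which is what makes the technique attractive for a fixed-latency pipelined unit.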

    A Survey on Coarse-Grained Reconfigurable Architectures From a Performance Perspective

    No full text