150 research outputs found

    An efficient sparse conjugate gradient solver using a Beneš permutation network

    Get PDF
    © 2014 Technical University of Munich (TUM).The conjugate gradient (CG) is one of the most widely used iterative methods for solving systems of linear equations. However, parallelizing CG for large sparse systems is difficult due to the inherent irregularity in memory access pattern. We propose a novel processor architecture for the sparse conjugate gradient method. The architecture consists of multiple processing elements and memory banks, and is able to compute efficiently both sparse matrix-vector multiplication, and other dense vector operations. A Beneš permutation network with an optimised control scheme is introduced to reduce memory bank conflicts without expensive logic. We describe a heuristics for offline scheduling, the effect of which is captured in a parametric model for estimating the performance of designs generated from our approach

    Experimental survey of FPGA-based monolithic switches and a novel queue balancer

    Get PDF
    This paper studies small to medium-sized monolithic switches for FPGA implementation and presents a novel switch design that achieves high algorithmic performance and FPGA implementation efficiency. Crossbar switches based on virtual output queues (VOQs) and variations have been rather popular for implementing switches on FPGAs, with applications in network switches, memory interconnects, network-on-chip (NoC) routers etc. The implementation efficiency of crossbar-based switches is well-documented on ASICs, though we show that their disadvantages can outweigh their advantages on FPGAs. One of the most important challenges in such input-queued switches is the requirement for iterative scheduling algorithms. In contrast to ASICs, this is more harmful on FPGAs, as the reduced operating frequency and narrower packets cannot “hide” multiple iterations of scheduling that are required to achieve a modest scheduling performance.Our proposed design uses an output-queued switch internally for simplifying scheduling, and a queue balancing technique to avoid queue fragmentation and reduce the need for memory-sharing VOQs. Its implementation approaches the scheduling performance of a state-of-the-art FPGA-based switch, while requiring considerably fewer resources

    A unified approach for managing heterogeneous processing elements on FPGAs

    Get PDF
    FPGA designs do not typically include all available processing elements, e.g., LUTs, DSPs and embedded cores. Additional work is required to manage their different implementations and behaviour, which can unbalance parallel pipelines and complicate development. In this paper we introduce a novel management architecture to unify heterogeneous processing elements into compute pools. A pool formed of E processing elements, each implementing the same function, serves D parallel function calls. A call-and-response approach to computation allows for different processing element implementations, connections, latencies and non-deterministic behaviour. Our rotating scheduler automatically arbitrates access to processing elements, uses greatly simplified routing, and scales linearly with D parallel accesses to the compute pool. Processing elements can easily be added to improve performance, or removed to reduce resource use and routing, facilitating higher operating frequencies. Migrating to larger or smaller FPGAs thus comes at a known performance cost. We assess our framework with a range of neural network activation functions (ReLU, LReLU, ELU, GELU, sigmoid, swish, softplus and tanh) on the Xilinx Alveo U280

    Advanced analytics through FPGA based query processing and deep reinforcement learning

    Get PDF
    Today, vast streams of structured and unstructured data have been incorporated in databases, and analytical processes are applied to discover patterns, correlations, trends and other useful relationships that help to take part in a broad range of decision-making processes. The amount of generated data has grown very large over the years, and conventional database processing methods from previous generations have not been sufficient to provide satisfactory results regarding analytics performance and prediction accuracy metrics. Thus, new methods are needed in a wide array of fields from computer architectures, storage systems, network design to statistics and physics. This thesis proposes two methods to address the current challenges and meet the future demands of advanced analytics. First, we present AxleDB, a Field Programmable Gate Array based query processing system which constitutes the frontend of an advanced analytics system. AxleDB melds highly-efficient accelerators with memory, storage and provides a unified programmable environment. AxleDB is capable of offloading complex Structured Query Language queries from host CPU. The experiments have shown that running a set of TPC-H queries, AxleDB can perform full queries between 1.8x and 34.2x faster and 2.8x to 62.1x more energy efficient compared to MonetDB, and PostgreSQL on a single workstation node. Second, we introduce TauRieL, a novel deep reinforcement learning (DRL) based method for combinatorial problems. The design idea behind combining DRL and combinatorial problems is to apply the prediction capabilities of deep reinforcement learning and to use the universality of combinatorial optimization problems to explore general purpose predictive methods. TauRieL utilizes an actor-critic inspired DRL architecture that adopts ordinary feedforward nets. Furthermore, TauRieL performs online training which unifies training and state space exploration. The experiments show that TauRieL can generate solutions two orders of magnitude faster and performs within 3% of accuracy compared to the state-of-the-art DRL on the Traveling Salesman Problem while searching for the shortest tour. Also, we present that TauRieL can be adapted to the Knapsack combinatorial problem. With a very minimal problem specific modification, TauRieL can outperform a Knapsack specific greedy heuristics.Hoy en día, se han incorporado grandes cantidades de datos estructurados y no estructurados en las bases de datos, y se les aplican procesos analíticos para descubrir patrones, correlaciones, tendencias y otras relaciones útiles que se utilizan mayormente para la toma de decisiones. La cantidad de datos generados ha crecido enormemente a lo largo de los años, y los métodos de procesamiento de bases de datos convencionales utilizados en las generaciones anteriores no son suficientes para proporcionar resultados satisfactorios respecto al rendimiento del análisis y respecto de la precisión de las predicciones. Por lo tanto, se necesitan nuevos métodos en una amplia gama de campos, desde arquitecturas de computadoras, sistemas de almacenamiento, diseño de redes hasta estadísticas y física. Esta tesis propone dos métodos para abordar los desafíos actuales y satisfacer las demandas futuras de análisis avanzado. Primero, presentamos AxleDB, un sistema de procesamiento de consultas basado en FPGAs (Field Programmable Gate Array) que constituye la interfaz de un sistema de análisis avanzado. AxleDB combina aceleradores altamente eficientes con memoria, almacenamiento y proporciona un entorno programable unificado. AxleDB es capaz de descargar consultas complejas de lenguaje de consulta estructurado desde la CPU del host. Los experimentos han demostrado que al ejecutar un conjunto de consultas TPC-H, AxleDB puede realizar consultas completas entre 1.8x y 34.2x más rápido y 2.8x a 62.1x más eficiente energéticamente que MonetDB, y PostgreSQL en un solo nodo de una estación de trabajo. En segundo lugar, presentamos TauRieL, un nuevo método basado en Deep Reinforcement Learning (DRL) para problemas combinatorios. La idea central que está detrás de la combinación de DRL y problemas combinatorios, es aplicar las capacidades de predicción del aprendizaje de refuerzo profundo y el uso de la universalidad de los problemas de optimización combinatoria para explorar métodos predictivos de propósito general. TauRieL utiliza una arquitectura DRL inspirada en el actor-crítico que se adapta a redes feedforward. Además, TauRieL realiza el entrenamieton en línea que unifica el entrenamiento y la exploración espacial de los estados. Los experimentos muestran que TauRieL puede generar soluciones dos órdenes de magnitud más rápido y funciona con un 3% de precisión en comparación con el estado del arte en DRL aplicado al problema del viajante mientras busca el recorrido más corto. Además, presentamos que TauRieL puede adaptarse al problema de la Mochila. Con una modificación específica muy mínima del problema, TauRieL puede superar a una heurística codiciosa de Knapsack Problem.Postprint (published version

    Enhancing an embedded processor core for efficient and isolated execution of cryptographic algorithms

    Get PDF
    We propose enhancing a reconfigurable and extensible embedded RISC processor core with a protected zone for isolated execution of cryptographic algorithms. The protected zone is a collection of processor subsystems such as functional units optimized for high-speed execution of integer operations, a small amount of local memory for storing sensitive data during cryptographic computations, and special-purpose and cryptographic registers to execute instructions securely. We outline the principles for secure software implementations of cryptographic algorithms in a processor equipped with the proposed protected zone. We demonstrate the efficiency and effectiveness of our proposed zone by implementing the most-commonly used cryptographic algorithms in the protected zone; namely RSA, elliptic curve cryptography, pairing-based cryptography, AES block cipher, and SHA-1 and SHA-256 cryptographic hash functions. In terms of time efficiency, our software implementations of cryptographic algorithms running on the enhanced core compare favorably with equivalent software implementations on similar processors reported in the literature. The protected zone is designed in such a modular fashion that it can easily be integrated into any RISC processor. The proposed enhancements for the protected zone are realized on an FPGA device. The implementation results on the FPGA confirm that its area overhead is relatively moderate in the sense that it can be used in many embedded processors. Finally, the protected zone is useful against cold-boot and micro-architectural side-channel attacks such as cache-based and branch prediction attacks

    Convey HC-1 Hybrid Core Computer - The Potential of FPGAs in Numerical Simulation

    Get PDF

    Cache side-channel attacks and time-predictability in high-performance critical real-time systems

    Get PDF
    Embedded computers control an increasing number of systems directly interacting with humans, while also manage more and more personal or sensitive information. As a result, both safety and security are becoming ubiquitous requirements in embedded computers, and automotive is not an exception to that. In this paper we analyze time-predictability (as an example of safety concern) and side-channel attacks (as an example of security issue) in cache memories. While injecting randomization in cache timing-behavior addresses each of those concerns separately, we show that randomization solutions for time-predictability do not protect against side-channel attacks and vice-versa. We then propose a randomization solution to achieve both safety and security goals.This work has been partially funded by the Spanish Ministry of Science and Innovation under grant TIN2015-65316-P. Jaume Abella has been partially supported by the Ministry of Economy and Competitiveness under Ramon y Cajal fellowship number RYC-2013-14717. Authors want to thank Boris Kpf for his technical comments in early versions of this work.Peer ReviewedPostprint (published version

    Universal FFT core generator

    Get PDF
    This thesis presents a special purpose processor for multi-dimensional discrete Fourier transform (DFT) computation. The processor, called a Universal fast Fourier transform (FFT) processor, is capable of computing an arbitrary multi-dimensional DFT with a fixed number of input points. The processor can be configured to compute different dimensional DFTs simply by rearranging the input data and using a different set of constants, called generalized twiddle factors, throughout the computation. The basic computational process remains the same independent of dimension. The processor does not rely on one-dimensional FFT computation as does the standard approach using the row-column algorithm, but instead utilizes an algorithm called the Dimensionless FFT, whose computation data flow is the same as the FFT algorithm due to Pease.Since the computation performed by the Universal FFT processor is the same as the Pease FFT, an implementation can be obtained by modifying a one-dimensional FFT processor based on the Pease algorithm. The implementation presented in this thesis is obtained in this way from the core generator by Nordin, Milder, Hoe, and P¨uschel. The core generator from Nordin, Milder, Hoe, and P¨uschel generates a variety of cores, with various space and time tradeoffs for implementation on field programmable gate arrays (FPGA). The performance obtained is comparable to vendor supplied cores but offers much greater flexibility of design choices with different area and throughput tradeoffs. The resulting Universal FFT core generator can be used to instantiate multi-dimensional FPGA cores with approximately the same performance and area obtained in the one-dimensional case by Nordin, Milder, Hoe, and P¨uschel.M.S., Computer Science -- Drexel University, 200

    Nouveaux transmetteurs/récepteurs pour les systèmes sans fil MIMO-OFDM : de l'idée à la mise en oeuvre

    Get PDF
    corecore