
    Localidad estructural, criterio de división para la ejecución de redes de Petri no autónomas en IP-Core

    This paper proposes a new concept, structural locality, for partitioning and representing non-autonomous Petri nets with the aim of significantly reducing the hardware resources of the IP-cores that execute them, with the consequent advantage of being able to address larger problems. Nets expressed this way give rise to an execution algorithm that preserves the original model and facilitates parallelism. The paper presents an application case that demonstrates the advantages of structural locality applied to a Petri net with different temporal semantics and arc types, achieving a significant reduction in the resources of the FPGA that implements the IP-core.
    X Workshop Arquitectura, Redes y Sistemas Operativos (WARSO); Red de Universidades con Carreras en Informática (RedUNCI).
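    A minimal sketch of the structural-locality idea, assuming a plain place/transition net: the Pre and Post matrices are split into "localities" (groups of transitions plus the places they touch), so the hardware for one locality never needs to store the full matrices. The partitioning criterion, matrix values, and function names below are illustrative assumptions, not the paper's actual algorithm.

        # Hypothetical sketch: partition a Petri net into localities so each
        # slice only stores the places and transitions it touches.
        import numpy as np

        # Pre/Post matrices (places x transitions) of a 4-place cycle net.
        pre = np.array([[1, 0, 0, 0],
                        [0, 1, 0, 0],
                        [0, 0, 1, 0],
                        [0, 0, 0, 1]])
        post = np.array([[0, 0, 0, 1],
                         [1, 0, 0, 0],
                         [0, 1, 0, 0],
                         [0, 0, 1, 0]])
        marking = np.array([1, 0, 0, 0])

        def locality(transitions):
            """Places read or written by the given transitions."""
            touched = (pre[:, transitions] + post[:, transitions]).any(axis=1)
            return np.where(touched)[0]

        def fire(t, m):
            """Fire transition t if enabled; return the resulting marking."""
            if np.all(m >= pre[:, t]):
                return m - pre[:, t] + post[:, t]
            return m

        # Two localities covering the net; each could map to a smaller IP-core
        # that holds only its own rows/columns of Pre and Post.
        print("locality A places:", locality(np.array([0, 1])))
        print("locality B places:", locality(np.array([2, 3])))
        print("marking after firing t0:", fire(0, marking))

    Executing each locality on its own core only works if places shared between localities are kept consistent; the execution algorithm described in the abstract is what guarantees the divided net behaves like the original.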


    Cooperative high-performance computing with FPGAs - matrix multiply case-study

    In high-performance computing, there is great opportunity for systems that use FPGAs to handle communication while also performing computation on data in transit in an "altruistic" manner, that is, using resources for computation that might otherwise be used for communication, and in a way that improves overall system performance and efficiency. We provide a specific definition of Computing in the Network that captures this opportunity. We then outline some overall requirements and guidelines for cooperative computing that include this ability, and make suggestions for specific computing capabilities to be added to the networking hardware in a system. We then explore some algorithms running on a network so equipped for a few specific computing tasks: dense matrix multiplication, sparse matrix transposition, and sparse matrix multiplication. For the first of these, we give limits on problem size and estimates of the performance that should be attainable with present-day FPGA hardware.
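    To make "compute on data in transit" concrete, here is a minimal sketch of a block matrix multiply on a ring: each node holds one block-row of A, block-rows of B circulate hop by hop, and every node folds the passing block into its portion of C. The ring scheme, block sizes, and names are assumptions for illustration, not the algorithm from the paper.

        # Sketch: blocks of B circulate around a ring of "nodes" while each
        # node holds one block-row of A and accumulates its block-row of C at
        # every hop, i.e. computation rides along with communication.
        import numpy as np

        n_nodes, blk = 4, 2
        rng = np.random.default_rng(0)
        A = rng.standard_normal((n_nodes * blk, n_nodes * blk))
        B = rng.standard_normal((n_nodes * blk, n_nodes * blk))

        a_rows = [A[i * blk:(i + 1) * blk, :] for i in range(n_nodes)]
        b_rows = [B[i * blk:(i + 1) * blk, :] for i in range(n_nodes)]
        c_rows = [np.zeros((blk, B.shape[1])) for _ in range(n_nodes)]

        owner = list(range(n_nodes))  # which block-row of B sits at each node
        for _ in range(n_nodes):
            for i in range(n_nodes):
                k = owner[i]  # the B block currently passing through node i
                c_rows[i] += a_rows[i][:, k * blk:(k + 1) * blk] @ b_rows[k]
            owner = owner[-1:] + owner[:-1]  # shift one hop around the ring

        assert np.allclose(np.vstack(c_rows), A @ B)
        print("ring-shifted block multiply matches A @ B")

    The point of the model is that the multiply-accumulate work happens at the same place the data is being forwarded, so communication resources double as compute resources.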

    Computing SpMV on FPGAs

    There are hundreds of papers on accelerating sparse matrix-vector multiplication (SpMV); however, only a handful target FPGAs. Some claim that FPGAs inherently perform worse than CPUs and GPUs, and for some applications, such as matrix-matrix multiplication and matrix-vector multiplication, they do: CPUs and GPUs have too much memory bandwidth and too much floating-point compute power for FPGAs to compete. However, the low ratio of computation to memory operations and the irregular memory access of SpMV trip up both CPUs and GPUs. We see this as a leveling of the playing field for FPGAs. Our implementation focuses on three pillars: matrix traversal, multiply-accumulator design, and matrix compression. First, most SpMV implementations traverse the matrix in row-major order, but we mix column and row traversal. Second, to accommodate the new traversal, the multiply-accumulator stores many intermediate y values. Third, we compress the matrix to increase the rate at which it can be transferred from RAM to the FPGA. Together these pillars enable our SpMV implementation to perform competitively with CPUs and GPUs.
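    A software analogue of the traversal pillar, as a sketch: walk the matrix in column strips (CSC storage is an assumption here), scattering partial sums into a vector of intermediate y values, which is roughly what the hardware multiply-accumulator keeps in on-chip buffers. The exact mix of column and row order used in the paper is not reproduced.

        # Sketch: strip-wise column traversal of a sparse matrix with many
        # intermediate y values live at once (software stand-in for the
        # hardware accumulator described above).
        import numpy as np
        from scipy.sparse import random as sprand

        A = sprand(8, 8, density=0.3, format="csc", random_state=1)
        x = np.arange(8, dtype=float)

        strip = 4                    # columns per strip (x-reuse window)
        y = np.zeros(A.shape[0])     # one partial accumulator per row
        for c0 in range(0, A.shape[1], strip):
            for j in range(c0, min(c0 + strip, A.shape[1])):
                lo, hi = A.indptr[j], A.indptr[j + 1]
                y[A.indices[lo:hi]] += A.data[lo:hi] * x[j]  # scatter

        assert np.allclose(y, A @ x)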

    Towards Efficient Hardware Acceleration of Deep Neural Networks on FPGA

    Deep neural networks (DNNs) have achieved remarkable success in many applications because of their powerful data-processing capability. Their performance in computer vision has matched, and in some areas even surpassed, human capabilities. Deep neural networks can capture complex nonlinear features; however, this ability comes at the cost of high computational and memory requirements. State-of-the-art networks require billions of arithmetic operations and millions of parameters. The brute-force computing model of DNNs often requires extremely large hardware resources, raising severe concerns about scalability on traditional von Neumann architectures. The well-known memory wall, together with the latency caused by the long-range connectivity and communication of DNNs, severely constrains their computation efficiency. DNN acceleration techniques, whether software or hardware, often suffer from poor hardware execution efficiency of the simplified model (software) or from inevitable accuracy degradation and a limited set of supported algorithms (hardware). To preserve inference accuracy while making the hardware implementation more efficient, a close investigation of hardware/software co-design methodologies for DNNs is needed. The proposed work first presents an FPGA-based implementation framework for Recurrent Neural Network (RNN) acceleration. At the architectural level, we improve the parallelism of the RNN training scheme and reduce the computing resource requirement to enhance computation efficiency; the hardware implementation primarily targets reducing the data communication load. Second, we propose a data-locality-aware sparse matrix-vector multiplication (SpMV) kernel. At the software level, we reorganize a large sparse matrix into many modest-sized blocks using hypergraph-based partitioning and clustering, taking the available hardware constraints into consideration for memory allocation and data-access regularization. Third, we present a holistic acceleration of sparse convolutional neural networks (CNNs). During network training, data locality is regularized to ease the hardware mapping, and the distributed architecture enables high computation parallelism and data reuse. The proposed research results in a hardware/software co-design methodology for fast and accurate DNN acceleration, through innovations in algorithm optimization, hardware implementation, and the interactive design process across these two domains.
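    The dissertation uses hypergraph-based partitioning and clustering to reorganize the matrix; as a stand-in, this sketch shows only the end product of such a step: a large sparse matrix cut into modest-sized blocks so each block touches a bounded window of x and y, which is what regularizes memory access for the accelerator. The block size and names are illustrative assumptions.

        # Sketch: reorganize a sparse matrix into fixed-size 2D blocks; a real
        # hypergraph partitioner would choose the grouping to minimize
        # communication rather than cutting on a fixed grid.
        import numpy as np
        from scipy.sparse import random as sprand

        A = sprand(16, 16, density=0.15, format="csr", random_state=0).tocoo()
        bs = 4  # block size sized to a hypothetical on-chip buffer

        blocks = {}
        for r, c, v in zip(A.row, A.col, A.data):
            blocks.setdefault((r // bs, c // bs), []).append((r % bs, c % bs, v))

        # Each block reads at most bs entries of x and updates at most bs
        # entries of y, so blocks can be streamed through on-chip memory.
        for key in sorted(blocks):
            print(f"block {key}: {len(blocks[key])} nonzeros")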

    Implementation of Ultra-Low Latency and High-Speed Communication Channels for an FPGA-Based HPC Cluster

    An FPGA-based cluster profits from the flexibility and performance potential that FPGA technology provides. Since price and power consumption are becoming increasingly important in the high-performance computing market, the multi-FPGA exploration field grows more popular each year. Network latency has failed to keep up with other improvements in computer performance: complex communication stacks have sacrificed latency and increased overhead to achieve other goals, making them in most cases inefficient or even unnecessary. Existing commercial off-the-shelf communication protocols and network interface controllers for FPGA-to-FPGA interconnection lack the performance to support time-critical applications and tightly coupled System-on-Chip partitioning; instead, custom communication approaches are preferred. In this work, ultra-low-latency and high-speed communication channels for an FPGA-based cluster are presented. The target platform is composed of two BEE3 boards grouping eight Virtex-5 FPGAs interconnected in a ring topology. Our approach exploits multi-gigabit transceiver technology to achieve a reliable 8 Gbps channel bandwidth. The proposed communication IP supports data transfer between thousands of coprocessors over the network, by means of a direct network implementation with hop-by-hop packet routing capability. Experimental results showed a latency of only 34 clock cycles between two neighboring nodes, one of the lowest reported in the literature. In addition, an architecture suitable for high-performance computing is proposed, featuring scalable, parallel, and distributed processing. For an 8-FPGA platform, the architecture provides 35.6 GB/s of off-chip memory throughput, 128 Gbps of aggregate network bandwidth, and 8.9 GFLOPS of computing power. A large, dense matrix-vector solver is partitioned and implemented across the cluster. We achieved competitive performance and computational efficiency as a result of the low communication overhead among the distributed processing elements. This work supports new research in intensive parallel computing and enables partitioning and scaling of large Systems-on-Chip on FPGA-based clusters.
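    A back-of-envelope model of the reported figures, assuming shortest-direction routing on the 8-node ring: only the 34-cycles-per-neighbor-hop number comes from the text; the per-hop cost on multi-hop paths is an assumption.

        # Model: latency between nodes on an 8-node ring at 34 cycles/hop.
        N_NODES = 8
        HOP_CYCLES = 34  # measured neighbor-to-neighbor latency from the text

        def ring_hops(src: int, dst: int, n: int = N_NODES) -> int:
            """Hop count along the shorter direction of the ring."""
            d = (dst - src) % n
            return min(d, n - d)

        for dst in range(1, N_NODES):
            hops = ring_hops(0, dst)
            print(f"node 0 -> node {dst}: {hops} hop(s), ~{hops * HOP_CYCLES} cycles")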