73 research outputs found

    Investigating data throughput and partial dynamic reconfiguration in a commodity FPGA cluster framework

    Get PDF
    There are many computational kernels where parallelism can be exploited in applica- tion specific hardware, yielding significant speedup over a general purpose processor based solution. Commodity cluster computing technologies have been combined with FPGA co- processors, resulting in even greater performance capability through the exploitation of multiple levels of parallelism. One particularly economic solution both in terms of cost and power consumption is to cluster hybrid FPGAs with commodity network intercon- nects. Hybrid FPGAs combine embedded microprocessors with reconfigurable hardware resources on a single chip offering lower power consumption and cost compared to a tra- ditional I/O bus FPGA coprocessor solution. While there is a lot of promise in using com- modity hybrid FPGAs in a cluster configuration, the design flow and performance char- acteristics of such systems are currently a limiting factor to the range of applications that could benefit from such a system. The contribution of this thesis is a framework for clustering commodity FPGAs which integrates high speed DMA data transfers with a flexible FPGA resource sharing scheme enabled through partial reconfiguration. The framework includes an embedded Linux op- erating system, with a custom device driver to manage data transfers and hardware recon- figuration. User space tools for cluster computing including ssh and MPI are deployed allowing tasks to be split among nodes in the cluster. Performance analysis is performed with a homogeneous cluster composed of four Virtex-5 FXT based FPGA boards. The results demonstrate the advantages over previous work in terms of data throughput and reconfiguration, as well as promote future research efforts

    A Dynamically Reconfigurable Parallel Processing Framework with Application to High-Performance Video Processing

    Get PDF
    Digital video processing demands have and will continue to grow at unprecedented rates. Growth comes from ever increasing volume of data, demand for higher resolution, higher frame rates, and the need for high capacity communications. Moreover, economic realities force continued reductions in size, weight and power requirements. The ever-changing needs and complexities associated with effective video processing systems leads to the consideration of dynamically reconfigurable systems. The goal of this dissertation research was to develop and demonstrate the viability of integrated parallel processing system that effectively and efficiently apply pre-optimized hardware cores for processing video streamed data. Digital video is decomposed into packets which are then distributed over a group of parallel video processing cores. Real time processing requires an effective task scheduler that distributes video packets efficiently to any of the reconfigurable distributed processing nodes across the framework, with the nodes running on FPGA reconfigurable logic in an inherently Virtual\u27 mode. The developed framework, coupled with the use of hardware techniques for dynamic processing optimization achieves an optimal cost/power/performance realization for video processing applications. The system is evaluated by testing processor utilization relative to I/O bandwidth and algorithm latency using a separable 2-D FIR filtering system, and a dynamic pixel processor. For these applications, the system can achieve performance of hundreds of 640x480 video frames per second across an eight lane Gen I PCIe bus. Overall, optimal performance is achieved in the sense that video data is processed at the maximum possible rate that can be streamed through the processing cores. This performance, coupled with inherent ability to dynamically add new algorithms to the described dynamically reconfigurable distributed processing framework, creates new opportunities for realizable and economic hardware virtualization.\u2

    The AXIOM software layers

    Get PDF
    AXIOM project aims at developing a heterogeneous computing board (SMP-FPGA).The Software Layers developed at the AXIOM project are explained.OmpSs provides an easy way to execute heterogeneous codes in multiple cores. People and objects will soon share the same digital network for information exchange in a world named as the age of the cyber-physical systems. The general expectation is that people and systems will interact in real-time. This poses pressure onto systems design to support increasing demands on computational power, while keeping a low power envelop. Additionally, modular scaling and easy programmability are also important to ensure these systems to become widespread. The whole set of expectations impose scientific and technological challenges that need to be properly addressed.The AXIOM project (Agile, eXtensible, fast I/O Module) will research new hardware/software architectures for cyber-physical systems to meet such expectations. The technical approach aims at solving fundamental problems to enable easy programmability of heterogeneous multi-core multi-board systems. AXIOM proposes the use of the task-based OmpSs programming model, leveraging low-level communication interfaces provided by the hardware. Modular scalability will be possible thanks to a fast interconnect embedded into each module. To this aim, an innovative ARM and FPGA-based board will be designed, with enhanced capabilities for interfacing with the physical world. Its effectiveness will be demonstrated with key scenarios such as Smart Video-Surveillance and Smart Living/Home (domotics).Peer ReviewedPostprint (author's final draft

    Low-cost, high-speed parallel FIR filters for RFSoC front-ends enabled by CλaSH

    Get PDF
    We present a new low-cost, high-speed parallel FIR filter generator targeting the Xilinx Radio Frequency System on Chip (RFSoC) and direct RF sampling applications. We compose two existing approaches in a novel hierarchy: efficient parallelism with Fast FIR Algorithm (FFA) structures, and efficient multiplierless FIR implementations with Hcub. The resource usage advantages (in both area and type) are compared with similar output from the traditional architecture, exemplified by vendor tools, as well as the Hcub-based filters without the FFA optimisation. Although these techniques are well studied individually in the literature, they have not enjoyed mainstream use as their structural complexity proves awkward to capture with traditional Hardware Description Languages (HDLs). This work continues a discussion of the use of functional programming techniques in hardware description, highlighting the benefits of having easily composable circuit generators

    Data reuse design exploration in OmpSs@FPGA

    Get PDF
    In this thesis, the OmpSs@FPGA tool chain has been extended to try to reduce the overall communication time due to copies of data when it is possible to reuse data already in the BRAM of the accelerators

    DIstributed VIRtual System (DIVIRS) Project

    Get PDF
    The development of Prospero moved from the University of Washington to ISI and several new versions of the software were released from ISI during the contract period. Changes in the first release from ISI included bug fixes and extensions to support the needs of specific users. Among these changes was a new option to directory queries that allows attributes to be returned for all files in a directory together with the directory listing. This change greatly improves the performance of their server and reduces the number of packets sent across their trans-pacific connection to the rest of the internet. Several new access method were added to the Prospero file method. The Prospero Data Access Protocol was designed, to support secure retrieval of data from systems running Prospero

    The Customizable Virtual FPGA: Generation, System Integration and Configuration of Application-Specific Heterogeneous FPGA Architectures

    Get PDF
    In den vergangenen drei Jahrzehnten wurde die Entwicklung von Field Programmable Gate Arrays (FPGAs) stark von Moore’s Gesetz, Prozesstechnologie (Skalierung) und kommerziellen MĂ€rkten beeinflusst. State-of-the-Art FPGAs bewegen sich einerseits dem Allzweck nĂ€her, aber andererseits, da FPGAs immer mehr traditionelle DomĂ€nen der Anwendungsspezifischen integrierten Schaltungen (ASICs) ersetzt haben, steigen die Effizienzerwartungen. Mit dem Ende der Dennard-Skalierung können Effizienzsteigerungen nicht mehr auf Technologie-Skalierung allein zurĂŒckgreifen. Diese Facetten und Trends in Richtung rekonfigurierbarer System-on-Chips (SoCs) und neuen Low-Power-Anwendungen wie Cyber Physical Systems und Internet of Things erfordern eine bessere Anpassung der Ziel-FPGAs. Neben den Trends fĂŒr den Mainstream-Einsatz von FPGAs in Produkten des tĂ€glichen Bedarfs und Services wird es vor allem bei den jĂŒngsten Entwicklungen, FPGAs in Rechenzentren und Cloud-Services einzusetzen, notwendig sein, eine sofortige PortabilitĂ€t von Applikationen ĂŒber aktuelle und zukĂŒnftige FPGA-GerĂ€te hinweg zu gewĂ€hrleisten. In diesem Zusammenhang kann die Hardware-Virtualisierung ein nahtloses Mittel fĂŒr PlattformunabhĂ€ngigkeit und PortabilitĂ€t sein. Ehrlich gesagt stehen die Zwecke der Anpassung und der Virtualisierung eigentlich in einem Konfliktfeld, da die Anpassung fĂŒr die Effizienzsteigerung vorgesehen ist, wĂ€hrend jedoch die Virtualisierung zusĂ€tzlichen FlĂ€chenaufwand hinzufĂŒgt. Die Virtualisierung profitiert aber nicht nur von der Anpassung, sondern fĂŒgt auch mehr FlexibilitĂ€t hinzu, da die Architektur jederzeit verĂ€ndert werden kann. Diese Besonderheit kann fĂŒr adaptive Systeme ausgenutzt werden. Sowohl die Anpassung als auch die Virtualisierung von FPGA-Architekturen wurden in der Industrie bisher kaum adressiert. Trotz einiger existierenden akademischen Werke können diese Techniken noch als unerforscht betrachtet werden und sind aufstrebende Forschungsgebiete. Das Hauptziel dieser Arbeit ist die Generierung von FPGA-Architekturen, die auf eine effiziente Anpassung an die Applikation zugeschnitten sind. Im Gegensatz zum ĂŒblichen Ansatz mit kommerziellen FPGAs, bei denen die FPGA-Architektur als gegeben betrachtet wird und die Applikation auf die vorhandenen Ressourcen abgebildet wird, folgt diese Arbeit einem neuen Paradigma, in dem die Applikation oder Applikationsklasse fest steht und die Zielarchitektur auf die effiziente Anpassung an die Applikation zugeschnitten ist. Dies resultiert in angepassten anwendungsspezifischen FPGAs. Die drei SĂ€ulen dieser Arbeit sind die Aspekte der Virtualisierung, der Anpassung und des Frameworks. Das zentrale Element ist eine weitgehend parametrierbare virtuelle FPGA-Architektur, die V-FPGA genannt wird, wobei sie als primĂ€res Ziel auf jeden kommerziellen FPGA abgebildet werden kann, wĂ€hrend Anwendungen auf der virtuellen Schicht ausgefĂŒhrt werden. Dies sorgt fĂŒr PortabilitĂ€t und Migration auch auf Bitstream-Ebene, da die Spezifikation der virtuellen Schicht bestehen bleibt, wĂ€hrend die physische Plattform ausgetauscht werden kann. DarĂŒber hinaus wird diese Technik genutzt, um eine dynamische und partielle Rekonfiguration auf Plattformen zu ermöglichen, die sie nicht nativ unterstĂŒtzen. Neben der Virtualisierung soll die V-FPGA-Architektur auch als eingebettetes FPGA in ein ASIC integriert werden, das effiziente und dennoch flexible System-on-Chip-Lösungen bietet. Daher werden Zieltechnologie-Abbildungs-Methoden sowohl fĂŒr Virtualisierung als auch fĂŒr die physikalische Umsetzung adressiert und ein Beispiel fĂŒr die physikalische Umsetzung in einem 45 nm Standardzellen Ansatz aufgezeigt. Die hochflexible V-FPGA-Architektur kann mit mehr als 20 Parametern angepasst werden, darunter LUT-Grösse, Clustering, 3D-Stacking, Routing-Struktur und vieles mehr. Die Auswirkungen der Parameter auf FlĂ€che und Leistung der Architektur werden untersucht und eine umfangreiche Analyse von ĂŒber 1400 BenchmarklĂ€ufen zeigt eine hohe Parameterempfindlichkeit bei Abweichungen bis zu ±95, 9% in der FlĂ€che und ±78, 1% in der Leistung, was die hohe Bedeutung von Anpassung fĂŒr Effizienz aufzeigt. Um die Parameter systematisch an die BedĂŒrfnisse der Applikation anzupassen, wird eine parametrische Entwurfsraum-Explorationsmethode auf der Basis geeigneter FlĂ€chen- und Zeitmodellen vorgeschlagen. Eine Herausforderung von angepassten Architekturen ist der Entwurfsaufwand und die Notwendigkeit fĂŒr angepasste Werkzeuge. Daher umfasst diese Arbeit ein Framework fĂŒr die Architekturgenerierung, die Entwurfsraumexploration, die Anwendungsabbildung und die Evaluation. Vor allem ist der V-FPGA in einem vollstĂ€ndig synthetisierbaren generischen Very High Speed Integrated Circuit Hardware Description Language (VHDL) Code konzipiert, der sehr flexibel ist und die Notwendigkeit fĂŒr externe Codegeneratoren eliminiert. Systementwickler können von verschiedenen Arten von generischen SoC-Architekturvorlagen profitieren, um die Entwicklungszeit zu reduzieren. Alle notwendigen Konstruktionsschritte fĂŒr die Applikationsentwicklung und -abbildung auf den V-FPGA werden durch einen Tool-Flow fĂŒr Entwurfsautomatisierung unterstĂŒtzt, der eine Sammlung von vorhandenen kommerziellen und akademischen Werkzeugen ausnutzt, die durch geeignete Modelle angepasst und durch ein neues Werkzeug namens V-FPGA-Explorer ergĂ€nzt werden. Dieses neue Tool fungiert nicht nur als Back-End-Tool fĂŒr die Anwendungsabbildung auf dem V-FPGA sondern ist auch ein grafischer Konfigurations- und Layout-Editor, ein Bitstream-Generator, ein Architekturdatei-Generator fĂŒr die Place & Route Tools, ein Script-Generator und ein Testbenchgenerator. Eine Besonderheit ist die UnterstĂŒtzung der Just-in-Time-Kompilierung mit schnellen Algorithmen fĂŒr die In-System Anwendungsabbildung. Die Arbeit schliesst mit einigen AnwendungsfĂ€llen aus den Bereichen industrielle Prozessautomatisierung, medizinische Bildgebung, adaptive Systeme und Lehre ab, in denen der V-FPGA eingesetzt wird

    Implementation of Ultra-Low Latency and High-Speed Communication Channels for an FPGA-Based HPC Cluster

    Get PDF
    RÉSUMÉ Les clusters basĂ©s sur les FPGA bĂ©nĂ©ficient de leur flexibilitĂ© et de leurs performances en termes de puissance de calcul et de faible consommation. Et puisque la consommation de puissance devient un Ă©lĂ©ment de plus en plus importants sur le marchĂ© des superordinateurs, le domaine d’exploration multi-FPGA devient chaque annĂ©e plus populaire. Les performances des ordinateurs n’ont jamais cessĂ© d’augmenter mais la latence des rĂ©seaux d’interconnexion n’a pas suivi leur taux d’amĂ©lioration. Dans le but d’augmenter le niveau d’abstraction et les fonctionnalitĂ©s des interconnexions, la complexitĂ© des piles de communication atteinte Ă  nos jours engendre des coĂ»ts et affecte la latence des communications, ce qui rend ces piles de communication trĂšs souvent inefficaces, voire inutiles. Les protocoles de communication commerciaux existants et les contrĂŽleurs d’interfaces rĂ©seau FPGA-FPGA n’ont la performance pour supporter ni les applications Ă  temps critique ni un partitionnement Ă©troitement couplĂ© des systĂšmes sur puce. Au lieu de cela, les approches de communication personnalisĂ©es sont souvent prĂ©fĂ©rĂ©es. Dans ce travail, nous proposons une implĂ©mentation de canaux de communication Ă  haut dĂ©bit et Ă  faible latence pour une grappe de FPGA. Le systĂšme est constituĂ© de deux BEE3, chacun contenant 4 FPGA de la famille Virtex-5 interconnectĂ©s par une topologie en anneau. Notre approche exploite la technologie Ă  transducteur Ă  plusieurs gigabits par seconde pour l’obtention d’une bande passante fiable de 8Gbps. Le module de propriĂ©tĂ© intellectuelle (IP) de communication proposĂ© permet le transfert de donnĂ©es entre des milliers de coprocesseurs sur le rĂ©seau, grĂące Ă  l’implĂ©mentation d’un rĂ©seau direct avec capacitĂ© de routage de paquets. Les rĂ©sultats expĂ©rimentaux ont montrĂ© une latence de seulement 34 cycles d’horloge entre deux noeuds voisins, ce qui est un des plus bas parmi ceux rapportĂ©s dans la littĂ©rature. En outre, nous proposons une architecture adaptĂ©e au calcul Ă  haute performance qui comporte un traitement extensible, parallĂšle et distribuĂ©. Pour une plateforme Ă  8 FPGA, l’architecture fournit 35.6Go/s de bande passante effective pour la mĂ©moire externe, une bande passante globale de rĂ©seau de 128Gbps et une puissance de calcul de 8.9GFLOPS. Un solveur matrice-vecteur de grande taille est partitionnĂ© et mis en oeuvre Ă  travers le cluster. Nous avons obtenu une performance et une efficacitĂ© de calcul concurrentielles grĂące Ă  la faible empreinte du protocole de communication entre les Ă©lĂ©ments de traitement distribuĂ©s. Ce travail contribue Ă  soutenir de nouvelles recherches dans le domaine du calcul parallĂšle intensif et permet le partitionnement de systĂšme sur puce Ă  grande taille sur des clusters Ă  base de FPGA.----------ABSTRACT An FPGA-based cluster profits from the flexibility and the performance potential FPGA technology provides. Since price and power consumption are becoming increasingly important elements in the High-Performance Computing market, the multi-FPGA exploration field is getting more popular each year. Network latency has failed to keep up with other improvements in computer performance. Complex communication stacks have sacrificed latency and increased overhead to achieve other goals, being in most of the time inefficient and unnecessary. The existing commercial offthe- shelf communication protocols and Network Interfaces Controllers for FPGA-to-FPGA interconnection lack of performance to support time-critical applications and tightly coupled System-on-Chip partitioning. Instead, custom communication approaches are preferred. In this work, ultra-low latency and high-speed communication channels for an FPGA-based cluster are presented. Two BEE3s grouping 8 FPGAs Virtex-5 interconnected in a ring topology, compose the targeting platform. Our approach exploits Multi-Gigabit Transceiver technology to achieve reliable 8Gbps channel bandwidth. The proposed communication IP supports data transfer from coprocessors over the network, by means of a direct network implementation with hop-by-hop packet routing capability. Experimental results showed a latency of only 34 clock cycles between two neighboring nodes, being one of the lowest in the literature. In addition, it is proposed an architecture suitable for High-Performance Computing which includes performing scalable, parallel, and distributed processing. For an 8 FPGAs platform, the architecture provides 35.6GB/s off-chip memory throughput, 128Gbps network aggregate bandwidth, and 8.9GFLOPS computing power. A large and dense matrix-vector solver is partitioned and implemented across the cluster. We achieved competitive performance and computational efficiency as a result of the low communication overhead among the distributed processing elements. This work contributes to support new researches on the intense parallel computing fields, and enables large System-on-Chip partitioning and scaling on FPGA-based clusters

    PiCo: A Domain-Specific Language for Data Analytics Pipelines

    Get PDF
    In the world of Big Data analytics, there is a series of tools aiming at simplifying programming applications to be executed on clusters. Although each tool claims to provide better programming, data and execution models—for which only informal (and often confusing) semantics is generally provided—all share a common under- lying model, namely, the Dataflow model. Using this model as a starting point, it is possible to categorize and analyze almost all aspects about Big Data analytics tools from a high level perspective. This analysis can be considered as a first step toward a formal model to be exploited in the design of a (new) framework for Big Data analytics. By putting clear separations between all levels of abstraction (i.e., from the runtime to the user API), it is easier for a programmer or software designer to avoid mixing low level with high level aspects, as we are often used to see in state-of-the-art Big Data analytics frameworks. From the user-level perspective, we think that a clearer and simple semantics is preferable, together with a strong separation of concerns. For this reason, we use the Dataflow model as a starting point to build a programming environment with a simplified programming model implemented as a Domain-Specific Language, that is on top of a stack of layers that build a prototypical framework for Big Data analytics. The contribution of this thesis is twofold: first, we show that the proposed model is (at least) as general as existing batch and streaming frameworks (e.g., Spark, Flink, Storm, Google Dataflow), thus making it easier to understand high-level data-processing applications written in such frameworks. As result of this analysis, we provide a layered model that can represent tools and applications following the Dataflow paradigm and we show how the analyzed tools fit in each level. Second, we propose a programming environment based on such layered model in the form of a Domain-Specific Language (DSL) for processing data collections, called PiCo (Pipeline Composition). The main entity of this programming model is the Pipeline, basically a DAG-composition of processing elements. This model is intended to give the user an unique interface for both stream and batch processing, hiding completely data management and focusing only on operations, which are represented by Pipeline stages. Our DSL will be built on top of the FastFlow library, exploiting both shared and distributed parallelism, and implemented in C++11/14 with the aim of porting C++ into the Big Data world
    • 

    corecore