Search CORE

193 research outputs found

Automatic Creation of High-Bandwidth Memory Architectures from Domain-Specific Languages: The Case of Computational Fluid Dynamics

Author: Christian Pilato
Gerald Hempel
Jeronimo Castrillon
Karl F. A. Friebel
Mattia Tibaldi
Stephanie Soldavini
Publication venue
Publication date: 27/07/2022
Field of study

Numerical simulations can help solve complex problems. Most of these algorithms are massively parallel and thus good candidates for FPGA acceleration thanks to spatial parallelism. Modern FPGA devices can leverage high-bandwidth memory technologies, but when applications are memory-bound designers must craft advanced communication and memory architectures for efficient data movement and on-chip storage. This development process requires hardware design skills that are uncommon in domain-specific experts. In this paper, we propose an automated tool flow from a domain-specific language (DSL) for tensor expressions to generate massively-parallel accelerators on HBM-equipped FPGAs. Designers can use this flow to integrate and evaluate various compiler or hardware optimizations. We use computational fluid dynamics (CFD) as a paradigmatic example. Our flow starts from the high-level specification of tensor operations and combines an MLIR-based compiler with an in-house hardware generation flow to generate systems with parallel accelerators and a specialized memory architecture that moves data efficiently, aiming at fully exploiting the available CPU-FPGA bandwidth. We simulated applications with millions of elements, achieving up to 103 GFLOPS with one compute unit and custom precision when targeting a Xilinx Alveo U280. Our FPGA implementation is up to 25x more energy efficient than expert-crafted Intel CPU implementations

arXiv.org e-Print Archive

Archivio istituzionale della ricerca - Politecnico di Milano

Acceleration for the many, not the few

Author: Woodruff Jackson
Publication venue: The University of Edinburgh
Publication date: 06/08/2024
Field of study

Although specialized hardware promises orders of magnitude performance gains, their uptake has been limited by how challenging it is to program them. Hardware accelerators present challenges programmers are not used to, exposing details of the hardware that are often hidden and requiring new programming styles to use them effectively. Existing programming models often involve learning complex and hardware-specific APIs, using Domain Specific Languages (DSLs), or programming in customized assembly languages. These programming models for hardware accelerators present a significant challenge to uptake: a steep, unforgiving, and untransferable learning curve. However, programming hardware accelerators using traditional programming models presents a challenge: mapping code not written with hardware accelerators in mind to accelerators with restricted behaviour. This thesis presents these challenges in the context of the acceleration equation, and it presents solutions to it in three different contexts: for regular expression accelerators, for API-programmable accelerators (with Fourier Transforms as a key case-study) and for heterogeneous coarse-grained reconfigurable arrays (CGRAs). This thesis shows that automatically morphing software written in traditional manners to fit hardware accelerators is possible with no programmer effort and that huge potential speedups are available

Edinburgh Research Archive

An Enhanced Hardware Description Language Implementation for Improved Design-Space Exploration in High-Energy Physics Hardware Design

Author: Mücke M
Publication venue: Graz, Tech. U.
Publication date: 01/01/2007
Field of study

Detectors in High-Energy Physics (HEP) have increased tremendously in accuracy, speed and integration. Consequently HEP experiments are confronted with an immense amount of data to be read out, processed and stored. Originally low-level processing has been accomplished in hardware, while more elaborate algorithms have been executed on large computing farms. Field-Programmable Gate Arrays (FPGAs) meet HEP's need for ever higher real-time processing performance by providing programmable yet fast digital logic resources. With the fast move from HEP Digital Signal Processing (DSPing) applications into the domain of FPGAs, related design tools are crucial to realise the potential performance gains. This work reviews Hardware Description Languages (HDLs) in respect to the special needs present in the HEP digital hardware design process. It is especially concerned with the question, how features outside the scope of mainstream digital hardware design can be implemented efficiently into HDLs. It will argue that functional languages are especially suitable for implementation of domain-specific languages, including HDLs. Casestudies examining the implementation complexity of HEP-specific language extensions to the functional HDCaml HDL will prove the viability of the suggested approach

CERN Document Server

HETEROGENEOUS SYSTEM DESIGN AND OPTIMISATION FOR EMBEDDED VISION SYSTEMS

Author: Jiang Chao
Publication venue
Publication date: 01/08/2020
Field of study

The University of Manchester - Institutional Repository

Modelling and characterisation of distributed hardware acceleration

Author: Cooke Ryan A.
Publication venue
Publication date
Field of study

Hardware acceleration has become more commonly utilised in networked computing systems. The growing complexity of applications mean that traditional CPU architectures can no longer meet stringent latency constraints. Alternative computing architectures such as GPUs and FPGAs are increasingly available, along with simpler, more software-like development flows. The work presented in this thesis characterises the overheads associated with these accelerator architectures. A holistic view encompassing both computation and communication latency must be considered. Experimental results obtained through this work show that networkattached accelerators scale better than server-hosted deployments, and that host ingestion overheads are comparable to network traversal times in some cases. Along with the choice of processing platforms, it is becoming more important to consider how workloads are partitioned and where in the network tasks are being performed. Manual allocation and evaluation of tasks to network nodes does not scale with network and workload complexity. A mathematical formulation of this problem is presented within this thesis that takes into account all relevant performance metrics. Unlike other works, this model takes into account growing hardware heterogeneity and workload complexity, and is generalisable to a range of scenarios. This model can be used in an optimisation that generates lower cost results with latency performance close to theoretical maximums compared to naive placement approaches. With the mathematical formulation and experimental results that characterise hardware accelerator overheads, the work presented in this thesis can be used to make informed design decisions about both where to allocate tasks and deploy accelerators in the network, and the associated costs

Warwick Research Archives Portal Repository

Template-based hardware-software codesign for high-performance embedded numerical accelerators

Author: Sredojević Ranko Radovin
Publication venue: Massachusetts Institute of Technology
Publication date: 01/01/2013
Field of study

Thesis (Ph. D.)--Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, 2013.Cataloged from PDF version of thesis.Includes bibliographical references (pages 129-132).Sophisticated algorithms for control, state estimation and equalization have tremendous potential to improve performance and create new capabilities in embedded and mobile systems. Traditional implementation approaches are not well suited for porting these algorithmic solutions into practical implementations within embedded system constraints. Most of the technical challenges arise from design approach that manipulates only one level in the design stack, thus being forced to conform to constraints imposed by other levels without question. In tightly constrained environments, like embedded and mobile systems, such approaches have a hard time efficiently delivering and delivering efficiency. In this work we offer a solution that cuts through all the design stack layers. We build flexible structures at the hardware, software and algorithm level, and approach the solution through design space exploration. To do this efficiently we use a template-based hardware-software development flow. The main incentive for template use is, as in software development, to relax the generality vs. efficiency/performance type tradeoffs that appear in solutions striving to achieve run-time flexibility. As a form of static polymorphism, templates typically incur very little performance overhead once the design is instantiated, thus offering the possibility to defer many design decisions until later stages when more is known about the overall system design. However, simply including templates into design flow is not sufficient to result in benefits greater than some level of code reuse. In our work we propose using templates as flexible interfaces between various levels in the design stack. As such, template parameters become the common language that designers at different levels of design hierarchy can use to succinctly express their assumptions and ideas. Thus, it is of great benefit if template parameters map directly and intuitively into models at every level. To showcase the approach we implement a numerical accelerator for embedded Model Predictive Control (MPC) algorithm. While most of this work and design flow are quite general, their full power is realized in search for good solutions to a specific problem. This is best understood in direct comparison with recent works on embedded and high-speed MPC implementations. The controllers we generate outperform published works by a handsome margin in both speed and power consumption, while taking very little time to generate.by Ranko Radovin Sredojević.Ph.D

DSpace@MIT

Fully Programming the Data Plane: A Hardware/Software Approach

Author: Santiago dal Silva Jeferson
Publication venue
Publication date: 01/04/2020
Field of study

Les réseaux définis par logiciel — en anglais Software-Defined Networking (SDN) — sont apparus ces dernières années comme un nouveau paradigme de réseau. SDN introduit une séparation entre les plans de gestion, de contrôle et de données, permettant à ceux-ci d’évoluer de manière indépendante, rompant ainsi avec la rigidité des réseaux traditionnels. En particulier, dans le plan de données, les avancées récentes ont porté sur la définition des langages de traitement de paquets, tel que P4, et sur la définition d’architectures de commutateurs programmables, par exemple la Protocol Independent Switch Architecture (PISA). Dans cette thèse, nous nous intéressons a l’architecture PISA et évaluons comment exploiter les FPGA comme plateforme de traitement efficace de paquets. Cette problématique est étudiée a trois niveaux d’abstraction : microarchitectural, programmation et architectural. Au niveau microarchitectural, nous avons proposé une architecture efficace d’un analyseur d’entêtes de paquets pour PISA. L’analyseur de paquets utilise une architecture pipelinée avec propagation en avant — en anglais feed-forward. La complexité de l’architecture est réduite par rapport à l’état de l’art grâce a l’utilisation d’optimisations algorithmiques. Finalement, l’architecture est générée par un compilateur P4 vers C++, combiné à un outil de synthèse de haut niveau. La solution proposée atteint un débit de 100 Gb/s avec une latence comparable à celle d’analyseurs d’entêtes de paquets écrits à la main. Au niveau de la programmation, nous avons proposé une nouvelle méthodologie de conception de synthèse de haut niveau visant à améliorer conjointement la qualité logicielle et matérielle. Nous exploitons les fonctionnalités du C++ moderne pour améliorer à la fois la modularité et la lisibilité du code, tout en conservant (ou améliorant) les résultats du matériel généré. Des exemples de conception utilisant notre méthodologie, incluant pour l’analyseur d’entête de paquets, ont été rendus publics.----------ABSTRACT: Software-Defined Networking (SDN) has emerged in recent years as a new network paradigm to de-ossify communication networks. Indeed, by offering a clear separation of network concerns between the management, control, and data planes, SDN allows each of these planes to evolve independently, breaking the rigidity of traditional networks. However, while well spread in the control and management planes, this de-ossification has only recently reached the data plane with the advent of packet processing languages, e.g. P4, and novel programmable switch architectures, e.g. Protocol Independent Switch Architecture (PISA). In this work, we focus on leveraging the PISA architecture by mainly exploiting the FPGA capabilities for efficient packet processing. In this way, we address this issue at different abstraction levels: i) microarchitectural; ii) programming; and, iii) architectural. At the microarchitectural level, we have proposed an efficient FPGA-based packet parser architecture, which is a major PISA’s component. The proposed packet parser follows a feedforward pipeline architecture in which the internal microarchitectural has been meticulously optimized for FPGA implementation. The architecture is automatically generated by a P4- to-C++ compiler after several rounds of graph optimizations. The proposed solution achieves 100 Gb/s line rate with latency comparable to hand-written packet parsers. The throughput scales from 10 Gb/s to 160 Gb/s with moderate increase in resource consumption. Both the compiler and the packet parser codebase have been open-sourced to permit reproducibility. At the programming level, we have proposed a novel High-Level Synthesis (HLS) design methodology aiming at improving software and hardware quality. We have employed this novel methodology when designing the packet parser. In our work, we have exploited features of modern C++ that improves at the same time code modularity and readability while keeping (or improving) the results of the generated hardware. Design examples using our methodology have been publicly released

PolyPublie

Recommended from our members

Efficient FPGA implementation and power modelling of image and signal processing IP cores

Author: Chandrasekaran Shrutisagar
Publication venue: Brunel University School of Engineering and Design PhD Theses
Publication date: 01/01/2007
Field of study

This thesis was submitted for the degree of Doctor of Philosophy and awarded by Brunel University.Field Programmable Gate Arrays (FPGAs) are the technology of choice in a number ofimage and signal processing application areas such as consumer electronics, instrumentation, medical data processing and avionics due to their reasonable energy consumption, high performance, security, low design-turnaround time and reconfigurability. Low power FPGA devices are also emerging as competitive solutions for mobile and thermally constrained platforms. Most computationally intensive image and signal processing algorithms also consume a lot of power leading to a number of issues including reduced mobility, reliability concerns and increased design cost among others. Power dissipation has become one of the most important challenges, particularly for FPGAs. Addressing this problem requires optimisation and awareness at all levels in the design flow. The key achievements of the work presented in this thesis are summarised here. Behavioural level optimisation strategies have been used for implementing matrix product and inner product through the use of mathematical techniques such as Distributed Arithmetic (DA) and its variations including offset binary coding, sparse factorisation and novel vector level transformations. Applications to test the impact of these algorithmic and arithmetic transformations include the fast Hadamard/Walsh transforms and Gaussian mixture models. Complete design space exploration has been performed on these cores, and where appropriate, they have been shown to clearly outperform comparable existing implementations. At the architectural level, strategies such as parallelism, pipelining and systolisation have been successfully applied for the design and optimisation of a number of cores including colour space conversion, finite Radon transform, finite ridgelet transform and circular convolution. A pioneering study into the influence of supply voltage scaling for FPGA based designs, used in conjunction with performance enhancing strategies such as parallelism and pipelining has been performed. Initial results are very promising and indicated significant potential for future research in this area. A key contribution of this work includes the development of a novel high level power macromodelling technique for design space exploration and characterisation of custom IP cores for FPGAs, called Functional Level Power Analysis and Modelling (FLPAM). FLPAM is scalable, platform independent and compares favourably with existing approaches. A hybrid, top-down design flow paradigm integrating FLPAM with commercially available design tools for systematic optimisation of IP cores has also been developed

Brunel University Research Archive

A sensor node soC architecture for extremely autonomous wireless sensor networks

Author: Gomes Tiago Manuel Ribeiro
Publication venue
Publication date: 22/11/2017
Field of study

Tese de Doutoramento em Engenharia Eletrónica e de Computadores (PDEEC) (especialidade em Informática Industrial e Sistemas Embebidos)The Internet of Things (IoT) is revolutionizing the Internet of the future and the way new smart objects and people are being connected into the world. Its pervasive computing and communication technologies connect myriads of smart devices, presented at our everyday things and surrounding objects. Big players in the industry forecast, by 2020, around 50 billion of smart devices connected in a multitude of scenarios and heterogeneous applications, sharing data over a true worldwide network. This will represent a trillion dollar market that everyone wants to take a share. In a world where everything is being connected, device security and device interoperability are a paramount. From the sensor to the cloud, this triggers several technological issues towards connectivity, interoperability and security requirements on IoT devices. However, fulfilling such requirements is not straightforward. While the connectivity exposes the device to the Internet, which also raises several security issues, deploying a standardized communication stack on the endpoint device in the network edge, highly increases the data exchanged over the network. Moreover, handling such ever-growing amount of data on resource-constrained devices, truly affects the performance and the energy consumption. Addressing such issues requires new technological and architectural approaches to help find solutions to leverage an accelerated, secure and energy-aware IoT end-device communication. Throughout this thesis, the developed artifacts triggered the achievement of important findings that demonstrate: (1) how heterogeneous architectures are nowadays a perfect solution to deploy endpoint devices in scenarios where not only (heavy processing) application-specific operations are required, but also network-related capabilities are major concerns; (2) how accelerating network-related tasks result in a more efficient device resources utilization, which combining better performance and increased availability, contributed to an improved overall energy utilization; (3) how device and data security can benefit from modern heterogeneous architectures that rely on secure hardware platforms, which are also able to provide security-related acceleration hardware; (4) how a domain-specific language eases the co-design and customization of a secure and accelerated IoT endpoint device at the network edge.Internet of Things (IoT) é o conceito que está a revolucionar a Internet do futuro e a forma como coisas, processos e pessoas se conectam e se relacionam numa infraestrutura de rede global que interligará, num futuro próximo, um vasto número de dispositivos inteligentes e de utilização diária. Com uma grande aposta no mercado IoT por parte dos grandes líderes na industria, algumas visões otimistas preveem para 2020 mais de 50 mil milhões de dispositivos ligados na periferia da rede, partilhando grandes volumes de dados importantes através da Internet, representando um mercado multimilionário com imensas oportunidades de negócio. Num mundo interligado de dispositivos, a interoperabilidade e a segurança é uma preocupação crescente. Tal preocupação exige inúmeros esforços na exploração de novas soluções, quer a nível tecnológico quer a nível arquitetural, que visem impulsionar o desenvolvimento de dispositivos embebidos com maiores capacidades de desempenho, segurança e eficiência energética, não só apenas do dispositivo em si, mas também das camadas e protocolos de rede associados. Apesar da integração de pilhas de comunicação e de protocolos standard das camadas de rede solucionar problemas associados à conectividade e a interoperabilidade, adiciona a sobrecarga inerente dos protocolos de comunicação e do crescente volume de dados partilhados entre os dispositivos e a Internet, afetando severamente o desempenho e a disponibilidade do mesmo, refletindo-se num maior consumo energético global. As soluções apresentadas nesta tese permitiram obter resultados que demonstram: (1) a viabilidade de soluções heterogéneas no desenvolvimento de dispositivos IoT, onde não só tarefas inerentes à aplicação podem ser aceleradas, mas também tarefas relacionadas com a comunicação do dispositivo; (2) os benefícios da aceleração de tarefas e protocolos da pilha de rede, que se traduz num melhor desempenho do dispositivo e aumento da disponibilidade do mesmo, contribuindo para uma melhor eficiência energética; (3) que plataformas de hardware modernas oferecem mecanismos de segurança que podem ser utilizados não apenas em prol da segurança do dispositivo, mas também nas capacidades de comunicação do mesmo; (4) que o desenvolvimento de uma linguagem de domínio específico permite de forma mais eficaz e eficiente o desenvolvimento e configuração de dispositivos IoT inteligentes.This thesis was supported by a PhD scholarship from Fundação para a Ciência e Tecnologia, SFRH/BD/90162/201

Universidade do Minho: RepositoriUM