193 research outputs found
Automatic Creation of High-Bandwidth Memory Architectures from Domain-Specific Languages: The Case of Computational Fluid Dynamics
Numerical simulations can help solve complex problems. Most of these algorithms are massively parallel and thus good candidates for FPGA acceleration thanks to spatial parallelism. Modern FPGA devices can leverage high-bandwidth memory technologies, but when applications are memory-bound designers must craft advanced communication and memory architectures for efficient data movement and on-chip storage. This development process requires hardware design skills that are uncommon in domain-specific experts.
In this paper, we propose an automated tool flow from a domain-specific language (DSL) for tensor expressions to generate massively-parallel accelerators on HBM-equipped FPGAs. Designers can use this flow to integrate and evaluate various compiler or hardware optimizations. We use computational fluid dynamics (CFD) as a paradigmatic example.
Our flow starts from the high-level specification of tensor operations and combines an MLIR-based compiler with an in-house hardware generation flow to generate systems with parallel accelerators and a specialized memory architecture that moves data efficiently, aiming at fully exploiting the available CPU-FPGA bandwidth.
We simulated applications with millions of elements, achieving up to 103 GFLOPS with one compute unit and custom precision when targeting a Xilinx Alveo U280. Our FPGA implementation is up to 25x more energy efficient than expert-crafted Intel CPU implementations
Acceleration for the many, not the few
Although specialized hardware promises orders of magnitude performance gains, their
uptake has been limited by how challenging it is to program them. Hardware accelerators
present challenges programmers are not used to, exposing details of the hardware that
are often hidden and requiring new programming styles to use them effectively.
Existing programming models often involve learning complex and hardware-specific
APIs, using Domain Specific Languages (DSLs), or programming in customized assembly languages. These programming models for hardware accelerators present a
significant challenge to uptake: a steep, unforgiving, and untransferable learning curve.
However, programming hardware accelerators using traditional programming models
presents a challenge: mapping code not written with hardware accelerators in mind to
accelerators with restricted behaviour.
This thesis presents these challenges in the context of the acceleration equation, and
it presents solutions to it in three different contexts: for regular expression accelerators,
for API-programmable accelerators (with Fourier Transforms as a key case-study) and
for heterogeneous coarse-grained reconfigurable arrays (CGRAs). This thesis shows
that automatically morphing software written in traditional manners to fit hardware
accelerators is possible with no programmer effort and that huge potential speedups are
available
An Enhanced Hardware Description Language Implementation for Improved Design-Space Exploration in High-Energy Physics Hardware Design
Detectors in High-Energy Physics (HEP) have increased tremendously in accuracy, speed and integration. Consequently HEP experiments are confronted with an immense amount of data to be read out, processed and stored. Originally low-level processing has been accomplished in hardware, while more elaborate algorithms have been executed on large computing farms. Field-Programmable Gate Arrays (FPGAs) meet HEP's need for ever higher real-time processing performance by providing programmable yet fast digital logic resources. With the fast move from HEP Digital Signal Processing (DSPing) applications into the domain of FPGAs, related design tools are crucial to realise the potential performance gains. This work reviews Hardware Description Languages (HDLs) in respect to the special needs present in the HEP digital hardware design process. It is especially concerned with the question, how features outside the scope of mainstream digital hardware design can be implemented efficiently into HDLs. It will argue that functional languages are especially suitable for implementation of domain-specific languages, including HDLs. Casestudies examining the implementation complexity of HEP-specific language extensions to the functional HDCaml HDL will prove the viability of the suggested approach
Modelling and characterisation of distributed hardware acceleration
Hardware acceleration has become more commonly utilised in networked computing systems. The growing complexity of applications mean that traditional CPU architectures can no longer meet stringent latency constraints. Alternative computing
architectures such as GPUs and FPGAs are increasingly available, along with simpler, more software-like development flows. The work presented in this thesis characterises the overheads associated with these accelerator architectures. A holistic view encompassing both computation and communication latency must be considered. Experimental results obtained through this work show that networkattached accelerators scale better than server-hosted deployments, and that host ingestion overheads are comparable to network traversal times in some cases. Along with the choice of processing platforms, it is becoming more important to consider
how workloads are partitioned and where in the network tasks are being performed. Manual allocation and evaluation of tasks to network nodes does not scale with network and workload complexity. A mathematical formulation of this problem is presented within this thesis that takes into account all relevant performance metrics.
Unlike other works, this model takes into account growing hardware heterogeneity and workload complexity, and is generalisable to a range of scenarios. This model can be used in an optimisation that generates lower cost results with latency performance close to theoretical maximums compared to naive placement approaches.
With the mathematical formulation and experimental results that characterise hardware accelerator overheads, the work presented in this thesis can be used to make informed design decisions about both where to allocate tasks and deploy accelerators in the network, and the associated costs
Template-based hardware-software codesign for high-performance embedded numerical accelerators
Thesis (Ph. D.)--Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, 2013.Cataloged from PDF version of thesis.Includes bibliographical references (pages 129-132).Sophisticated algorithms for control, state estimation and equalization have tremendous potential to improve performance and create new capabilities in embedded and mobile systems. Traditional implementation approaches are not well suited for porting these algorithmic solutions into practical implementations within embedded system constraints. Most of the technical challenges arise from design approach that manipulates only one level in the design stack, thus being forced to conform to constraints imposed by other levels without question. In tightly constrained environments, like embedded and mobile systems, such approaches have a hard time efficiently delivering and delivering efficiency. In this work we offer a solution that cuts through all the design stack layers. We build flexible structures at the hardware, software and algorithm level, and approach the solution through design space exploration. To do this efficiently we use a template-based hardware-software development flow. The main incentive for template use is, as in software development, to relax the generality vs. efficiency/performance type tradeoffs that appear in solutions striving to achieve run-time flexibility. As a form of static polymorphism, templates typically incur very little performance overhead once the design is instantiated, thus offering the possibility to defer many design decisions until later stages when more is known about the overall system design. However, simply including templates into design flow is not sufficient to result in benefits greater than some level of code reuse. In our work we propose using templates as flexible interfaces between various levels in the design stack. As such, template parameters become the common language that designers at different levels of design hierarchy can use to succinctly express their assumptions and ideas. Thus, it is of great benefit if template parameters map directly and intuitively into models at every level. To showcase the approach we implement a numerical accelerator for embedded Model Predictive Control (MPC) algorithm. While most of this work and design flow are quite general, their full power is realized in search for good solutions to a specific problem. This is best understood in direct comparison with recent works on embedded and high-speed MPC implementations. The controllers we generate outperform published works by a handsome margin in both speed and power consumption, while taking very little time to generate.by Ranko Radovin Sredojević.Ph.D
Fully Programming the Data Plane: A Hardware/Software Approach
Les réseaux définis par logiciel — en anglais Software-Defined Networking (SDN) — sont apparus ces dernières années comme un nouveau paradigme de réseau. SDN introduit une séparation entre les plans de gestion, de contrôle et de données, permettant à ceux-ci d’évoluer de manière indépendante, rompant ainsi avec la rigidité des réseaux traditionnels. En particulier, dans le plan de données, les avancées récentes ont porté sur la définition des langages
de traitement de paquets, tel que P4, et sur la définition d’architectures de commutateurs programmables, par exemple la Protocol Independent Switch Architecture (PISA). Dans cette thèse, nous nous intéressons a l’architecture PISA et évaluons comment exploiter les FPGA comme plateforme de traitement efficace de paquets. Cette problématique est
étudiée a trois niveaux d’abstraction : microarchitectural, programmation et architectural. Au niveau microarchitectural, nous avons proposé une architecture efficace d’un analyseur d’entêtes de paquets pour PISA. L’analyseur de paquets utilise une architecture pipelinée avec propagation en avant — en anglais feed-forward. La complexité de l’architecture est réduite par rapport à l’état de l’art grâce a l’utilisation d’optimisations algorithmiques. Finalement, l’architecture est générée par un compilateur P4 vers C++, combiné à un outil de synthèse de haut niveau. La solution proposée atteint un débit de 100 Gb/s avec une latence comparable à celle d’analyseurs d’entêtes de paquets écrits à la main. Au niveau de la programmation, nous avons proposé une nouvelle méthodologie de conception de synthèse de haut niveau visant à améliorer conjointement la qualité logicielle et matérielle. Nous exploitons les fonctionnalités du C++ moderne pour améliorer à la fois la modularité et la lisibilité du code, tout en conservant (ou améliorant) les résultats du matériel généré.
Des exemples de conception utilisant notre méthodologie, incluant pour l’analyseur d’entête de paquets, ont été rendus publics.----------ABSTRACT: Software-Defined Networking (SDN) has emerged in recent years as a new network paradigm to de-ossify communication networks. Indeed, by offering a clear separation of network concerns
between the management, control, and data planes, SDN allows each of these planes to evolve independently, breaking the rigidity of traditional networks. However, while well
spread in the control and management planes, this de-ossification has only recently reached the data plane with the advent of packet processing languages, e.g. P4, and novel programmable switch architectures, e.g. Protocol Independent Switch Architecture (PISA). In this work, we focus on leveraging the PISA architecture by mainly exploiting the FPGA capabilities for efficient packet processing. In this way, we address this issue at different
abstraction levels: i) microarchitectural; ii) programming; and, iii) architectural. At the microarchitectural level, we have proposed an efficient FPGA-based packet parser
architecture, which is a major PISA’s component. The proposed packet parser follows a feedforward
pipeline architecture in which the internal microarchitectural has been meticulously optimized for FPGA implementation. The architecture is automatically generated by a P4- to-C++ compiler after several rounds of graph optimizations. The proposed solution achieves 100 Gb/s line rate with latency comparable to hand-written packet parsers. The throughput scales from 10 Gb/s to 160 Gb/s with moderate increase in resource consumption. Both the compiler and the packet parser codebase have been open-sourced to permit reproducibility. At the programming level, we have proposed a novel High-Level Synthesis (HLS) design methodology aiming at improving software and hardware quality. We have employed this novel methodology when designing the packet parser. In our work, we have exploited features of modern C++ that improves at the same time code modularity and readability while keeping (or improving) the results of the generated hardware. Design examples using our methodology have been publicly released
Recommended from our members
Efficient FPGA implementation and power modelling of image and signal processing IP cores
This thesis was submitted for the degree of Doctor of Philosophy and awarded by Brunel University.Field Programmable Gate Arrays (FPGAs) are the technology of choice in a number ofimage
and signal processing application areas such as consumer electronics, instrumentation,
medical data processing and avionics due to their reasonable energy consumption, high performance, security, low design-turnaround time and reconfigurability. Low power FPGA
devices are also emerging as competitive solutions for mobile and thermally constrained platforms. Most computationally intensive image and signal processing algorithms also consume a lot of power leading to a number of issues including reduced mobility, reliability concerns and increased design cost among others. Power dissipation has become one of the most important challenges, particularly for FPGAs. Addressing this problem requires optimisation and awareness at all levels in the design flow. The key achievements of the
work presented in this thesis are summarised here. Behavioural level optimisation strategies have been used for implementing matrix product and inner product through the use of mathematical techniques such as Distributed Arithmetic (DA) and its variations including offset binary coding, sparse factorisation and novel vector level transformations. Applications to test the impact of these algorithmic and arithmetic transformations include the fast Hadamard/Walsh transforms and Gaussian mixture models. Complete design space exploration has been performed on these cores, and where appropriate, they have been shown to clearly outperform comparable existing implementations. At the architectural level, strategies such as parallelism, pipelining and systolisation have been successfully applied for the design and optimisation of a number of
cores including colour space conversion, finite Radon transform, finite ridgelet transform and circular convolution. A pioneering study into the influence of supply voltage scaling for FPGA based designs, used in conjunction with performance enhancing strategies such as parallelism and pipelining has been performed. Initial results are very promising and indicated significant potential for future research in this area.
A key contribution of this work includes the development of a novel high level power macromodelling technique for design space exploration and characterisation of custom IP cores for FPGAs, called Functional Level Power Analysis and Modelling (FLPAM). FLPAM
is scalable, platform independent and compares favourably with existing approaches. A hybrid, top-down design flow paradigm integrating FLPAM with commercially available design tools for systematic optimisation of IP cores has also been developed
A sensor node soC architecture for extremely autonomous wireless sensor networks
Tese de Doutoramento em Engenharia Eletrónica e de Computadores (PDEEC) (especialidade em Informática Industrial e Sistemas Embebidos)The Internet of Things (IoT) is revolutionizing the Internet of the future and the
way new smart objects and people are being connected into the world. Its pervasive
computing and communication technologies connect myriads of smart devices, presented
at our everyday things and surrounding objects. Big players in the industry
forecast, by 2020, around 50 billion of smart devices connected in a multitude of scenarios
and heterogeneous applications, sharing data over a true worldwide network.
This will represent a trillion dollar market that everyone wants to take a share.
In a world where everything is being connected, device security and device interoperability
are a paramount. From the sensor to the cloud, this triggers several
technological issues towards connectivity, interoperability and security requirements
on IoT devices. However, fulfilling such requirements is not straightforward. While
the connectivity exposes the device to the Internet, which also raises several security
issues, deploying a standardized communication stack on the endpoint device
in the network edge, highly increases the data exchanged over the network. Moreover,
handling such ever-growing amount of data on resource-constrained devices,
truly affects the performance and the energy consumption. Addressing such issues
requires new technological and architectural approaches to help find solutions to
leverage an accelerated, secure and energy-aware IoT end-device communication.
Throughout this thesis, the developed artifacts triggered the achievement of important
findings that demonstrate: (1) how heterogeneous architectures are nowadays
a perfect solution to deploy endpoint devices in scenarios where not only (heavy
processing) application-specific operations are required, but also network-related capabilities
are major concerns; (2) how accelerating network-related tasks result in a
more efficient device resources utilization, which combining better performance and
increased availability, contributed to an improved overall energy utilization; (3) how
device and data security can benefit from modern heterogeneous architectures that
rely on secure hardware platforms, which are also able to provide security-related
acceleration hardware; (4) how a domain-specific language eases the co-design and
customization of a secure and accelerated IoT endpoint device at the network edge.Internet of Things (IoT) é o conceito que está a revolucionar a Internet do futuro
e a forma como coisas, processos e pessoas se conectam e se relacionam numa infraestrutura
de rede global que interligará, num futuro próximo, um vasto número de
dispositivos inteligentes e de utilização diária. Com uma grande aposta no mercado
IoT por parte dos grandes lÃderes na industria, algumas visões otimistas preveem
para 2020 mais de 50 mil milhões de dispositivos ligados na periferia da rede, partilhando
grandes volumes de dados importantes através da Internet, representando
um mercado multimilionário com imensas oportunidades de negócio.
Num mundo interligado de dispositivos, a interoperabilidade e a segurança é uma
preocupação crescente. Tal preocupação exige inúmeros esforços na exploração de
novas soluções, quer a nÃvel tecnológico quer a nÃvel arquitetural, que visem impulsionar
o desenvolvimento de dispositivos embebidos com maiores capacidades de
desempenho, segurança e eficiência energética, não só apenas do dispositivo em si,
mas também das camadas e protocolos de rede associados. Apesar da integração
de pilhas de comunicação e de protocolos standard das camadas de rede solucionar
problemas associados à conectividade e a interoperabilidade, adiciona a sobrecarga
inerente dos protocolos de comunicação e do crescente volume de dados partilhados
entre os dispositivos e a Internet, afetando severamente o desempenho e a disponibilidade
do mesmo, refletindo-se num maior consumo energético global.
As soluções apresentadas nesta tese permitiram obter resultados que demonstram:
(1) a viabilidade de soluções heterogéneas no desenvolvimento de dispositivos IoT,
onde não só tarefas inerentes à aplicação podem ser aceleradas, mas também tarefas
relacionadas com a comunicação do dispositivo; (2) os benefÃcios da aceleração de
tarefas e protocolos da pilha de rede, que se traduz num melhor desempenho do
dispositivo e aumento da disponibilidade do mesmo, contribuindo para uma melhor
eficiência energética; (3) que plataformas de hardware modernas oferecem mecanismos
de segurança que podem ser utilizados não apenas em prol da segurança do
dispositivo, mas também nas capacidades de comunicação do mesmo; (4) que o desenvolvimento
de uma linguagem de domÃnio especÃfico permite de forma mais eficaz
e eficiente o desenvolvimento e configuração de dispositivos IoT inteligentes.This thesis was supported by a PhD scholarship from Fundação para a Ciência e Tecnologia, SFRH/BD/90162/201
- …