156 research outputs found
Hardware Acceleration of Electronic Design Automation Algorithms
With the advances in very large scale integration (VLSI) technology, hardware is going
parallel. Software, which was traditionally designed to execute on single core microprocessors,
now faces the tough challenge of taking advantage of this parallelism, made available
by the scaling of hardware. The work presented in this dissertation studies the acceleration
of electronic design automation (EDA) software on several hardware platforms such
as custom integrated circuits (ICs), field programmable gate arrays (FPGAs) and graphics
processors. This dissertation concentrates on a subset of EDA algorithms which are heavily
used in the VLSI design flow, and also have varying degrees of inherent parallelism
in them. In particular, Boolean satisfiability, Monte Carlo based statistical static timing
analysis, circuit simulation, fault simulation and fault table generation are explored. The
architectural and performance tradeoffs of implementing the above applications on these
alternative platforms (in comparison to their implementation on a single core microprocessor)
are studied. In addition, this dissertation also presents an automated approach to
accelerate uniprocessor code using a graphics processing unit (GPU). The key idea is to
partition the software application into kernels in an automated fashion, such that multiple
instances of these kernels, when executed in parallel on the GPU, can maximally benefit
from the GPU?s hardware resources.
The work presented in this dissertation demonstrates that several EDA algorithms can
be successfully rearchitected to maximally harness their performance on alternative platforms
such as custom designed ICs, FPGAs and graphic processors, and obtain speedups upto 800X. The approaches in this dissertation collectively aim to contribute towards enabling
the computer aided design (CAD) community to accelerate EDA algorithms on arbitrary
hardware platforms
Methoden und Beschreibungssprachen zur Modellierung und Verifikation vonSchaltungen und Systemen: MBMV 2015 - Tagungsband, Chemnitz, 03. - 04. März 2015
Der Workshop Methoden und Beschreibungssprachen zur Modellierung und Verifikation von Schaltungen und Systemen (MBMV 2015) findet nun schon zum 18. mal statt. Ausrichter sind in diesem Jahr die Professur Schaltkreis- und Systementwurf der Technischen Universität Chemnitz und das Steinbeis-Forschungszentrum Systementwurf und Test.
Der Workshop hat es sich zum Ziel gesetzt, neueste Trends, Ergebnisse und aktuelle Probleme auf dem Gebiet der Methoden zur Modellierung und Verifikation sowie der Beschreibungssprachen digitaler, analoger und Mixed-Signal-Schaltungen zu diskutieren. Er soll somit ein Forum zum Ideenaustausch sein.
Weiterhin bietet der Workshop eine Plattform für den Austausch zwischen Forschung und Industrie sowie zur Pflege bestehender und zur Knüpfung neuer Kontakte. Jungen Wissenschaftlern erlaubt er, ihre Ideen und Ansätze einem breiten Publikum aus Wissenschaft und Wirtschaft zu präsentieren und im Rahmen der Veranstaltung auch fundiert zu diskutieren. Sein langjähriges Bestehen hat ihn zu einer festen Größe in vielen Veranstaltungskalendern gemacht. Traditionell sind auch die Treffen der ITGFachgruppen an den Workshop angegliedert.
In diesem Jahr nutzen zwei im Rahmen der InnoProfile-Transfer-Initiative durch das Bundesministerium für Bildung und Forschung geförderte Projekte den Workshop, um in zwei eigenen Tracks ihre Forschungsergebnisse einem breiten Publikum zu präsentieren. Vertreter der Projekte Generische Plattform für Systemzuverlässigkeit und Verifikation (GPZV) und GINKO - Generische Infrastruktur zur nahtlosen energetischen Kopplung von Elektrofahrzeugen stellen Teile ihrer gegenwärtigen Arbeiten vor. Dies bereichert denWorkshop durch zusätzliche Themenschwerpunkte und bietet eine wertvolle Ergänzung zu den Beiträgen der Autoren. [... aus dem Vorwort
Synthesis of FPGA-based accelerators implementing recursive algorithms
Doutoramento em Engenharia InformáticaO desenvolvimento de sistemas computacionais é um processo complexo, com múltiplas etapas, que requer uma análise profunda do problema, levando
em consideração as limitações e os requisitos aplicáveis. Tal tarefa envolve a exploração de técnicas alternativas e de algoritmos computacionais para optimizar o sistema e satisfazer os requisitos estabelecidos. Neste contexto, uma das mais importantes etapas é a análise e implementação de algoritmos
computacionais.
Enormes avanços tecnológicos no âmbito das FPGAs (Field-Programmable Gate Arrays) tornaram possível o desenvolvimento de sistemas de engenharia extremamente complexos. Contudo, o número de transístores disponíveis por chip está a crescer mais rapidamente do que a capacidade que temos para desenvolver sistemas que tirem proveito desse crescimento. Esta limitação já bem conhecida, antes de se revelar com FPGAs, já se verificava com ASICs (Application-Specific Integrated Circuits) e tem vindo a aumentar
continuamente.
O desenvolvimento de sistemas com base em FPGAs de alta capacidade envolve uma grande variedade de ferramentas, incluindo métodos para a implementação eficiente de algoritmos computacionais. Esta tese pretende proporcionar uma contribuição nesta área, tirando partido da reutilização, do aumento do nível de abstracção e de especificações algorítmicas mais automatizadas e claras. Mais especificamente, é apresentado um estudo que foi levado a cabo no sentido de obter critérios relativos à implementação em hardware de algoritmos recursivos versus iterativos. Depois de serem apresentadas algumas das estratégias para implementar recursividade em hardware mais significativas, descreve-se, em pormenor, um conjunto de algoritmos para resolver problemas de pesquisa combinatória (considerados enquanto exemplos de aplicação). Versões recursivas e iterativas destes algoritmos foram implementados e testados em FPGA. Com base nos resultados obtidos, é feita uma cuidada análise comparativa.
Novas ferramentas e técnicas de investigação que foram desenvolvidas no âmbito desta tese são também discutidas e demonstradas.Design of computational systems is a complex multistage process which requires a deep analysis of the problem, taking into account relevant limitations and constraints as well as software/hardware co-design. Such task involves
exploring competitive techniques and computational algorithms, enabling the system to be optimized while satisfying given requirements. In this context, one of the most important stages is analysis and implementation of computational
algorithms.
Tremendous progress in the scope of FPGA (Field-Programmable Gate Array) technology has made it possible to design very complicated engineering systems. However, the number of available transistors grows faster than the ability to meaningfully design with them. This situation is a well known design productivity gap, which was inherited by FPGA from ASIC (Application-Specific
Integrated Circuit) and which is increasing continuously.
Developing engineering systems on the basis of high capacity FPGAs involves a wide variety of design tools, including methods for efficient implementation of computational algorithms. The thesis is intended to provide a contribution in this area by aiming at reuse, high level abstraction, automation, and clearness of algorithmic specifications. More specifically, it presents research studies which have been carried out in order to obtain criteria regarding implementation
of recursive vs. iterative algorithms in hardware. After describing some of the most relevant strategies for implementing recursion in hardware, a selection of algorithms for solving combinatorial search problems (considered as application examples) are also described in detail. Iterative and recursive versions of these algorithms have been implemented and tested in FPGA.
Taking into consideration the results obtained, a careful comparative analysis is given.
New research-oriented tools and techniques for hardware design which have been developed in the scope of this thesis are also discussed and
demonstrated
Study of Fine-Grained, Irregular Parallel Applications on a Many-Core Processor
This dissertation demonstrates the possibility of obtaining strong speedups for a variety of parallel applications versus the best serial and parallel implementations on commodity platforms. These results were obtained using the PRAM-inspired Explicit Multi-Threading (XMT) many-core computing platform, which is designed to efficiently support execution of both serial and parallel code and switching between the two.
Biconnectivity: For finding the biconnected components of a graph, we demonstrate speedups of 9x to 33x on XMT relative to the best serial algorithm using a relatively modest silicon budget. Further evidence suggests that speedups of 21x to 48x are possible. For graph connectivity, we demonstrate that XMT outperforms two contemporary NVIDIA GPUs of similar or greater silicon area. Prior studies of parallel biconnectivity algorithms achieved at most a 4x speedup, but we could not find biconnectivity code for GPUs to compare biconnectivity against them.
Triconnectivity: We present a parallel solution to the problem of determining the triconnected components of an undirected graph. We obtain significant speedups on XMT over the only published optimal (linear-time) serial implementation of a triconnected components algorithm running on a modern CPU. To our knowledge, no other parallel implementation of a triconnected components algorithm has been published for any platform.
Burrows-Wheeler compression: We present novel work-optimal parallel algorithms for Burrows-Wheeler compression and decompression of strings over a constant alphabet and their empirical evaluation. To validate these theoretical algorithms, we implement them on XMT and show speedups of up to 25x for compression, and 13x for decompression, versus bzip2, the de facto standard implementation of Burrows-Wheeler compression.
Fast Fourier transform (FFT): Using FFT as an example, we examine the impact that adoption of some enabling technologies, including silicon photonics, would have on the performance of a many-core architecture. The results show that a single-chip many-core processor could potentially outperform a large high-performance computing cluster.
Boosted decision trees: This chapter focuses on the hybrid memory architecture of the XMT computer platform, a key part of which is a flexible all-to-all interconnection network that connects processors to shared memory modules. First, to understand some recent advances in GPU memory architecture and how they relate to this hybrid memory architecture, we use microbenchmarks including list ranking. Then, we contrast the scalability of applications with that of routines. In particular, regardless of the scalability needs of full applications, some routines may involve smaller problem sizes, and in particular smaller levels of parallelism, perhaps even serial. To see how a hybrid memory architecture can benefit such applications, we simulate a computer with such an architecture and demonstrate the potential for a speedup of 3.3X over NVIDIA's most powerful GPU to date for XGBoost, an implementation of boosted decision trees, a timely machine learning approach.
Boolean satisfiability (SAT): SAT is an important performance-hungry problem with applications in many problem domains. However, most work on parallelizing SAT solvers has focused on coarse-grained, mostly embarrassing parallelism. Here, we study fine-grained parallelism that can speed up existing sequential SAT solvers. We show the potential for speedups of up to 382X across a variety of problem instances. We hope that these results will stimulate future research
Precision analysis for hardware acceleration of numerical algorithms
The precision used in an algorithm affects the error and performance of individual computations, the
memory usage, and the potential parallelism for a fixed hardware budget. However, when migrating
an algorithm onto hardware, the potential improvements that can be obtained by tuning the precision
throughout an algorithm to meet a range or error specification are often overlooked; the major reason
is that it is hard to choose a number system which can guarantee any such specification can be met.
Instead, the problem is mitigated by opting to use IEEE standard double precision arithmetic so as to be
‘no worse’ than a software implementation. However, the flexibility in the number representation is one
of the key factors that can be exploited on reconfigurable hardware such as FPGAs, and hence ignoring
this potential significantly limits the performance achievable.
In order to optimise the performance of hardware reliably, we require a method that can tractably
calculate tight bounds for the error or range of any variable within an algorithm, but currently only a
handful of methods to calculate such bounds exist, and these either sacrifice tightness or tractability,
whilst simulation-based methods cannot guarantee the given error estimate. This thesis presents a new
method to calculate these bounds, taking into account both input ranges and finite precision effects,
which we show to be, in general, tighter in comparison to existing methods; this in turn can be used to
tune the hardware to the algorithm specifications.
We demonstrate the use of this software to optimise hardware for various algorithms to accelerate
the solution of a system of linear equations, which forms the basis of many problems in engineering
and science, and show that significant performance gains can be obtained by using this new approach in
conjunction with more traditional hardware optimisations
Hardware synthesis from high-level scenario specifications
PhD ThesisThe behaviour of many systems can be partitioned into scenarios. These facilitate
engineers’ understanding of the specifications, and can be composed into efficient
implementations via a form of high-level synthesis. In this work, we focus on highly
concurrent systems, whose scenarios are typically described using concurrency models
such as partial orders, Petri nets and data-flow structures.
In this thesis, we study different aspects of hardware synthesis from high-level
scenario specifications. We propose new formal models to simplify the specification
of concurrent systems, and algorithms for hardware synthesis and verification of the
scenario-based models of such systems. We also propose solutions for mapping scenariobased
systems on silicon and evaluate their efficiency.
Our experiments show that the proposed approaches improve the design of concurrent
systems. The new formalisms can break down complex specifications into
significantly simpler scenarios automatically, and can be used to fully model the dataflow
of operations of reconfigurable event-driven systems. The proposed heuristics for
mapping the scenarios of a system to a digital circuit supports encoding constraints,
unlike existing methods, and can cope with specifications comprising hundreds of
scenarios at the cost of only 5% of area overhead compared to exact algorithms.
These experiments are driven by three case studies: (1) hardware synthesis of control
architectures, e.g. microprocessor control units; (2) acceleration of the ordinal pattern
encoding, i.e. an algorithm for detecting repetitive patterns within data streams; (3) and
acceleration of computational drug discovery, i.e. computation of shortest paths in large
protein-interaction networks.
Our findings are employed to design two prototypes, which have a practical value for
the considered case studies. The ordinal pattern encoding accelerator is asynchronous,
highly resilient to unstable voltage supply, and designed to perform a range of computations
via runtime reconfiguration. The drug discovery accelerator is synchronous, and
up to three orders of magnitude faster than conventional software implementations
A Survey of Recent Developments in Testability, Safety and Security of RISC-V Processors
With the continued success of the open RISC-V architecture, practical deployment of RISC-V processors necessitates an in-depth consideration of their testability, safety and security aspects. This survey provides an overview of recent developments in this quickly-evolving field. We start with discussing the application of state-of-the-art functional and system-level test solutions to RISC-V processors. Then, we discuss the use of RISC-V processors for safety-related applications; to this end, we outline the essential techniques necessary to obtain safety both in the functional and in the timing domain and review recent processor designs with safety features. Finally, we survey the different aspects of security with respect to RISC-V implementations and discuss the relationship between cryptographic protocols and primitives on the one hand and the RISC-V processor architecture and hardware implementation on the other. We also comment on the role of a RISC-V processor for system security and its resilience against side-channel attacks
- …