Metoda projektovanja namenskih programabilnih hardverskih akceleratora (A Method for Designing Application-Specific Programmable Hardware Accelerators)
Typically, embedded systems are designed to support a limited set of target
applications. To efficiently execute those applications, they may employ Application
Specific Instruction Set Processors (ASIPs) enriched with carefully designed Instruction
Set Extensions (ISEs) implemented in dedicated hardware blocks. The primary goal
when designing ISEs is efficiency, i.e. the highest possible speedup, which implies
synthesizing all critical computational kernels of the application dataflow graphs as
an Application Specific Integrated Circuit (ASIC). Yet, this can lead to high on-chip
area dedicated solely to ISEs. One existing approach to decrease this area by paying
a reasonable price of decreased efficiency is to perform datapath merging on input
dataflow graphs (DFGs) prior to generating the ASIC.
It is often neglected that even higher costs can be accidentally incurred due to the lack
of flexibility of such ISEs. Namely, if late design changes or specification upgrades happen,
significant time-to-market delays and non-recurring costs for redesigning the ISEs
and the corresponding ASIPs become inevitable. This thesis shows that flexibility and
efficiency are not mutually exclusive. It demonstrates that it is possible to introduce a
limited amount of hardware flexibility during the design process, such that the resulting
datapath is in fact reconfigurable and thus can execute not only the applications known
at design time, but also other applications belonging to the same application-domain.
In other words, it proposes a methodology for designing domain-specific reconfigurable
arrays out of a limited set of input applications. The experimental results show that
resulting arrays are usually around 2× larger and 2× slower than ISEs synthesized using
datapath merging, which have practically no flexibility beyond the design set of DFGs.
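The datapath-merging idea described above, overlaying the dataflow graphs of several target applications so that operators of the same type share one hardware unit, can be caricatured in a few lines. The following Python sketch is deliberately minimal and hypothetical: it maps nodes greedily by operation type and ignores edge compatibility entirely, whereas real datapath merging must also merge interconnect and insert multiplexers.

```python
# Minimal illustration of datapath merging: overlay two DFGs so that
# nodes with the same operation type share a single hardware unit.
# Edges/interconnect are ignored here; real merging also handles
# MUX insertion between shared operators.

def merge_datapaths(dfg_a, dfg_b):
    """Each DFG is a dict: node name -> operation type.
    Returns (mapping of dfg_b nodes onto dfg_a nodes, merged node set)."""
    mapping = {}
    used = set()
    for b_node, b_op in dfg_b.items():
        # greedily reuse the first not-yet-shared unit of the same op type
        for a_node, a_op in dfg_a.items():
            if a_op == b_op and a_node not in used:
                mapping[b_node] = a_node
                used.add(a_node)
                break
    merged = dict(dfg_a)
    for b_node, b_op in dfg_b.items():
        if b_node not in mapping:
            merged[("b", b_node)] = b_op  # no sharing possible: add a new unit
    return mapping, merged

a = {"n1": "add", "n2": "mul", "n3": "add"}
b = {"m1": "mul", "m2": "add", "m3": "sub"}
mapping, merged = merge_datapaths(a, b)
# the merged datapath needs 4 units (add, mul, add, sub) instead of 6
```

The area saving comes precisely from `merged` being smaller than the two input graphs combined; the price, as the abstract notes, is the multiplexing and control logic that sharing requires.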
Towards Closing the Programmability-Efficiency Gap using Software-Defined Hardware
The past decade has seen the breakdown of two important trends in the computing industry: Moore's law, an observation that the number of transistors in a chip roughly doubles every eighteen months, and Dennard scaling, which enabled the use of these transistors within a constant power budget. This has caused a surge in domain-specific accelerators, i.e. specialized hardware that delivers significantly better energy efficiency than general-purpose processors, such as CPUs. While the performance and efficiency of such accelerators are highly desirable, the fast pace of algorithmic innovation and non-recurring engineering costs have deterred their widespread use, since they are only programmable across a narrow set of applications. This has engendered a programmability-efficiency gap across contemporary platforms.
A practical solution that can close this gap is thus lucrative and is likely to engender broad impact in both academic research and the industry. This dissertation proposes such a solution with a reconfigurable Software-Defined Hardware (SDH) system that morphs parts of the hardware on-the-fly to tailor to the requirements of each application phase. This system is designed to deliver near-accelerator-level efficiency across a broad set of applications, while retaining CPU-like programmability.
The dissertation first presents a fixed-function solution to accelerate sparse matrix multiplication, which forms the basis of many applications in graph analytics and scientific computing. The solution consists of a tiled hardware architecture, co-designed with the outer product algorithm for Sparse Matrix-Matrix multiplication (SpMM), that uses on-chip memory reconfiguration to accelerate each phase of the algorithm. A proof-of-concept is then presented in the form of a prototyped 40 nm Complementary Metal-Oxide Semiconductor (CMOS) chip that demonstrates energy efficiency and performance per die area improvements of 12.6x and 17.1x over a high-end CPU, and serves as a stepping stone towards a full SDH system.
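The outer-product formulation referenced above computes C = Σ_k A[:, k] ⊗ B[k, :] in two phases: a multiply phase that produces one set of partial products per shared index k, and a merge phase that accumulates them. A minimal Python sketch of that two-phase structure, using plain dictionaries of nonzeros rather than any real sparse format or the accelerator's actual data layout:

```python
# Sketch of outer-product SpMM: C = sum over k of (column k of A) outer
# (row k of B). Phase 1 multiplies, phase 2 merges partial products.
# {(row, col): value} dicts stand in for real sparse storage formats.

def spmm_outer_product(A, B):
    """A, B: {(row, col): value} sparse maps. Returns C = A @ B."""
    cols_of_A = {}
    rows_of_B = {}
    for (i, k), v in A.items():
        cols_of_A.setdefault(k, []).append((i, v))
    for (k, j), v in B.items():
        rows_of_B.setdefault(k, []).append((j, v))
    # multiply phase: one partial-product list per shared index k
    partials = []
    for k in cols_of_A:
        if k in rows_of_B:
            partials.append([((i, j), av * bv)
                             for i, av in cols_of_A[k]
                             for j, bv in rows_of_B[k]])
    # merge phase: accumulate all partial products into C
    C = {}
    for pp in partials:
        for key, val in pp:
            C[key] = C.get(key, 0) + val
    return C

A = {(0, 0): 1, (0, 1): 2, (1, 1): 3}
B = {(0, 0): 4, (1, 0): 5, (1, 1): 6}
C = spmm_outer_product(A, B)
```

The appeal for hardware is that the two phases have very different access patterns (streaming multiplies versus irregular accumulation), which is what makes per-phase on-chip memory reconfiguration worthwhile.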
The next piece of the dissertation enhances the proposed hardware with reconfigurability of the dataflow and resource sharing modes, in order to extend acceleration support to a set of common parallelizable workloads. This reconfigurability lends the system the ability to cater to discrete data access and compute patterns, such as workloads with extensive data sharing and reuse, workloads with limited reuse and streaming access patterns, among others. Moreover, this system incorporates commercial cores and a prototyped software stack for CPU-level programmability. The proposed system is evaluated on a diverse set of compute-bound and memory-bound kernels that compose applications in the domains of graph analytics, machine learning, image and language processing. The evaluation shows average performance and energy-efficiency gains of 5.0x and 18.4x over the CPU.
The final part of the dissertation proposes a runtime control framework that uses low-cost monitoring of hardware performance counters to predict the next best configuration and reconfigure the hardware, upon detecting a change in phase or nature of data within the application. In comparison to prior work, this contribution targets multicore CGRAs, uses low-overhead decision tree based predictive models, and incorporates reconfiguration cost-awareness into its policies. Compared to the best-average static (non-reconfiguring) configuration, the dynamically reconfigurable system achieves a 1.6x improvement in performance-per-Watt in the Energy-Efficient mode of operation, or the same performance with 23% lower energy in the Power-Performance mode, for SpMM across a suite of real-world inputs. The proposed reconfiguration mechanism itself outperforms the state-of-the-art approach for dynamic runtime control by up to 2.9x in terms of energy-efficiency.
Ph.D., Computer Science & Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/169859/1/subh_1.pd
Customizing the Computation Capabilities of Microprocessors.
Designers of microprocessor-based systems must constantly improve
performance and increase computational efficiency in their designs to
create value. To this end, it is increasingly common to see
computation accelerators in general-purpose processor
designs. Computation accelerators collapse portions of an
application's dataflow graph, reducing the critical path of
computations, easing the burden on processor resources, and reducing
energy consumption in systems. There are many problems associated with
adding accelerators to microprocessors, though. Design of
accelerators, architectural integration, and software support all
present major challenges.
This dissertation tackles these challenges in the context of
accelerators targeting acyclic and cyclic patterns of
computation. First, a technique to identify critical computation
subgraphs within an application set is presented. This technique is
hardware-cognizant and effectively generates a set of instruction set
extensions given a domain of target applications. Next, several
general-purpose accelerator structures are quantitatively designed
using critical subgraph analysis for a broad application set.
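The critical-subgraph identification above can be caricatured as scoring candidate subgraphs by the cycles they save multiplied by how often they execute. The cost model, the cycle counts, and the assumption that a candidate collapses into a single-cycle accelerator invocation below are all invented placeholders for illustration; the thesis's actual technique is hardware-cognizant and considerably richer.

```python
# Toy ranking of candidate computation subgraphs for acceleration:
# score = (software cycles saved) * (execution frequency).
# Cycle counts and the one-cycle accelerator assumption are placeholders.

SW_CYCLES = {"add": 1, "mul": 3, "shift": 1}

def subgraph_score(ops, frequency, accel_cycles=1):
    """ops: operation names in a candidate subgraph, assumed to collapse
    into one accelerator invocation costing `accel_cycles`."""
    sw = sum(SW_CYCLES[op] for op in ops)
    return (sw - accel_cycles) * frequency

def pick_best(candidates):
    """candidates: list of (ops, frequency) pairs; returns the best one."""
    return max(candidates, key=lambda c: subgraph_score(*c))

candidates = [(["add", "mul"], 10), (["shift"], 100)]
best = pick_best(candidates)
```

Even this toy version shows why frequency alone is not enough: the hot single-shift candidate saves nothing once invocation cost is charged, while the rarer add-multiply chain does.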
The next challenge is architectural integration of
accelerators. Traditionally, software invokes accelerators by
statically encoding new instructions into the application binary. This
is incredibly costly, though, requiring many portions of hardware and
software to be redesigned. This dissertation develops strategies to
utilize accelerators, without changing the instruction set. In the
proposed approach, the microarchitecture translates applications at
run-time, replacing computation subgraphs with microcode to utilize
accelerators. We explore the tradeoffs in performing difficult aspects
of the translation at compile-time, while retaining run-time
replacement. This culminates in a simple microarchitectural interface
that supports a plug-and-play model for integrating accelerators into
a pre-designed microprocessor.
Software support is the last challenge in dealing with computation
accelerators. The primary issue is difficulty in generating
high-quality code utilizing accelerators. Hand-written assembly code
is standard in industry, and if compiler support does exist, simple
greedy algorithms are common. In this work, we investigate more
thorough techniques for compiling for computation accelerators. Where
greedy heuristics only explore one possible solution, the techniques
in this dissertation explore the entire design space, when
possible. Intelligent pruning methods ensure that compilation is both
tractable and scalable.
Ph.D., Computer Science & Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/57633/2/ntclark_1.pd
Generation of Application Specific Hardware Extensions for Hybrid Architectures: The Development of PIRANHA - A GCC Plugin for High-Level-Synthesis
Architectures combining a field programmable gate array (FPGA) and a general-purpose processor on a single chip became increasingly popular in recent years. On the one hand, such hybrid architectures facilitate the use of application specific hardware accelerators that improve the performance of the software on the host processor. On the other hand, they oblige system designers to handle the whole process of hardware/software co-design. The complexity of this process is still one of the main reasons that hinder the widespread use of hybrid architectures. Thus, an automated process that aids programmers with the hardware/software partitioning and the generation of application specific accelerators is an important issue. The method presented in this thesis requires neither restrictions of the used high-level language nor special source code annotations, both of which are usually an entry barrier for programmers without a deeper understanding of the underlying hardware platform.
This thesis introduces a seamless programming flow that allows generating hardware accelerators for unrestricted, legacy C code. The implementation consists of a GCC plugin that automatically identifies application hot-spots and generates hardware accelerators accordingly. Apart from the accelerator implementation in a hardware description language, the compiler plugin provides the generation of host processor interfaces and, if necessary, a prototypical integration with the host operating system. An evaluation with typical embedded applications shows general benefits of the approach, but also reveals limiting factors that hamper possible performance improvements.
Description and Specialization of Coarse-grained Reconfigurable Architectures
The functionality of electronic embedded systems, such as mobile phones and digital cameras, becomes more complex with each product generation. This increasing complexity implies great challenges in the design phase of these devices, as designers have to deal with high performance and low energy requirements at a low production budget. In recent years, coarse-grained, dynamically reconfigurable computer systems have increasingly gained in importance as an alternative to cope with these challenges, because they provide an optimal trade-off between flexibility after production and performance. Like general-purpose processors, coarse-grained reconfigurable systems can be quickly reprogrammed to perform new tasks, but they keep their performance and energy consumption near to ASIC standards. The design of coarse-grained reconfigurable processors is the main theme of this work.
In the first part of this dissertation, I present a new architecture description language that was designed for the description of coarse-grained, reconfigurable systems. This language allows an efficient specification of processor arrays and the description of scalable interconnection networks. The second part of this dissertation investigates the specialization of coarse-grained reconfigurable processors towards an application domain by using custom instruction sets. This work presents methods, techniques, and tools to recognize and extract clusters of operations from a set of application. These clusters serve as patterns for the design of an optimal custom instruction set. Experiments and results are presented, which analyze and assess the impact of custom instructions on coarse-grained processor arrays.Die Funktionalität eingebetteter Systeme wie Mobiltelefone und digitale Foto-Kameras wird zunehmend umfangreicher und bürdet dem Entwurf dieser Geräte hohe Herausforderungen auf, wie z.B. hohe Ausführungsgeschwindigkeit, niedrige Herstellungskosten und geringeren Energieverbrauch. Um diese Herausforderungen zu bewältigen, gewinnen grobgranulare dynamische rekonfigurierbare Rechnersysteme schnell an Bedeutung, denn sie bieten einen optimalen trade-off zwischen Flexibilität nach der Herstellung und Performanz. Wie allgemeine Prozessoren, können grobgranulare rekonfigurierbare Systeme während der Ausführungszeit schnell umprogrammiert werden, um neue Funktionalitäten auszuführen, behalten aber immer noch eine ASIC-ähnliche Performanz und Verlustleistungsverbrauch. Der Entwurf grobgranularer rekonfigurierbarer Bausteine ist das Thema dieser Dissertation.
Im ersten Teil dieser Dissertation wird eine Sprache vorgestellt, die für die Beschreibung grobgranularer rekonfigurierbarer Systeme entwickelt wurde. Diese Sprache ermöglicht eine effiziente Spezifikation von Prozessoren-Arrays und die Beschreibung skalierbarer Netzwerkverbindungen. Der zweite Teil untersucht die Anpassung grobgranularer rekonfigurierbarer Bausteine an Anwendungssätze mittels spezialisierter Befehle. Methoden werden vorgestellt zur Erkennung und Extraktion von Operationsmustern aus einem Anwendungssatz. Diese Operationsmuster dienen dann zum Entwurf eines optimalen spezialisierten Befehlsatzes. Als Ergebnisse werden die Wirkungen von spezialisierten Befehlsätzen in grobgranularen Arrays analysiert und bewertet
Modeling and automated synthesis of reconfigurable interfaces
Stefan Ihmor. Paderborn, Univ., Diss., 200
Achieving a better balance between productivity and performance on FPGAs through Heterogeneous Extensible Multiprocessor Systems
Field Programmable Gate Arrays (FPGAs) were first introduced circa 1980, and they held the promise of delivering performance levels associated with customized circuits, but with productivity levels more closely associated with software development. Achieving both performance and productivity objectives has been a long-standing challenge for the reconfigurable computing community and remains unsolved today. On one hand, vendor-supplied design flows have tended towards achieving high levels of performance through gate-level customization, but at the cost of very low productivity. On the other hand, FPGA densities are following Moore's law and can now support complete multiprocessor system architectures. Thus FPGAs can be turned into an architecture with programmable processors, which brings productivity but sacrifices the peak performance advantages of custom circuits. In this thesis we explore how the two use cases can be combined to achieve the best of both.
The flexibility of FPGAs to host a heterogeneous multiprocessor system with different types of programmable processors and custom accelerators allows software developers to design a platform that matches the unique performance needs of their application. However, no automated approaches are currently publicly available to create such heterogeneous architectures, nor the software support for these platforms. Creating base architectures, configuring multiple tool chains, and repetitive engineering design efforts can and should be automated. This thesis introduces the Heterogeneous Extensible Multiprocessor System (HEMPS) template approach, which allows an FPGA to be programmed with productivity levels close to those associated with parallel processing, and with performance levels close to those associated with customized circuits. The work in this thesis introduces an ArchGen script to automate the generation of HEMPS systems, as well as a library of portable and self-tuning polymorphic functions. These tools abstract away the HW/SW co-design details and provide a transparent programming language to capture different levels of parallelism, without sacrificing productivity or portability.
Design Space Exploration and Resource Management of Multi/Many-Core Systems
The increasing demand for processing a higher number of applications and related data on computing platforms has resulted in reliance on multi-/many-core chips, as they facilitate parallel processing. However, there is a desire for these platforms to be energy-efficient and reliable, and they need to perform secure computations in the interest of the whole community. This book provides perspectives on the aforementioned aspects from leading researchers in terms of state-of-the-art contributions and upcoming trends.
Increasing the efficacy of automated instruction set extension
The use of Instruction Set Extension (ISE) in customising embedded processors for a specific
application has been studied extensively in recent years. The addition of a set of complex
arithmetic instructions to a baseline core has proven to be a cost-effective means of meeting
design performance requirements. This thesis proposes and evaluates a reconfigurable ISE
implementation called “Configurable Flow Accelerators” (CFAs), a number of refinements to
an existing Automated ISE (AISE) algorithm called “ISEGEN”, and the effects of source form
on AISE.
The CFA is demonstrated repeatedly to be a cost-effective design for ISE implementation.
A temporal partitioning algorithm called “staggering” is proposed and demonstrated on average
to reduce the area of CFA implementation by 37% for only an 8% reduction in acceleration.
This thesis then turns to concerns within the ISEGEN AISE algorithm. A methodology
for finding a good static heuristic weighting vector for ISEGEN is proposed and demonstrated.
Up to 100% of merit is shown to be lost or gained through the choice of vector. ISEGEN
early-termination is introduced and shown to improve the runtime of the algorithm by up to
7.26x, and 5.82x on average. An extension to the ISEGEN heuristic to account for pipelining
is proposed and evaluated, increasing acceleration by up to an additional 1.5x. An energy-aware
heuristic is added to ISEGEN, which reduces the energy used by a CFA implementation
of a set of ISEs by an average of 1.6x, up to 3.6x. This result directly contradicts the frequently
espoused notion that “bigger is better” in ISE.
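To make the weighting-vector sensitivity concrete, here is a generic weighted merit function of the kind such heuristics combine. The candidate attributes, weights, and numbers below are invented purely for illustration and are not ISEGEN's actual heuristic terms.

```python
# Generic weighted merit function for ISE candidates (illustrative only):
# a weighting vector w trades off estimated speedup against area and
# I/O port cost. Different vectors can flip the ranking of the same
# candidates, which is why the choice of vector matters so much.

def merit(candidate, w):
    """candidate: dict with estimated 'speedup', 'area', 'io_ports';
    w: weighting vector (three floats)."""
    return (w[0] * candidate["speedup"]
            - w[1] * candidate["area"]
            - w[2] * candidate["io_ports"])

c1 = {"speedup": 4.0, "area": 2.0, "io_ports": 3}  # big, fast candidate
c2 = {"speedup": 3.0, "area": 0.5, "io_ports": 1}  # small, modest candidate
```

Ignoring area (w = (1, 0, 0)) favors the large candidate c1, while charging for area (w = (1, 1, 0)) favors the small one, mirroring the observation above that an ill-chosen static vector can forfeit all of the achievable merit.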
The last stretch of work in this thesis is concerned with source-level transformation: the effect
of changing the representation of the application on the quality of the combined hardware-software
solution. A methodology for combined exploration of source transformation and ISE
is presented, and demonstrated to improve the acceleration of the result by an average of 35%
versus ISE alone. Floating point is demonstrated to perform worse than fixed point, for all
design concerns and applications studied here, regardless of the ISEs employed.
- …