6 research outputs found
Exploiting FPGA-aware merging of custom instructions for runtime reconfiguration
Runtime reconfiguration is a promising solution for reducing hardware cost in embedded systems, without compromising on performance. We present a framework that aims to increase the performance benefits of reconfigurable processors that support full or partial runtime reconfiguration. The proposed framework achieves this by: (1) providing a means for choosing suitable custom instruction selection heuristics, (2) leveraging FPGA-aware merging of custom instructions to maximize the reconfigurable logic block utilization in each configuration, and (3) incorporating a hierarchical loop partitioning strategy to reduce runtime reconfiguration overhead. We show that the performance gain can be improved by employing suitable custom instruction selection heuristics that, in turn, depend on the reconfigurable resource constraints and the merging factor (extent to which the selected custom instructions can be merged). The hierarchical loop partitioning strategy leads to an average performance gain of over 31% and 46% for full and partial runtime reconfiguration, respectively. Performance gain can be further increased to over 52% and 70% for full and partial runtime reconfiguration, respectively, by exploiting FPGA-aware merging of custom instructions.
Using bus-based connections to improve field-programmable gate array density for implementing datapath circuits
As the logic capacity of field-programmable gate arrays (FPGAs) increases, they are increasingly being used to implement large arithmetic-intensive applications, which often contain a large proportion of datapath circuits. Since datapath circuits usually consist of regularly structured components (called bitslices) which are connected together by regularly structured signals (called buses), it is possible to utilize datapath regularity in order to achieve significant area savings through FPGA architectural innovations. This paper describes such an FPGA routing architecture, called the multibit routing architecture, which employs bus-based connections in order to exploit datapath regularity. It is experimentally shown that, compared to conventional FPGA routing architectures, the multibit routing architecture can achieve 14% routing area reduction for implementing datapath circuits, which represents an overall FPGA area savings of 10%. This paper also empirically determines the best values of several important architectural parameters for the new routing architecture, including the most area-efficient granularity values and the most area-efficient proportion of bus-based connections. Index Terms: area efficiency, datapath regularity, field-programmable gate arrays (FPGAs), reconfigurable fabric, routing architecture.
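The two area figures quoted above can be cross-checked with a short calculation. The sketch below assumes, purely for illustration, that the non-routing area is unchanged by the new architecture; the abstract does not state this, and the function name is hypothetical.

```python
def implied_routing_fraction(routing_reduction: float, overall_saving: float) -> float:
    """Share of total FPGA area that routing would have to occupy for the two
    quoted savings to agree, ASSUMING non-routing area stays the same."""
    return overall_saving / routing_reduction

# Figures from the abstract: 14% routing area reduction, 10% overall area saving.
fraction = implied_routing_fraction(0.14, 0.10)
print(f"implied routing share of total area: {fraction:.0%}")
```

Under that assumption, routing would account for roughly seven tenths of the total area, which is why a saving confined to the routing fabric still shows up strongly in the overall figure.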
Hybrid FPGA: Architecture and Interface
Hybrid FPGAs (Field Programmable Gate Arrays) are composed of general-purpose logic resources
with different granularities, together with domain-specific coarse-grained units. This thesis proposes
a novel hybrid FPGA architecture with embedded coarse-grained Floating Point Units (FPUs) to
improve the floating point capability of FPGAs. Based on the proposed hybrid FPGA architecture,
we examine three aspects to optimise the speed and area for domain-specific applications.
First, we examine the interface between large coarse-grained embedded blocks (EBs) and fine-grained
elements in hybrid FPGAs. The interface includes parameters for varying: (1) aspect ratio of EBs,
(2) position of the EBs in the FPGA, (3) I/O pins arrangement of EBs, (4) interconnect flexibility of
EBs, and (5) location of additional embedded elements such as memory.
Second, we examine the interconnect structure for hybrid FPGAs. We investigate how large and high-density
EBs affect the routing demand for hybrid FPGAs over a set of domain-specific applications.
We then propose three routing optimisation methods to meet the additional routing demand introduced
by large EBs: (1) identifying the best separation distance between EBs, (2) adding routing switches on
EBs to increase routing flexibility, and (3) introducing wider channel width near the edge of EBs. We
study and compare the trade-offs in delay, area and routability of these three optimisation methods.
Finally, we employ common subgraph extraction to determine the number of floating point adders/subtractors,
multipliers and wordblocks in the FPUs. The wordblocks include registers and can implement fixed
point operations. We study the area, speed and utilisation trade-offs of the selected FPU subgraphs
in a set of floating point benchmark circuits. We develop an optimised coarse-grained FPU, taking
into account both architectural and system-level issues. Furthermore, we investigate the trade-offs
between granularities and performance by composing small FPUs into a large FPU.
The results of this thesis can help designers build a domain-specific hybrid FPGA that meets user requirements,
by optimising for speed, area or a combination of the two.
A design method for application-specific programmable hardware accelerators (Metoda projektovanja namenskih programabilnih hardverskih akceleratora)
Embedded computing systems are most often designed to support the execution of a number
of target applications. To achieve the highest possible efficiency, the use of specialized
processors, Application Specific Instruction Set Processors (ASIPs), is recommended, in which
program instructions execute in dedicated, independent hardware blocks (accelerators).
The main reason for having independent accelerators is to achieve the maximum speedup of
instruction execution. However, this approach requires an integrated (ASIC) circuit to be
designed for each block, which significantly increases the total processor area.
One method for reducing the total area is to apply the datapath merging technique to the
dataflow graphs of the input applications. The result is a single programmable hardware
accelerator capable of executing all the desired instructions. However, this has a negative
effect on system efficiency.
It is often overlooked that, because of the very limited flexibility of ASIC hardware
accelerators, specialized processors have other drawbacks as well. Namely, if the processor
specification is changed, or simply upgraded, in the final design stages, long delays and
additional redesign costs are unavoidable. This thesis shows that the requirements for
flexibility and efficiency need not be mutually exclusive. It demonstrates that a limited
level of hardware flexibility can be introduced during the design process, so that the
resulting hardware accelerator can execute not only the applications defined at the very
start of the design, but also other applications, provided they belong to the same domain.
In other words, the thesis presents a method for designing flexible application-specific
hardware accelerators. Experimental evaluation shows that, in most cases, the resulting
accelerators have at most 2× the area or 2× the delay of accelerators obtained with the
datapath merging method, which provides no additional flexibility at all.
Typically, embedded systems are designed to support a limited set of target
applications. To efficiently execute those applications, they may employ Application
Specific Instruction Set Processors (ASIPs) enriched with carefully designed Instruction
Set Extensions (ISEs) implemented in dedicated hardware blocks. The primary goal
when designing ISEs is efficiency, i.e. the highest possible speedup, which implies
synthesizing all critical computational kernels of the application dataflow graphs as
Application Specific Integrated Circuits (ASICs). Yet, this can lead to high on-chip
area dedicated solely to ISEs. One existing approach to decrease this area by paying
a reasonable price of decreased efficiency is to perform datapath merging on input
dataflow graphs (DFGs) prior to generating the ASIC.
It is often neglected that even higher costs can be accidentally incurred due to the lack
of flexibility of such ISEs. Namely, if late design changes or specification upgrades happen,
significant time-to-market delays and nonrecurrent costs for redesigning the ISEs
and the corresponding ASIPs become inevitable. This thesis shows that flexibility and
efficiency are not mutually exclusive. It demonstrates that it is possible to introduce a
limited amount of hardware flexibility during the design process, such that the resulting
datapath is in fact reconfigurable and thus can execute not only the applications known
at design time, but also other applications belonging to the same application-domain.
In other words, it proposes a methodology for designing domain-specific reconfigurable
arrays out of a limited set of input applications. The experimental results show that
resulting arrays are usually around 2× larger and 2× slower than ISEs synthesized using
datapath merging, which have practically null flexibility beyond the design set of DFGs.
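The area argument behind datapath merging can be illustrated with a deliberately simplified sketch. It assumes that each dataflow graph is reduced to a count of operators per type, and that a merged datapath time-shares its functional units, so it needs only the maximum count per type. The operator counts are hypothetical, and the multiplexers and interconnect that real datapath merging must add are ignored.

```python
from collections import Counter

# Hypothetical operator counts for two dataflow graphs (not taken from the thesis).
dfg_a = Counter({"add": 3, "mul": 2, "shift": 1})
dfg_b = Counter({"add": 1, "mul": 4, "cmp": 2})

def separate_units(*dfgs: Counter) -> Counter:
    """One dedicated hardware block per DFG: functional-unit counts add up."""
    total = Counter()
    for graph in dfgs:
        total += graph
    return total

def merged_units(*dfgs: Counter) -> Counter:
    """One shared datapath: each operator type needs only its largest count."""
    merged = Counter()
    for graph in dfgs:
        merged |= graph  # Counter union keeps the per-key maximum
    return merged

print(sum(separate_units(dfg_a, dfg_b).values()), "units without merging")
print(sum(merged_units(dfg_a, dfg_b).values()), "units after merging")
```

The merged datapath uses fewer functional units but can run only one graph at a time, which is the efficiency price the abstract refers to.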
Circuit design and analysis for on-FPGA communication systems
On-chip communication has emerged as a prominent subject in Very-Large-Scale-Integration
(VLSI) design, as the trend of technology scaling favours logic more than interconnects.
Interconnects often dictate the system performance, and, therefore, research into new
methodologies and system architectures that deliver high-performance communication services
across the chip is mandatory. The interconnect challenge is exacerbated in the Field-Programmable
Gate Array (FPGA), a type of ASIC where the hardware can be programmed post-fabrication.
Communication across an FPGA deteriorates as a result of interconnect scaling. The programmable
fabrics, switches and the specific routing architecture also introduce additional latency
and bandwidth degradation, further hindering intra-chip communication performance.
Past research efforts mainly focused on optimizing logic elements and functional units in FPGAs.
Communication over programmable interconnect has received little attention and is inadequately understood.
This thesis is among the first to research on-chip communication systems that are built on
top of programmable fabrics, and it proposes methodologies to maximize interconnect throughput.
There are three major contributions in this thesis: (i) an analysis of on-chip
interconnect fringing, which degrades the bandwidth of communication channels due to routing
congestion in reconfigurable architectures; (ii) a new analogue wave signalling scheme that significantly
improves the interconnect throughput by exploiting the fundamental electrical characteristics
of the reconfigurable interconnect structures, and which can potentially mitigate
the interconnect scaling challenges; and (iii) a novel Dynamic Programming (DP)-network that provides
adaptive routing in network-on-chip (NoC) systems. The DP-network architecture performs runtime
optimization for route planning and dynamic routing, which effectively utilizes the in-silicon
bandwidth. This thesis explores a new horizon in reconfigurable system design, in which new
methodologies and concepts are proposed to enhance the on-FPGA communication throughput
that is of vital importance in new technology processes.
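As a rough illustration of dynamic-programming route planning, the sketch below applies a generic shortest-path recurrence to a 2D mesh: each node's cost to the destination is the minimum, over its neighbours, of the link cost plus that neighbour's cost. This is a software sketch under assumed names (`dp_route_costs`, `link_cost`) and an assumed mesh topology; it is not the DP-network circuit described in the thesis.

```python
def dp_route_costs(width, height, dest, link_cost):
    """Relax cost[n] = min_m(link_cost(n, m) + cost[m]) until nothing changes.

    Returns the cost-to-destination of every node and the chosen next hop.
    link_cost must return positive values so the iteration terminates.
    """
    inf = float("inf")
    cost = {(x, y): inf for x in range(width) for y in range(height)}
    next_hop = {}
    cost[dest] = 0.0
    changed = True
    while changed:
        changed = False
        for node in cost:
            x, y = node
            for nbr in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
                if nbr not in cost:
                    continue  # neighbour lies outside the mesh
                candidate = link_cost(node, nbr) + cost[nbr]
                if candidate < cost[node]:
                    cost[node] = candidate
                    next_hop[node] = nbr
                    changed = True
    return cost, next_hop

def congested(src, dst):
    """Hypothetical link costs: every hop costs 1 except one congested link."""
    return 10.0 if (src, dst) == ((0, 0), (1, 0)) else 1.0

cost, next_hop = dp_route_costs(3, 3, dest=(2, 2), link_cost=congested)
print(cost[(0, 0)], next_hop[(0, 0)])
```

Because the costs are recomputed from current link conditions, a congested link simply becomes more expensive and traffic is steered around it, which is the adaptive behaviour the abstract describes.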
Synaptic weight modification and storage in hardware neural networks
In 2011 the International Technology Roadmap for Semiconductors, ITRS 2011, outlined how the semiconductor industry should proceed to pursue Moore’s Law past the 18 nm generation. It envisioned a concept of ‘More than Moore’, in which existing semiconductor technologies can be exploited to enable the fabrication of diverse systems and in particular systems which integrate non-digital and biologically based functionality. A rapid expansion and growing interest in the fields of microbiology, electrophysiology, and computational neuroscience occurred. This activity has provided significant understanding and insight into the function and structure of the human brain, leading to the creation of systems which mimic the operation of the biological nervous system. As these systems expand, a need has been established for small-area, low-power devices that replicate the important biological features of neural networks, so that large-scale networks can be implemented. This thesis presents work that focuses on the modification and storage of synaptic weights in hardware neural networks. Test devices were incorporated on 3 chip runs; each chip was fabricated in a 0.35 µm process from Austria MicroSystems (AMS) and used for parameter extraction, in accordance with the theoretical analysis presented. A compact circuit is presented which can implement spike-timing-dependent plasticity (STDP), and has advantages over current implementations in that the critical timing window for synaptic modification is implemented within the circuit. The duration of the critical timing window is set by the subthreshold current controlled by the voltage, Vleak, applied to transistor Mleak in the circuit. A physical model to predict the time window for plasticity to occur is formulated and the effects of process variations on the window are analysed.
The STDP circuit is implemented using two dedicated circuit blocks, one for potentiation and one for depression, where each block consists of 4 transistors and a polysilicon capacitor and occupies an area of 980 µm². SpectreS simulations of the back-annotated layout of the circuit and experimental results indicate that STDP with biologically plausible critical timing windows over the range 10 µs to 100 ms can be implemented. Theoretical analysis using parameters extracted from MOS test devices is used to describe the operation of each device and circuit presented. Simulation results and results obtained from fabricated devices confirm the validity of these designs and approaches. Both the WP and WD circuits have a power consumption of approximately 2.4 mW during a weight update. If no weight update occurs, the resting currents within the device are in the nA range, thus each circuit has a power consumption of approximately 1 µW. A floating gate, FG, device fabricated using a standard CMOS process is presented. This device is to be integrated with both the WP and WD STDP circuits. The FG device is designed to store negative charge on a FG to represent the synaptic weight of the associated synapse. Charge is added or removed from the FG via Fowler-Nordheim tunnelling. This thesis outlines the design criteria and theoretical operation of this device. A model of the charge storage characteristics is presented and verified using HFCV and PCV experimental results. Limited precision weights (LPW) and their potential use in hardware neural networks are also considered. LPW offers a potential solution in the quest to design a compact FG device for use with CTS. The algorithms presented in this thesis show that LPW allows for a reduction in the synaptic weight storage device while permitting the network to function as intended.
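The dependence of the timing window on Vleak can be sketched with a first-order model. It assumes that the window is the time a subthreshold leak current takes to discharge a capacitor by a fixed voltage swing, and that the current grows exponentially with Vleak. The model form and every parameter value (`i0`, `slope_factor`, `capacitance`, `delta_v`) are illustrative assumptions, not values or equations from the thesis.

```python
import math

THERMAL_VOLTAGE = 0.0258  # volts, approximately kT/q near 300 K

def subthreshold_current(v_leak, i0=1e-15, slope_factor=1.5):
    """Assumed subthreshold law: current rises exponentially with gate voltage."""
    return i0 * math.exp(v_leak / (slope_factor * THERMAL_VOLTAGE))

def timing_window(v_leak, capacitance=1e-12, delta_v=1.0):
    """Time in seconds for the leak current to move the capacitor by delta_v volts."""
    return capacitance * delta_v / subthreshold_current(v_leak)

# Sweep hypothetical Vleak values to show how strongly the window depends on them.
for v_leak in (0.20, 0.30, 0.40):
    print(f"Vleak = {v_leak:.2f} V -> window = {timing_window(v_leak):.3e} s")
```

In this model a small change in Vleak shifts the window by orders of magnitude, which is consistent with a single bias voltage covering a wide range of window durations.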