6 research outputs found

    Exploiting FPGA-aware merging of custom instructions for runtime reconfiguration

    Runtime reconfiguration is a promising solution for reducing hardware cost in embedded systems without compromising performance. We present a framework that aims to increase the performance benefits of reconfigurable processors that support full or partial runtime reconfiguration. The proposed framework achieves this by: (1) providing a means for choosing suitable custom instruction selection heuristics, (2) leveraging FPGA-aware merging of custom instructions to maximize the reconfigurable logic block utilization in each configuration, and (3) incorporating a hierarchical loop partitioning strategy to reduce runtime reconfiguration overhead. We show that the performance gain can be improved by employing suitable custom instruction selection heuristics that, in turn, depend on the reconfigurable resource constraints and the merging factor (the extent to which the selected custom instructions can be merged). The hierarchical loop partitioning strategy leads to an average performance gain of over 31% and 46% for full and partial runtime reconfiguration, respectively. The performance gain can be further increased to over 52% and 70% for full and partial runtime reconfiguration, respectively, by exploiting FPGA-aware merging of custom instructions.
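
    As a concrete illustration of the kind of selection heuristic this abstract refers to, the sketch below greedily picks candidate custom instructions by profiled speedup per unit of reconfigurable area until an area budget is exhausted. The data model, field names, and the greedy ranking are assumptions for illustration only, not the heuristics evaluated in the paper.

```python
from dataclasses import dataclass

@dataclass
class CustomInstruction:
    name: str
    cycles_saved: int   # estimated cycles saved per invocation (hypothetical metric)
    exec_count: int     # profiled execution frequency of the enclosing kernel
    area_blocks: int    # reconfigurable logic blocks the instruction occupies

def select_custom_instructions(candidates, area_budget):
    """Greedy heuristic: rank candidates by speedup per unit area and
    pick them until the reconfigurable-area budget is exhausted."""
    ranked = sorted(
        candidates,
        key=lambda c: (c.cycles_saved * c.exec_count) / c.area_blocks,
        reverse=True,
    )
    chosen, used = [], 0
    for c in ranked:
        if used + c.area_blocks <= area_budget:
            chosen.append(c)
            used += c.area_blocks
    return chosen
```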

    Using bus-based connections to improve field-programmable gate array density for implementing datapath circuits

    As the logic capacity of field-programmable gate arrays (FPGAs) increases, they are increasingly being used to implement large arithmetic-intensive applications, which often contain a large proportion of datapath circuits. Since datapath circuits usually consist of regularly structured components (called bitslices) connected together by regularly structured signals (called buses), it is possible to exploit datapath regularity to achieve significant area savings through FPGA architectural innovations. This paper describes such an FPGA routing architecture, called the multibit routing architecture, which employs bus-based connections in order to exploit datapath regularity. It is experimentally shown that, compared to conventional FPGA routing architectures, the multibit routing architecture can achieve a 14% routing area reduction for implementing datapath circuits, which represents an overall FPGA area saving of 10%. This paper also empirically determines the best values of several important architectural parameters for the new routing architecture, including the most area-efficient granularity values and the most area-efficient proportion of bus-based connections. Index terms: area efficiency, datapath regularity, field-programmable gate arrays (FPGAs), reconfigurable fabric, routing architecture.
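
    The area saving comes from letting the tracks of a bus share one set of routing configuration bits. The toy model below is a hypothetical estimate, not the paper's area model; it counts configuration bits for a channel in which a given fraction of tracks uses bus-based switches of a chosen granularity.

```python
def routing_config_bits(num_tracks, granularity, bus_fraction):
    """Toy estimate of routing configuration bits per channel when a
    fraction of the tracks is grouped into buses whose switches share
    one set of control bits (one bit per independently programmed switch)."""
    bus_tracks = int(num_tracks * bus_fraction)
    fine_tracks = num_tracks - bus_tracks
    return fine_tracks + bus_tracks / granularity

# Example: 32 tracks, half of them grouped into 4-bit buses.
# routing_config_bits(32, 1, 0.0)  -> 32.0 bits (conventional, fully fine-grained)
# routing_config_bits(32, 4, 0.5)  -> 20.0 bits (16 fine-grained + 16/4 shared)
```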

    Hybrid FPGA: Architecture and Interface

    Hybrid FPGAs (Field Programmable Gate Arrays) are composed of general-purpose logic resources with different granularities, together with domain-specific coarse-grained units. This thesis proposes a novel hybrid FPGA architecture with embedded coarse-grained Floating Point Units (FPUs) to improve the floating-point capability of FPGAs. Based on the proposed hybrid FPGA architecture, we examine three aspects of optimising speed and area for domain-specific applications. First, we examine the interface between large coarse-grained embedded blocks (EBs) and fine-grained elements in hybrid FPGAs. The interface includes parameters for varying: (1) the aspect ratio of EBs, (2) the position of the EBs in the FPGA, (3) the I/O pin arrangement of EBs, (4) the interconnect flexibility of EBs, and (5) the location of additional embedded elements such as memory. Second, we examine the interconnect structure for hybrid FPGAs. We investigate how large, high-density EBs affect the routing demand of hybrid FPGAs over a set of domain-specific applications. We then propose three routing optimisation methods to meet the additional routing demand introduced by large EBs: (1) identifying the best separation distance between EBs, (2) adding routing switches on EBs to increase routing flexibility, and (3) introducing wider channels near the edge of EBs. We study and compare the trade-offs in delay, area and routability of these three optimisation methods. Finally, we employ common subgraph extraction to determine the number of floating-point adders/subtractors, multipliers and wordblocks in the FPUs. The wordblocks include registers and can implement fixed-point operations. We study the area, speed and utilisation trade-offs of the selected FPU subgraphs in a set of floating-point benchmark circuits. We develop an optimised coarse-grained FPU, taking into account both architectural and system-level issues. Furthermore, we investigate the trade-offs between granularity and performance by composing small FPUs into a large FPU. The results of this thesis should help in designing a domain-specific hybrid FPGA to meet user requirements by optimising for speed, area, or a combination of the two.
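
    For the FPU-composition step, a much-simplified stand-in for common subgraph extraction is sketched below: it keeps, per floating-point operation type, the count shared by all benchmark DFGs. The data layout, function name, and the benchmark handles in the comment are assumptions; the thesis itself applies true common subgraph extraction to the DFGs rather than this count-based approximation.

```python
from collections import Counter

def shared_fp_op_counts(benchmark_dfgs):
    """Count-based stand-in for common subgraph extraction: for every
    floating-point operation type, keep the largest count present in
    *all* benchmark DFGs (the multiset intersection of their op counts)."""
    counters = [Counter(op for op, _inputs in dfg) for dfg in benchmark_dfgs]
    shared = counters[0]
    for c in counters[1:]:
        shared &= c  # Counter '&' keeps the minimum count per operation type
    return shared

# e.g. shared_fp_op_counts([bench_a, bench_b, bench_c]) might return
# Counter({'fadd': 2, 'fmul': 2}), suggesting an FPU with 2 adders and 2 multipliers.
```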

    Metoda projektovanja namenskih programabilnih hardverskih akceleratora (A Design Method for Application-Specific Programmable Hardware Accelerators)

    Typically, embedded systems are designed to support a limited set of target applications. To execute those applications efficiently, they may employ Application Specific Instruction Set Processors (ASIPs) enriched with carefully designed Instruction Set Extensions (ISEs) implemented in dedicated hardware blocks. The primary goal when designing ISEs is efficiency, i.e. the highest possible speedup, which implies synthesizing all critical computational kernels of the application dataflow graphs as Application Specific Integrated Circuits (ASICs). Yet this can lead to a large on-chip area dedicated solely to ISEs. One existing approach to decrease this area, at the reasonable price of decreased efficiency, is to perform datapath merging on the input dataflow graphs (DFGs) prior to generating the ASIC. It is often neglected that even higher costs can be accidentally incurred due to the lack of flexibility of such ISEs: if late design changes or specification upgrades happen, significant time-to-market delays and non-recurring costs for redesigning the ISEs and the corresponding ASIPs become inevitable. This thesis shows that flexibility and efficiency are not mutually exclusive. It demonstrates that it is possible to introduce a limited amount of hardware flexibility during the design process, such that the resulting datapath is in fact reconfigurable and can thus execute not only the applications known at design time, but also other applications belonging to the same application domain. In other words, it proposes a methodology for designing domain-specific reconfigurable arrays out of a limited set of input applications. The experimental results show that the resulting arrays are usually only around 2× larger and 2× slower than ISEs synthesized using datapath merging, which offer practically no flexibility beyond the design-time set of DFGs.
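
    A coarse sketch of the resource-sharing intuition behind datapath merging is given below: because only one application's DFG executes at a time, a merged datapath needs, per operation type, only as many functional units as the most demanding DFG, whereas separate ASIC accelerators need the sum. This is an illustrative simplification that ignores interconnect merging and the multiplexers a real merged datapath requires.

```python
from collections import Counter

def merged_unit_counts(dfgs):
    """Per operation type, a merged datapath needs as many functional units
    as the most demanding DFG, since only one application runs at a time
    (multiplexer and wiring overhead are ignored in this sketch)."""
    merged = Counter()
    for dfg in dfgs:
        for op, n in Counter(dfg).items():
            merged[op] = max(merged[op], n)
    return merged

# dfg_a = ['mul', 'mul', 'add'];  dfg_b = ['mul', 'add', 'add', 'sub']
# merged_unit_counts([dfg_a, dfg_b]) -> Counter({'mul': 2, 'add': 2, 'sub': 1})
# Separate ASIC accelerators would instead need the sum: 3 mul, 3 add, 1 sub.
```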

    Circuit design and analysis for on-FPGA communication systems

    On-chip communication has emerged as a prominently important subject in Very-Large-Scale-Integration (VLSI) design, as the trend of technology scaling favours logic more than interconnect. Interconnect often dictates system performance, and therefore research into new methodologies and system architectures that deliver high-performance communication services across the chip is mandatory. The interconnect challenge is exacerbated in the Field-Programmable Gate Array (FPGA), a type of integrated circuit whose hardware can be programmed post-fabrication. Communication across an FPGA deteriorates as a result of interconnect scaling, and the programmable fabric, switches, and the specific routing architecture introduce additional latency and bandwidth degradation, further hindering intra-chip communication performance. Past research efforts mainly focused on optimizing logic elements and functional units in FPGAs; communication over programmable interconnect received little attention and is inadequately understood. This thesis is among the first to research on-chip communication systems built on top of programmable fabrics, and proposes methodologies to maximize interconnect throughput. There are three major contributions in this thesis: (i) an analysis of on-chip interconnect fringing, which degrades the bandwidth of communication channels due to routing congestion in reconfigurable architectures; (ii) a new analogue wave signalling scheme that significantly improves interconnect throughput by exploiting the fundamental electrical characteristics of reconfigurable interconnect structures, and can potentially mitigate the interconnect scaling challenges; and (iii) a novel Dynamic Programming (DP) network that provides adaptive routing in network-on-chip (NoC) systems. The DP-network architecture performs runtime optimization for route planning and dynamic routing, which effectively utilizes the in-silicon bandwidth. This thesis explores a new horizon in reconfigurable system design, in which new methodologies and concepts are proposed to enhance on-FPGA communication throughput, which is of vital importance in new technology processes.
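
    The DP-network contribution rests on dynamic programming over path costs; a minimal software analogue is the Bellman-Ford-style value iteration below, in which every node repeatedly relaxes its estimated cost-to-destination against its neighbours' estimates. The graph encoding and function are illustrative assumptions, not the thesis's hardware implementation.

```python
def dp_route_costs(links, dest, iterations=None):
    """Bellman-Ford-style value iteration: every node keeps an estimated
    cost-to-destination and repeatedly relaxes it against its neighbours,
    mirroring the cost propagation of a dynamic-programming network."""
    nodes = {u for u, _, _ in links} | {v for _, v, _ in links}
    cost = {n: float("inf") for n in nodes}
    cost[dest] = 0.0
    for _ in range(iterations or len(nodes) - 1):
        for u, v, w in links:  # directed link u -> v with weight w (latency/congestion)
            if w + cost[v] < cost[u]:
                cost[u] = w + cost[v]
    return cost  # at node u, forward towards the neighbour v minimising w + cost[v]

# 2x2 mesh with unit link costs, destination node 3:
# links = [(0, 1, 1), (1, 0, 1), (0, 2, 1), (2, 0, 1),
#          (1, 3, 1), (3, 1, 1), (2, 3, 1), (3, 2, 1)]
# dp_route_costs(links, dest=3) -> {0: 2.0, 1: 1.0, 2: 1.0, 3: 0.0}
```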

    Synaptic weight modification and storage in hardware neural networks

    In 2011 the International Technology Roadmap for Semiconductors (ITRS 2011) outlined how the semiconductor industry should proceed to pursue Moore’s Law past the 18 nm generation. It envisioned a concept of ‘More than Moore’, in which existing semiconductor technologies can be exploited to enable the fabrication of diverse systems, in particular systems that integrate non-digital and biologically based functionality. The fields of microbiology, electrophysiology, and computational neuroscience have expanded rapidly and attracted growing interest. This activity has provided significant understanding of, and insight into, the function and structure of the human brain, leading to the creation of systems which mimic the operation of the biological nervous system. As these systems grow, a need has emerged for small-area, low-power devices that replicate the important biological features of neural networks, in order to implement large-scale networks. This thesis presents work focused on the modification and storage of synaptic weights in hardware neural networks. Test devices were incorporated on three chip runs; each chip was fabricated in a 0.35 µm process from Austria MicroSystems (AMS) and used for parameter extraction, in accordance with the theoretical analysis presented. A compact circuit is presented which can implement spike-timing-dependent plasticity (STDP), and which has an advantage over current implementations in that the critical timing window for synaptic modification is implemented within the circuit. The duration of the critical timing window is set by the subthreshold current controlled by the voltage Vleak applied to transistor Mleak in the circuit. A physical model to predict the time window for plasticity is formulated, and the effects of process variations on the window are analysed. The STDP circuit is implemented using two dedicated circuit blocks, one for potentiation and one for depression; each block consists of 4 transistors and a polysilicon capacitor and occupies an area of 980 µm². SpectreS simulations of the back-annotated layout of the circuit, together with experimental results, indicate that STDP with biologically plausible critical timing windows over the range 10 µs to 100 ms can be implemented. Theoretical analysis using parameters extracted from MOS test devices is used to describe the operation of each device and circuit presented, and simulation results together with results obtained from fabricated devices confirm the validity of these designs and approaches. Both the WP and WD circuits have a power consumption of approximately 2.4 mW during a weight update. If no weight update occurs, the resting currents within the device are in the nA range, so each circuit then consumes approximately 1 µW. A floating-gate (FG) device fabricated using a standard CMOS process is presented. This device is to be integrated with both the WP and WD STDP circuits. The FG device is designed to store negative charge on a floating gate to represent the synaptic weight of the associated synapse; charge is added to or removed from the FG via Fowler-Nordheim tunnelling. This thesis outlines the design criteria and theoretical operation of this device, and a model of the charge storage characteristics is presented and verified against HFCV and PCV experimental results. Limited-precision weights (LPW) and their potential use in hardware neural networks are also considered. LPW offers a potential solution in the quest to design a compact FG device for use with CTS. The algorithms presented in this thesis show that LPW allows for a reduction in the size of the synaptic weight storage device while permitting the network to function as intended.
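
    For readers unfamiliar with STDP, the standard pair-based rule that such circuits approximate is sketched below: the weight change decays exponentially with the spike-time difference, and the time constant plays the role of the critical timing window discussed above. The parameter values are arbitrary illustrations, not those of the fabricated circuit.

```python
import math

def stdp_delta_w(dt, a_plus=0.01, a_minus=0.012, tau=20e-3):
    """Pair-based STDP: dt = t_post - t_pre in seconds. Potentiation when the
    presynaptic spike precedes the postsynaptic one, depression otherwise;
    tau plays the role of the critical timing window."""
    if dt >= 0:
        return a_plus * math.exp(-dt / tau)
    return -a_minus * math.exp(dt / tau)

# stdp_delta_w(5e-3)  -> ~+0.0078 (potentiation: pre spike 5 ms before post spike)
# stdp_delta_w(-5e-3) -> ~-0.0093 (depression: post spike before pre spike)
```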