122 research outputs found

    A Comprehensive Workflow for General-Purpose Neural Modeling with Highly Configurable Neuromorphic Hardware Systems

    Full text link
    In this paper we present a methodological framework that meets novel requirements emerging from upcoming types of accelerated and highly configurable neuromorphic hardware systems. We describe in detail a device with 45 million programmable and dynamic synapses that is currently under development, and we sketch the conceptual challenges that arise from taking this platform into operation. More specifically, we aim at the establishment of this neuromorphic system as a flexible and neuroscientifically valuable modeling tool that can be used by non-hardware-experts. We consider various functional aspects to be crucial for this purpose, and we introduce a consistent workflow with detailed descriptions of all involved modules that implement the suggested steps: The integration of the hardware interface into the simulator-independent model description language PyNN; a fully automated translation between the PyNN domain and appropriate hardware configurations; an executable specification of the future neuromorphic system that can be seamlessly integrated into this biology-to-hardware mapping process as a test bench for all software layers and possible hardware design modifications; an evaluation scheme that deploys models from a dedicated benchmark library, compares the results generated by virtual or prototype hardware devices with reference software simulations and analyzes the differences. The integration of these components into one hardware-software workflow provides an ecosystem for ongoing preparative studies that support the hardware design process and represents the basis for the maturity of the model-to-hardware mapping software. The functionality and flexibility of the latter is proven with a variety of experimental results

    Hybrid FPGA: Architecture and Interface

    No full text
    Hybrid FPGAs (Field Programmable Gate Arrays) are composed of general-purpose logic resources with different granularities, together with domain-specific coarse-grained units. This thesis proposes a novel hybrid FPGA architecture with embedded coarse-grained Floating Point Units (FPUs) to improve the floating point capability of FPGAs. Based on the proposed hybrid FPGA architecture, we examine three aspects to optimise the speed and area for domain-specific applications. First, we examine the interface between large coarse-grained embedded blocks (EBs) and fine-grained elements in hybrid FPGAs. The interface includes parameters for varying: (1) aspect ratio of EBs, (2) position of the EBs in the FPGA, (3) I/O pins arrangement of EBs, (4) interconnect flexibility of EBs, and (5) location of additional embedded elements such as memory. Second, we examine the interconnect structure for hybrid FPGAs. We investigate how large and highdensity EBs affect the routing demand for hybrid FPGAs over a set of domain-specific applications. We then propose three routing optimisation methods to meet the additional routing demand introduced by large EBs: (1) identifying the best separation distance between EBs, (2) adding routing switches on EBs to increase routing flexibility, and (3) introducing wider channel width near the edge of EBs. We study and compare the trade-offs in delay, area and routability of these three optimisation methods. Finally, we employ common subgraph extraction to determine the number of floating point adders/subtractors, multipliers and wordblocks in the FPUs. The wordblocks include registers and can implement fixed point operations. We study the area, speed and utilisation trade-offs of the selected FPU subgraphs in a set of floating point benchmark circuits. We develop an optimised coarse-grained FPU, taking into account both architectural and system-level issues. Furthermore, we investigate the trade-offs between granularities and performance by composing small FPUs into a large FPU. The results of this thesis would help design a domain-specific hybrid FPGA to meet user requirements, by optimising for speed, area or a combination of speed and area

    Design and resource management of reconfigurable multiprocessors for data-parallel applications

    Get PDF
    FPGA (Field-Programmable Gate Array)-based custom reconfigurable computing machines have established themselves as low-cost and low-risk alternatives to ASIC (Application-Specific Integrated Circuit) implementations and general-purpose microprocessors in accelerating a wide range of computation-intensive applications. Most often they are Application Specific Programmable Circuiits (ASPCs), which are developer programmable instead of user programmable. The major disadvantages of ASPCs are minimal programmability, and significant time and energy overheads caused by required hardware reconfiguration when the problem size outnumbers the available reconfigurable resources; these problems are expected to become more serious with increases in the FPGA chip size. On the other hand, dominant high-performance computing systems, such as PC clusters and SMPs (Symmetric Multiprocessors), suffer from high communication latencies and/or scalability problems. This research introduces low-cost, user-programmable and reconfigurable MultiProcessor-on-a-Programmable-Chip (MPoPC) systems for high-performance, low-cost computing. It also proposes a relevant resource management framework that deals with performance, power consumption and energy issues. These semi-customized systems reduce significantly runtime device reconfiguration by employing userprogrammable processing elements that are reusable for different tasks in large, complex applications. For the sake of illustration, two different types of MPoPCs with hardware FPUs (floating-point units) are designed and implemented for credible performance evaluation and modeling: the coarse-grain MIMD (Multiple-Instruction, Multiple-Data) CG-MPoPC machine based on a processor IP (Intellectual Property) core and the mixed-mode (MIMD, SIMD or M-SIMD) variant-grain HERA (HEterogeneous Reconfigurable Architecture) machine. In addition to alleviating the above difficulties, MPoPCs can offer several performance and energy advantages to our data-parallel applications when compared to ASPCs; they are simpler and more scalable, and have less verification time and cost. Various common computation-intensive benchmark algorithms, such as matrix-matrix multiplication (MMM) and LU factorization, are studied and their parallel solutions are shown for the two MPoPCs. The performance is evaluated with large sparse real-world matrices primarily from power engineering. We expect even further performance gains on MPoPCs in the near future by employing ever improving FPGAs. The innovative nature of this work has the potential to guide research in this arising field of high-performance, low-cost reconfigurable computing. The largest advantage of reconfigurable logic lies in its large degree of hardware customization and reconfiguration which allows reusing the resources to match the computation and communication needs of applications. Therefore, a major effort in the presented design methodology for mixed-mode MPoPCs, like HERA, is devoted to effective resource management. A two-phase approach is applied. A mixed-mode weighted Task Flow Graph (w-TFG) is first constructed for any given application, where tasks are classified according to their most appropriate computing mode (e.g., SIMD or MIMD). At compile time, an architecture is customized and synthesized for the TFG using an Integer Linear Programming (ILP) formulation and a parameterized hardware component library. Various run-time scheduling schemes with different performanceenergy objectives are proposed. A system-level energy model for HERA, which is based on low-level implementation data and run-time statistics, is proposed to guide performance-energy trade-off decisions. A parallel power flow analysis technique based on Newton\u27s method is proposed and employed to verify the methodology

    Parallel and Distributed Computing

    Get PDF
    The 14 chapters presented in this book cover a wide variety of representative works ranging from hardware design to application development. Particularly, the topics that are addressed are programmable and reconfigurable devices and systems, dependability of GPUs (General Purpose Units), network topologies, cache coherence protocols, resource allocation, scheduling algorithms, peertopeer networks, largescale network simulation, and parallel routines and algorithms. In this way, the articles included in this book constitute an excellent reference for engineers and researchers who have particular interests in each of these topics in parallel and distributed computing

    Video Processing Acceleration using Reconfigurable Logic and Graphics Processors

    No full text
    A vexing question is `which architecture will prevail as the core feature of the next state of the art video processing system?' This thesis examines the substitutive and collaborative use of the two alternatives of the reconfigurable logic and graphics processor architectures. A structured approach to executing architecture comparison is presented - this includes a proposed `Three Axes of Algorithm Characterisation' scheme and a formulation of perfor- mance drivers. The approach is an appealing platform for clearly defining the problem, assumptions and results of a comparison. In this work it is used to resolve the advanta- geous factors of the graphics processor and reconfigurable logic for video processing, and the conditions determining which one is superior. The comparison results prompt the exploration of the customisable options for the graphics processor architecture. To clearly define the architectural design space, the graphics processor is first identifed as part of a wider scope of homogeneous multi-processing element (HoMPE) architectures. A novel exploration tool is described which is suited to the investigation of the customisable op- tions of HoMPE architectures. The tool adopts a systematic exploration approach and a high-level parameterisable system model, and is used to explore pre- and post-fabrication customisable options for the graphics processor. A positive result of the exploration is the proposal of a reconfigurable engine for data access (REDA) to optimise graphics processor performance for video processing-specific memory access patterns. REDA demonstrates the viability of the use of reconfigurable logic as collaborative `glue logic' in the graphics processor architecture

    Domain-aware Genetic Algorithms for Hardware and Mapping Optimization for Efficient DNN Acceleration

    Get PDF
    The proliferation of AI across a variety of domains (vision, language, speech, recommendations, games) has led to the rise of domain-specific accelerators for deep learning. At design-time, these accelerators carefully architect the on-chip dataflow to maximize data reuse (over space and time) and size the hardware resources (PEs and buffers) to maximize performance and energy-efficiency, while meeting the chip’s area and power targets. At compile-time, the target Deep Neural Network (DNN) model is mapped over the accelerator. The mapping refers to tiling the computation and data (i.e., tensors) and scheduling them over the PEs and scratchpad buffers respectively, while honoring the microarchitectural constraints (number of PEs, buffer sizes, and dataflow). The design-space of valid hardware resource assignments for a given dataflow and the valid mappings for a given hardware is extremely large (~O(10^24)) per layer for state-of-the-art DNN models today. This makes exhaustive searches infeasible. Unfortunately, there can be orders of magnitude performance and energy-efficiency differences between an optimal and sub-optimal choice, making these decisions a crucial part of the entire design process. Moreover, manual tuning by domain experts become unprecedentedly challenged due to increased irregularity (due to neural architecture search) and sparsity of DNN models. This necessitate the existence of Map Space Exploration (MSE). In this thesis, our goal is to deliver a deep analysis of the MSE for DNN accelerators, propose different techniques to improve MSE, and generalize the MSE framework to a wider landscape (from mapping to HW-mapping co-exploration, from single-accelerator to multi-accelerator scheduling). As part of it, we discuss the correlation between hardware flexibility and the formed map space, formalized the map space representation by four mapping axes: tile, order, parallelism, and shape. Next, we develop dedicated exploration operators for these axes and use genetic algorithm framework to converge the solution. Next, we develop "sparsity-aware" technique to enable sparsity consideration in MSE and a "warm-start" technique to solve the search speed challenge commonly seen across learning-based search algorithms. Finally, we extend out MSE to support hardware and map space co-exploration and multi-accelerator scheduling.Ph.D

    Just-in-time Hardware generation for abstracted reconfigurable computing

    Get PDF
    This thesis addresses the use of reconfigurable hardware in computing platforms, in order to harness the performance benefits of dedicated hardware whilst maintaining the flexibility associated with software. Although the reconfigurable computing concept is not new, the low level nature of the supporting tools normally used, together with the consequent limited level of abstraction and resultant lack of backwards compatibility, has prevented the widespread adoption of this technology. In addition, bandwidth and architectural limitations, have seriously constrained the potential improvements in performance. A review of existing approaches and tools flows is conducted to highlight the current problems being faced in this field. The objective of the work presented in this thesis is to introduce a radically new approach to reconfigurable computing tool flows. The runtime based tool flow introduces complete abstraction between the application developer and the underlying hardware. This new technique eliminates the ease of use and backwards compatibility issues that have plagued the reconfigurable computing concept, and could pave the way for viable mainstream reconfigurable computing platforms. An easy to use, cycle accurate behavioural modelling system is also presented, which was used extensively during the early exploration of new concepts and architectures. Some performance improvements produced by the new reconfigurable computing tool flow, when applied to both a MIPS based embedded platform, and the Cray XDl, are also presented. These results are then analyzed and the hardware and software factors affecting the performance increases that were obtained are discussed, together with potential techniques that could be used to further increase the performance of the system. Lastly a heterogenous computing concept is proposed, in which, a computer system, containing multiple types of computational resource is envisaged, each having their own strengths and weaknesses (e.g. DSPs, CPUs, FPGAs). A revolutionary new method of fully exploiting the potential of such a system, whilst maintaining scalability, backwards compatibility, and ease of use is also presented

    Closing the Gap between FPGA and ASIC:Balancing Flexibility and Efficiency

    Get PDF
    Despite many advantages of Field-Programmable Gate Arrays (FPGAs), they fail to take over the IC design market from Application-Specific Integrated Circuits (ASICs) for high-volume and even medium-volume applications, as FPGAs come with significant cost in area, delay, and power consumption. There are two main reasons that FPGAs have huge efficiency gap with ASICs: (1) FPGAs are extremely flexible as they have fully programmable soft-logic blocks and routing networks, and (2) FPGAs have hard-logic blocks that are only usable by a subset of applications. In other words, current FPGAs have a heterogeneous structure comprised of the flexible soft-logic and the efficient hard-logic blocks that suffer from inefficiency and inflexibility, respectively. The inefficiency of the soft-logic is a challenge for any application that is mapped to FPGAs, and lack of flexibility in the hard-logic results in a waste of resources when an application cannot use the hard-logic. In this thesis, we approach the inefficiency problem of FPGAs by bridging the efficiency/flexibility gap of the hard- and soft-logic. The main goal of this thesis is to compromise on efficiency of the hard-logic for flexibility, on the one hand, and to compromise on flexibility of the soft-logic for efficiency, on the other hand. In other words, this thesis deals with two issues: (1) adding more generality to the hard-logic of FPGAs, and (2) improving the soft-logic by adapting it to the generic requirements of applications. In the first part of the thesis, we introduce new techniques that expand the functionality of FPGAs hard-logic. The hard-logic includes the dedicated resources that are tightly coupled with the soft-logic –i.e., adder circuitry and carry chains –as well as the stand-alone ones –i.e., DSP blocks. These specialized resources are intended to accelerate critical arithmetic operations that appear in the pre-synthesis representation of applications; we introduce mapping and architectural solutions, which enable both types of the hard-logic to support additional arithmetic operations. We first present a mapping technique that extends the application of FPGAs carry chains for carry-save arithmetic, and then to increase the generality of the hard-logic, we introduce novel architectures; using these architectures, more applications can take advantage of FPGAs hard-logic. In the second part of the thesis, we improve the efficiency of FPGAs soft-logic by exploiting the circuit patterns that emerge after logic synthesis, i.e., connection and logic patterns. Using these patterns, we design new soft-logic blocks that have less flexibility, but more efficiency than current ones. In this part, we first introduce logic chains, fixed connections that are integrated between the soft-logic blocks of FPGAs and are well-suited for long chains of logic that appear post-synthesis. Logic chains provide fast and low cost connectivity, increase the bandwidth of the logic blocks without changing their interface with the routing network, and improve the logic density of soft-logic blocks. In addition to logic chains and as a complementary contribution, we present a non-LUT soft-logic block that comprises simple and pre-connected cells. The structure of this logic block is inspired from the logic patterns that appear post-synthesis. This block has a complexity that is only linear in the number of inputs, it sports the potential for multiple independent outputs, and the delay is only logarithmic in the number of inputs. Although this new block is less flexible than a LUT, we show (1) that effective mapping algorithms exist, (2) that, due to their simplicity, poor utilization is less of an issue than with LUTs, and (3) that a few LUTs can still be used in extreme unfortunate cases. In summary, to bridge the gap between FPGAs and ASICs, we approach the problem from two complementary directions, which balance flexibility and efficiency of the logic blocks of FPGAs. However, we were able to explore a few design points in this thesis, and future work could focus on further exploration of the design space

    Metoda projektovanja namenskih programabilnih hardverskih akceleratora

    Get PDF
    Namenski računarski sistemi se najčesće projektuju tako da mogu da podrže izvršavanje većeg broja željenih aplikacija. Za postizanje što veće efikasnosti, preporučuje se korišćenje specijalizovanih procesora Application Specific Instruction Set Processors–ASIPs, na kojima se izvršavanje programskih instrukcija obavlja u za to projektovanim i nezavisnimhardverskim blokovima (akceleratorima). Glavni razlog za postojanje nezavisnih akceleratora jeste postizanjemaksimalnog ubrzanja izvršavanja instrukcija. Me ¯ dutim, ovakav pristup podrazumeva da je za svaki od blokova potrebno projektovati integrisano (ASIC) kolo, čime se bitno povećava ukupna površina procesora. Metod za smanjenje ukupne površine jeste primena DatapathMerging tehnike na dijagrame toka podataka ulaznih aplikacija. Kao rezultat, dobija se jedan programabilni hardverski akcelerator, sa mogućnosću izvršavanja svih željenih instrukcija. Međutim, ovo ima negativne posledice na efikasnost sistema. često se zanemaruje činjenica da, usled veoma ograničene fleksibilnosti ASIC hardverskih akceleratora, specijalizovani procesori imaju i drugih nedostataka. Naime, u slučaju izmena, ili prosto nadogradnje, specifikacije procesora u završnimfazama projektovanja, neizbežna su velika kašnjenja i dodatni troškovi promene dizajna. U ovoj tezi je pokazano da zahtevi za fleksibilnošću i efikasnošću ne moraju biti međusobno isključivi. Demonstrirano je je da je moguce uneti ograničeni nivo fleksibilnosti hardvera tokom dizajn procesa, tako da dobijeni hardverski akcelerator može da izvršava ne samo aplikacije definisane na samom početku projektovanja, već i druge aplikacije, pod uslovom da one pripadaju istom domenu. Drugim rečima, u tezi je prezentovana metoda projektovanja fleksibilnih namenskih hardverskih akceleratora. Eksperimentalnom evaluacijom pokazano je da su tako dobijeni akceleratori u većini slučajeva samo do 2 x veće površine ili 2 x većeg kašnjenja od akceleratora dobijenih primenom DatapathMerging metode, koja pritom ne pruža ni malo dodatne fleksibilnosti.Typically, embedded systems are designed to support a limited set of target applications. To efficiently execute those applications, they may employ Application Specific Instruction Set Processors (ASIPs) enriched with carefully designed Instructions Set Extension (ISEs) implemented in dedicated hardware blocks. The primary goal when designing ISEs is efficiency, i.e. the highest possible speedup, which implies synthesizing all critical computational kernels of the application dataflow graphs as an Application Specific Integrated Circuit (ASICs). Yet, this can lead to high on-chip area dedicated solely to ISEs. One existing approach to decrease this area by paying a reasonable price of decreased efficiency is to perform datapath merging on input dataflow graphs (DFGs) prior to generating the ASIC. It is often neglected that even higher costs can be accidentally incurred due to the lack of flexibility of such ISEs. Namely, if late design changes or specification upgrades happen, significant time-to-market delays and nonrecurrent costs for redesigning the ISEs and the corresponding ASIPs become inevitable. This thesis shows that flexibility and efficiency are not mutually exclusive. It demonstrates that it is possible to introduce a limited amount of hardware flexibility during the design process, such that the resulting datapath is in fact reconfigurable and thus can execute not only the applications known at design time, but also other applications belonging to the same application-domain. In other words, it proposes a methodology for designing domain-specific reconfigurable arrays out of a limited set of input applications. The experimental results show that resulting arrays are usually around 2£ larger and 2£ slower than ISEs synthesized using datapath merging, which have practically null flexibility beyond the design set of DFGs

    Collaborative autonomy in heterogeneous multi-robot systems

    Get PDF
    As autonomous mobile robots become increasingly connected and widely deployed in different domains, managing multiple robots and their interaction is key to the future of ubiquitous autonomous systems. Indeed, robots are not individual entities anymore. Instead, many robots today are deployed as part of larger fleets or in teams. The benefits of multirobot collaboration, specially in heterogeneous groups, are multiple. Significantly higher degrees of situational awareness and understanding of their environment can be achieved when robots with different operational capabilities are deployed together. Examples of this include the Perseverance rover and the Ingenuity helicopter that NASA has deployed in Mars, or the highly heterogeneous robot teams that explored caves and other complex environments during the last DARPA Sub-T competition. This thesis delves into the wide topic of collaborative autonomy in multi-robot systems, encompassing some of the key elements required for achieving robust collaboration: solving collaborative decision-making problems; securing their operation, management and interaction; providing means for autonomous coordination in space and accurate global or relative state estimation; and achieving collaborative situational awareness through distributed perception and cooperative planning. The thesis covers novel formation control algorithms, and new ways to achieve accurate absolute or relative localization within multi-robot systems. It also explores the potential of distributed ledger technologies as an underlying framework to achieve collaborative decision-making in distributed robotic systems. Throughout the thesis, I introduce novel approaches to utilizing cryptographic elements and blockchain technology for securing the operation of autonomous robots, showing that sensor data and mission instructions can be validated in an end-to-end manner. I then shift the focus to localization and coordination, studying ultra-wideband (UWB) radios and their potential. I show how UWB-based ranging and localization can enable aerial robots to operate in GNSS-denied environments, with a study of the constraints and limitations. I also study the potential of UWB-based relative localization between aerial and ground robots for more accurate positioning in areas where GNSS signals degrade. In terms of coordination, I introduce two new algorithms for formation control that require zero to minimal communication, if enough degree of awareness of neighbor robots is available. These algorithms are validated in simulation and real-world experiments. The thesis concludes with the integration of a new approach to cooperative path planning algorithms and UWB-based relative localization for dense scene reconstruction using lidar and vision sensors in ground and aerial robots
    corecore