39 research outputs found

    Logical-to-Physical Memory Mapping for FPGAs with Dual-Port Embedded Arrays

    System-Level Optimization of Accelerator Local Memory for Heterogeneous Systems-on-Chip

    In modern system-on-chip architectures, specialized accelerators are increasingly used to improve performance and energy efficiency. The growing complexity of these systems requires the use of system-level design methodologies featuring high-level synthesis (HLS) for generating these components efficiently. Existing HLS tools, however, have limited support for the system-level optimization of memory elements, which typically occupy most of the accelerator area. We present a complete methodology for designing the private local memories (PLMs) of multiple accelerators. Given the memory requirements of each accelerator, our methodology automatically determines an area-efficient architecture for the PLMs that guarantees performance and reduces the memory cost, using technology-related information. We implemented a prototype tool, called Mnemosyne, that embodies our methodology within a commercial HLS flow. We designed 13 complex accelerators for selected applications from two recently released benchmark suites (Perfect and CortexSuite). With our approach we are able to reduce the memory cost of single accelerators by up to 45%. Moreover, when reusing memory IPs across accelerators, we achieve area savings that range between 17% and 55% compared to the case where the PLMs are designed separately.
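    The abstract does not spell out how memory IPs are shared across accelerators, so the following is only a rough, hypothetical illustration of the general idea: logical buffers belonging to accelerators that never run at the same time are packed into common physical banks. The `Buffer` type, the greedy first-fit packing, and the bank parameters are assumptions for this sketch, not Mnemosyne's actual algorithm.

```python
# Hypothetical sketch: pack logical buffers from mutually exclusive
# accelerators into shared physical banks (greedy first-fit).
# This is NOT the Mnemosyne algorithm, only an illustration of the idea;
# intra-accelerator port conflicts and width splitting are not modeled.
from dataclasses import dataclass

@dataclass
class Buffer:
    accelerator: str   # owning accelerator
    words: int         # required depth in words
    width: int         # word width in bits

def share_banks(buffers, conflicts, bank_words=2048, bank_width=32):
    """Greedily assign buffers to physical banks; two buffers may share a
    bank only if their accelerators never run concurrently (no conflict)."""
    banks = []  # each bank: {"accs": set of owning accelerators, "free": words left}
    for b in sorted(buffers, key=lambda b: b.words, reverse=True):
        assert b.width <= bank_width, "width splitting not modeled here"
        placed = False
        for bank in banks:
            compatible = all((b.accelerator, a) not in conflicts and
                             (a, b.accelerator) not in conflicts
                             for a in bank["accs"])
            if compatible and bank["free"] >= b.words:
                bank["accs"].add(b.accelerator)
                bank["free"] -= b.words
                placed = True
                break
        if not placed:
            banks.append({"accs": {b.accelerator}, "free": bank_words - b.words})
    return len(banks)

# Example: two accelerators that never run concurrently can share one bank
# instead of using two separate per-accelerator banks.
bufs = [Buffer("fft", 1200, 32), Buffer("sort", 700, 32)]
print(share_banks(bufs, conflicts=set()))   # 1 shared bank instead of 2
```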

    Hardware/Software Co-Design via Specification Refinement

    System-level design is an engineering discipline focused on producing methods, technologies, and tools that enable the specification, design, and implementation of complex, multi-discipline, and multi-domain systems. System-level specifications are as abstract as possible, defining required system behaviors while eliding implementation details. These implementation details must be added during the implementation process, and the high effort this requires locks system engineers into the chosen implementation architecture. This work provides two contributions that ease the implementation process. The Rosetta synthesis capability generates hardware/software co-designed implementations from specifications that contain low-level implementation details. The Rosetta refinement capability extends this by allowing a system's functional behavior and its implementation details to be described separately. The Rosetta Refinement Tool combines the functional behavior and the implementation details to form a system specification that can be synthesized using the Rosetta synthesis capability. The Rosetta refinement capability is exposed using existing Rosetta language constructs that had, prior to this work, never been exploited. Together these two capabilities allow the refinement of high-level, architecture-independent specifications into low-level, architecture-specific hardware/software co-designed implementations. The result is an effective platform for rapid prototyping of hardware/software co-designs that provides system engineers with the novel ability to explore different system architectures with low effort.

    Address generator synthesis

    Exploiting All-Programmable System on Chips for Closed-Loop Real-Time Neural Interfaces

    High-density microelectrode arrays (HDMEAs) feature thousands of recording electrodes on a single chip with an area of a few square millimeters. The resulting electrode density is comparable to, and even higher than, the typical density of neuronal cells in cortical cultures. Commercially available HDMEA-based acquisition systems can record the neural activity of the whole array simultaneously with sub-millisecond resolution. These devices are a very promising tool and are increasingly used in neuroscience to tackle fundamental questions about the complex dynamics of neural networks. Although electrical or optical stimulation is generally available in such systems, they lack the capability of creating a closed loop between the biological neural activity and the artificial system. Stimuli are usually delivered in an open-loop manner, violating the inherent working principle of neural circuits, which in nature constantly react to the external environment. This prevents unraveling the real mechanisms behind the behavior of neural networks. The primary objective of this PhD work is to overcome this limitation by creating a fully reconfigurable processing system capable of providing real-time feedback to the ongoing neural activity recorded with HDMEA platforms. The potential of modern heterogeneous FPGAs has been exploited to realize the system; in particular, the Xilinx Zynq All Programmable System on Chip (APSoC) has been used. The device features reconfigurable logic, specialized hardwired blocks, and a dual-core ARM-based processor; the synergy of these components achieves high processing performance while maintaining a high level of flexibility and adaptability. The developed system has been embedded in an acquisition and stimulation setup featuring the following platforms:
    • 3Brain BioCam X, a state-of-the-art HDMEA-based acquisition platform capable of recording in parallel from 4096 electrodes at 18 kHz per electrode;
    • PlexStim™ Electrical Stimulator System, able to generate electrical stimuli with custom waveforms on 16 different output channels;
    • Texas Instruments DLP® LightCrafter™ Evaluation Module, capable of projecting 608x684-pixel images at a refresh rate of 60 Hz, used for optical stimulation.
    All the features of the system, such as band-pass filtering and spike detection on all the recorded channels, have been validated by means of ex vivo experiments. Very low latency has been achieved while processing the whole input data stream in real time: in the case of electrical stimulation the total latency is below 2 ms; when optical stimuli are needed, the total latency is slightly higher, 21 ms in the worst case. The final setup is ready to be used to infer cellular properties by means of closed-loop experiments. As a proof of concept, it has been successfully used for the clustering and classification of retinal ganglion cells (RGCs) in mouse retina. For this experiment, the light-evoked spikes from thousands of RGCs were correctly recorded and analyzed in real time, and around 90% of the clusters were classified as ON- or OFF-type cells. In addition to the closed-loop system, a denoising prototype has been developed. The main idea is to exploit oversampling techniques to reduce the thermal noise recorded by HDMEA-based acquisition systems. The prototype is capable of processing in real time all the input signals from the BioCam X and is currently being tested to evaluate its performance in terms of signal-to-noise-ratio improvement.
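    The per-channel processing chain mentioned above (band-pass filtering followed by spike detection) runs in FPGA fabric in the actual system; the sketch below is only a software model of that kind of chain. The 300-3000 Hz band, the 5-sigma threshold rule, and the synthetic test signal are assumptions; only the 18 kHz per-electrode sampling rate comes from the abstract.

```python
# Software model of a per-channel band-pass filter + threshold spike detector.
# The real system implements this in reconfigurable logic; cutoff frequencies
# and the threshold rule here are assumptions, not the thesis' parameters.
import numpy as np
from scipy.signal import butter, sosfilt

FS = 18_000            # sampling rate per electrode (from the abstract)
SOS = butter(2, [300, 3000], btype="bandpass", fs=FS, output="sos")

def detect_spikes(raw, threshold_sigma=5.0):
    """Return sample indices where the filtered signal crosses a noise-scaled
    negative threshold (single channel). Noise is estimated robustly from the
    median absolute value."""
    filtered = sosfilt(SOS, raw)
    sigma = np.median(np.abs(filtered)) / 0.6745   # robust std estimate
    below = filtered < -threshold_sigma * sigma    # negative-going spikes
    # rising edges of the boolean mask mark spike onsets
    return np.flatnonzero(below[1:] & ~below[:-1]) + 1

# Example on synthetic data: 1 s of noise plus three injected spikes.
rng = np.random.default_rng(0)
signal = rng.normal(0, 10, FS)          # arbitrary units
for t in (3000, 9000, 15000):
    signal[t:t + 10] -= 120             # crude negative spike
# onsets near 3000, 9000, 15000 (filter delay/ringing may shift or split them)
print(detect_spikes(signal))
```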

    Parallelization of Stochastic Evolution

    The complexity involved in VLSI design and its sub-problems has always made them ideal application areas for non-deterministic iterative heuristics. However, the major drawback has been the large runtime required to reach acceptable solutions, especially in the case of multi-objective optimization problems. Among the acceleration techniques proposed, parallelization of iterative heuristics is a promising one. The motivations for parallel CAD include faster runtimes, handling of larger problem sizes, and exploration of a larger search space. In this work, the development of parallel algorithms for Stochastic Evolution, applied to the multi-objective VLSI cell-placement problem, is presented. In VLSI circuit design, placement is the process of arranging circuit blocks on a layout. In standard-cell design, placement consists of determining optimum positions of all blocks on the layout to satisfy the constraints and improve a number of objectives. The placement objectives in our work are to reduce power dissipation and wire length while improving performance (timing). The parallelization is achieved on a cluster of workstations interconnected by a low-latency network, using MPI communication libraries. Circuits from ISCAS-89 are used as benchmarks. Results for parallel Stochastic Evolution are compared with its sequential counterpart as well as with parallel versions of Simulated Annealing, as a reference point for both the quality of solution and the execution time. After parallelization, linear and superlinear speed-ups were obtained, with no degradation in the quality of the solution.
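    The abstract does not describe the parallelization strategy in detail; the sketch below shows one common pattern for parallel iterative placement heuristics, namely independent search on every MPI rank with periodic adoption of the globally best solution, written with mpi4py. The cost function, the move operator, the acceptance rule, and the exchange period are placeholders, not the design evaluated in this work.

```python
# Hypothetical sketch of one way to parallelize a stochastic-evolution-style
# placement search with MPI: each rank perturbs its own copy of the placement
# and all ranks periodically adopt the best solution found so far.
import random
from mpi4py import MPI

def cost(placement):                 # placeholder single-number cost
    return sum(abs(i - p) for i, p in enumerate(placement))

def perturb(placement, rng):         # placeholder move: swap two cells
    a, b = rng.randrange(len(placement)), rng.randrange(len(placement))
    placement = list(placement)
    placement[a], placement[b] = placement[b], placement[a]
    return placement

def parallel_stoch_evo(n_cells=64, iters=2000, exchange_every=100):
    comm = MPI.COMM_WORLD
    rng = random.Random(comm.Get_rank())      # different random stream per rank
    current = list(range(n_cells))
    rng.shuffle(current)
    for it in range(iters):
        candidate = perturb(current, rng)
        # accept improvements and small, randomly bounded uphill moves
        if cost(candidate) - cost(current) <= rng.randint(0, 2):
            current = candidate
        if (it + 1) % exchange_every == 0:
            # every rank adopts the placement of the currently best rank
            costs = comm.allgather(cost(current))
            best_rank = costs.index(min(costs))
            current = comm.bcast(current, root=best_rank)
    return current

if __name__ == "__main__":           # run with: mpiexec -n 4 python stoch_evo.py
    best = parallel_stoch_evo()
    if MPI.COMM_WORLD.Get_rank() == 0:
        print("final cost:", cost(best))
```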

    Generation of Application Specific Hardware Extensions for Hybrid Architectures: The Development of PIRANHA - A GCC Plugin for High-Level-Synthesis

    Architectures combining a field-programmable gate array (FPGA) and a general-purpose processor on a single chip have become increasingly popular in recent years. On the one hand, such hybrid architectures facilitate the use of application-specific hardware accelerators that improve the performance of the software on the host processor. On the other hand, they oblige system designers to handle the whole process of hardware/software co-design. The complexity of this process is still one of the main reasons that hinder the widespread use of hybrid architectures. Thus, an automated process that aids programmers with the hardware/software partitioning and the generation of application-specific accelerators is an important issue. The method presented in this thesis requires neither restrictions of the high-level language used nor special source-code annotations, which usually constitute an entry barrier for programmers without a deeper understanding of the underlying hardware platform. This thesis introduces a seamless programming flow that allows generating hardware accelerators for unrestricted, legacy C code. The implementation consists of a GCC plugin that automatically identifies application hot-spots and generates hardware accelerators accordingly. Apart from the accelerator implementation in a hardware description language, the compiler plugin provides the generation of host-processor interfaces and, if necessary, a prototypical integration with the host operating system. An evaluation with typical embedded applications shows general benefits of the approach, but also reveals limiting factors that hamper possible performance improvements.
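    As a rough illustration of the idea of automatic hot-spot identification, the sketch below selects accelerator candidates from flat-profile data by cumulative runtime coverage. PIRANHA operates inside GCC on its intermediate representation, so this stand-alone heuristic, its input format, and the 80% coverage threshold are assumptions rather than the plugin's actual logic.

```python
# Hypothetical illustration of profile-driven hot-spot selection: pick the
# smallest set of functions (hottest first) that covers a target share of the
# total runtime and treat them as accelerator candidates.

def select_hotspots(profile, coverage=0.80):
    """profile: list of (function_name, seconds). Return the hottest functions
    that together cover `coverage` of the total runtime."""
    total = sum(t for _, t in profile)
    selected, covered = [], 0.0
    for name, t in sorted(profile, key=lambda x: x[1], reverse=True):
        if covered / total >= coverage:
            break
        selected.append(name)
        covered += t
    return selected

# Example flat profile (made up): the two hottest functions would be offloaded.
profile = [("fir_filter", 6.2), ("matrix_mul", 2.9), ("parse_args", 0.3),
           ("init_tables", 0.4), ("log_stats", 0.2)]
print(select_hotspots(profile))   # ['fir_filter', 'matrix_mul']
```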