System-Level Optimization of Accelerator Local Memory for Heterogeneous Systems-on-Chip
In modern system-on-chip architectures, specialized accelerators are increasingly used to improve performance and energy efficiency. The growing complexity of these systems requires system-level design methodologies featuring high-level synthesis (HLS) for generating these components efficiently. Existing HLS tools, however, have limited support for the system-level optimization of memory elements, which typically occupy most of the accelerator area. We present a complete methodology for designing the private local memories (PLMs) of multiple accelerators. Based on the memory requirements of each accelerator, our methodology automatically determines an area-efficient architecture for the PLMs that guarantees performance and reduces the memory cost based on technology-related information. We implemented a prototype tool, called Mnemosyne, that embodies our methodology within a commercial HLS flow. We designed 13 complex accelerators for selected applications from two recently released benchmark suites (PERFECT and CortexSuite). With our approach we are able to reduce the memory cost of single accelerators by up to 45%. Moreover, when reusing memory IPs across accelerators, we achieve area savings between 17% and 55% compared to the case where the PLMs are designed separately.
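The intuition behind reusing memory IPs across accelerators can be illustrated with a toy calculation (a hypothetical sketch, not the actual Mnemosyne methodology): if two accelerators are never active at the same time, one physical bank sized for the larger requirement can serve both, and the saving follows directly. All accelerator names and sizes below are invented for illustration.

```python
# Toy model of private-local-memory (PLM) sharing across accelerators.
# Illustrative assumptions: area is proportional to capacity, and
# accelerators flagged as mutually exclusive can share one physical bank.

def shared_memory_area(requirements, exclusive_pairs):
    """requirements: dict accelerator -> PLM words needed.
    exclusive_pairs: set of frozensets of accelerators never active together."""
    merged = dict(requirements)
    for pair in exclusive_pairs:
        a, b = sorted(pair)
        if a in merged and b in merged:
            # One bank sized for the larger requirement serves both.
            merged[a] = max(merged.pop(a), merged.pop(b))
    return sum(merged.values())

def area_saving(requirements, exclusive_pairs):
    separate = sum(requirements.values())      # each PLM designed separately
    shared = shared_memory_area(requirements, exclusive_pairs)
    return 100.0 * (separate - shared) / separate

# Hypothetical example: "fft" and "sort" never run concurrently.
reqs = {"fft": 4096, "sort": 2048, "debayer": 4096}
saving = area_saving(reqs, {frozenset({"fft", "sort"})})
```

In this toy case the separate design needs 10240 words while the shared design needs 8192, a 20% saving; the real methodology additionally accounts for bank geometries and technology-related cost data.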
Hardware/Software Co-Design via Specification Refinement
System-level design is an engineering discipline focused on producing methods, technologies, and tools that enable the specification, design, and implementation of complex, multi-discipline, multi-domain systems. System-level specifications are as abstract as possible, defining required system behaviors while eliding implementation details. These details must be added during the implementation process, and the high effort this requires locks system engineers into the chosen implementation architecture. This work provides two contributions that ease the implementation process. The Rosetta synthesis capability generates hardware/software co-designed implementations from specifications that contain low-level implementation details. The Rosetta refinement capability extends this by allowing a system's functional behavior and its implementation details to be described separately. The Rosetta Refinement Tool combines the functional behavior and the implementation details to form a system specification that can be synthesized using the Rosetta synthesis capability. The Rosetta refinement capability is exposed through existing Rosetta language constructs that had, prior to this work, never been exploited. Together these two capabilities allow the refinement of high-level, architecture-independent specifications into low-level, architecture-specific hardware/software co-designed implementations. The result is an effective platform for rapid prototyping of hardware/software co-designs that gives system engineers the novel ability to explore different system architectures with low effort.
Exploiting All-Programmable System on Chips for Closed-Loop Real-Time Neural Interfaces
High-density microelectrode arrays (HDMEAs) feature thousands of recording electrodes in a single chip with an area of a few square millimeters. The resulting electrode density is comparable to, and even higher than, the typical density of neuronal cells in cortical cultures. Commercially available HDMEA-based acquisition systems are able to record the neural activity of the whole array simultaneously with submillisecond resolution. These devices are a very promising tool and are increasingly used in neuroscience to tackle fundamental questions regarding the complex dynamics of neural networks. Although electrical or optical stimulation is generally available in such systems, they lack the capability of creating a closed loop between the biological neural activity and the artificial system. Stimuli are usually delivered in an open-loop manner, thus violating the inherent working principle of neural circuits, which in nature constantly react to the external environment. This prevents unraveling the real mechanisms behind the behavior of neural networks.
The primary objective of this PhD work is to overcome this limitation by creating a fully reconfigurable processing system capable of providing real-time feedback to the ongoing neural activity recorded with HDMEA platforms. The capabilities of modern heterogeneous FPGAs have been exploited to realize the system. In particular, the Xilinx Zynq All Programmable System on Chip (APSoC) has been used. The device features reconfigurable logic, specialized hardwired blocks, and a dual-core ARM-based processor; the synergy of these components makes it possible to achieve high processing performance while maintaining a high level of flexibility and adaptivity. The developed system has been embedded in an acquisition and stimulation setup featuring the following platforms:
• 3·Brain BioCam X, a state-of-the-art HDMEA-based acquisition platform capable of recording in parallel from 4096 electrodes at 18 kHz per electrode.
• PlexStim™ Electrical Stimulator System, able to deliver electrical stimuli with custom waveforms to 16 different output channels.
• Texas Instruments DLP® LightCrafter™ Evaluation Module, capable of projecting 608×684-pixel images with a refresh rate of 60 Hz; it serves as the optical stimulation source.
All the features of the system, such as band-pass filtering and spike detection on all the recorded channels, have been validated by means of ex vivo experiments. Very low latency has been achieved while processing the whole input data stream in real time. In the case of electrical stimulation the total latency is below 2 ms; when optical stimuli are needed, the total latency is slightly higher, reaching 21 ms in the worst case.
The final setup is ready to be used to infer cellular properties by means of closed-loop experiments. As a proof of concept, it has been successfully used for the clustering and classification of retinal ganglion cells (RGCs) in the mouse retina. In this experiment, the light-evoked spikes from thousands of RGCs were correctly recorded and analyzed in real time. Around 90% of the total clusters were classified as ON- or OFF-type cells.
In addition to the closed-loop system, a denoising prototype has been developed. The main idea is to exploit oversampling techniques to reduce the thermal noise recorded by HDMEA-based acquisition systems. The prototype is capable of processing all the input signals from the BioCam X in real time, and it is currently being tested to evaluate its performance in terms of signal-to-noise-ratio improvement.
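The statistical idea behind oversampling-based denoising — averaging N uncorrelated readings attenuates thermal noise by roughly √N — can be sketched as follows (illustrative numbers only; this is not the actual BioCam X processing pipeline):

```python
import random
import statistics

# Sketch of oversampling-based denoising: averaging N uncorrelated
# noisy readings of the same sample reduces the noise standard
# deviation by roughly sqrt(N). All values here are illustrative.

random.seed(42)
N = 16                      # hypothetical oversampling factor
true_value = 0.0            # constant "signal", so residuals are pure noise
raw = [true_value + random.gauss(0.0, 1.0) for _ in range(20000)]

# Average consecutive groups of N oversampled readings.
denoised = [statistics.fmean(raw[i:i + N]) for i in range(0, len(raw), N)]

noise_before = statistics.stdev(raw)
noise_after = statistics.stdev(denoised)
reduction = noise_before / noise_after   # expected to be close to sqrt(16) = 4
```

With a factor of N = 16 the noise standard deviation drops by about 4×, which is the kind of signal-to-noise improvement such a prototype aims to realize in hardware.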
Parallelization of Stochastic Evolution
The complexity involved in VLSI design and its sub-problems has always made them ideal application areas for non-deterministic iterative heuristics. However, the major drawback has been the large runtime involved in reaching acceptable solutions, especially in the case of multi-objective optimization problems. Among the acceleration techniques proposed, parallelization of iterative heuristics is a promising one. The motivations for parallel CAD include faster runtimes, handling of larger problem sizes, and exploration of a larger search space. In this work, the development of parallel algorithms for Stochastic Evolution, applied to the multi-objective VLSI cell-placement problem, is presented. In VLSI circuit design, placement is the process of arranging circuit blocks on a layout. In standard-cell design, placement consists of determining optimal positions for all blocks on the layout to satisfy the constraints and improve a number of objectives. The placement objectives in our work are to reduce power dissipation and wirelength while improving performance (timing). The parallelization is achieved on a cluster of workstations interconnected by a low-latency network, using MPI communication libraries. Circuits from ISCAS-89 are used as benchmarks. Results for parallel Stochastic Evolution are compared with its sequential counterpart, as well as with the results achieved by parallel versions of Simulated Annealing, as a reference point for both solution quality and execution time. After parallelization, linear and super-linear speedups were obtained, with no degradation in solution quality.
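The characteristic accept/reject rule of Stochastic Evolution — a move is kept when its gain exceeds a random threshold drawn from [-R, 0], so mildly uphill moves occasionally pass — can be sketched on a toy one-dimensional placement (an illustrative sequential sketch, not the parallel MPI implementation described above; the cost model and parameters are invented):

```python
import random

# Toy Stochastic Evolution for 1D standard-cell placement (illustrative).
# Cost = sum over two-pin nets of the distance between their cells.

def wirelength(placement, nets):
    pos = {cell: slot for slot, cell in enumerate(placement)}
    return sum(abs(pos[a] - pos[b]) for a, b in nets)

def stochastic_evolution(cells, nets, iterations=2000, R=2.0, seed=1):
    rng = random.Random(seed)
    placement = list(cells)
    best = list(placement)
    best_cost = cost = wirelength(placement, nets)
    for _ in range(iterations):
        i, j = rng.sample(range(len(placement)), 2)   # propose a swap
        placement[i], placement[j] = placement[j], placement[i]
        new_cost = wirelength(placement, nets)
        gain = cost - new_cost
        if gain > rng.uniform(-R, 0.0):   # stochastic acceptance rule:
            cost = new_cost               # improving moves always pass,
            if cost < best_cost:          # slightly worsening ones sometimes
                best_cost, best = cost, list(placement)
        else:
            placement[i], placement[j] = placement[j], placement[i]  # undo
    return best, best_cost

cells = list(range(8))
nets = [(0, 1), (1, 2), (2, 3), (4, 5), (5, 6), (6, 7), (0, 7)]
best, best_cost = stochastic_evolution(cells, nets)
```

Parallel variants distribute such search loops across workers (e.g. exchanging solutions over MPI), which is what enables the linear and super-linear speedups reported above.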
Generation of Application Specific Hardware Extensions for Hybrid Architectures: The Development of PIRANHA - A GCC Plugin for High-Level-Synthesis
Architectures combining a field-programmable gate array (FPGA) and a general-purpose processor on a single chip have become increasingly popular in recent years. On the one hand, such hybrid architectures facilitate the use of application-specific hardware accelerators that improve the performance of the software on the host processor. On the other hand, they oblige system designers to handle the whole process of hardware/software co-design. The complexity of this process is still one of the main reasons that hinder the widespread use of hybrid architectures. Thus, an automated process that aids programmers with the hardware/software partitioning and the generation of application-specific accelerators is an important issue. The method presented in this thesis requires neither restrictions of the high-level language used nor special source code annotations, both of which are usually entry barriers for programmers without a deeper understanding of the underlying hardware platform.
This thesis introduces a seamless programming flow that allows generating hardware accelerators for unrestricted, legacy C code. The implementation consists of a GCC plugin that automatically identifies application hot spots and generates hardware accelerators accordingly. Apart from the accelerator implementation in a hardware description language, the compiler plugin provides the generation of host processor interfaces and, if necessary, a prototypical integration with the host operating system. An evaluation with typical embedded applications shows general benefits of the approach, but also reveals limiting factors that hamper possible performance improvements.