3,237 research outputs found
Formal and Informal Methods for Multi-Core Design Space Exploration
We propose a tool-supported methodology for design-space exploration for
embedded systems. It provides means to define high-level models of applications
and multi-processor architectures and evaluate the performance of different
deployment (mapping, scheduling) strategies while taking uncertainty into
account. We argue that this extension of the scope of formal verification is
important for the viability of the domain.
Comment: In Proceedings QAPL 2014, arXiv:1406.156
An Efficient Energy Aware Adaptive System-On-Chip Architecture For Real-Time Video Analytics
Video analytics applications, which mostly run on embedded devices, have
become prevalent in today's life. This proliferation has necessitated the development
of System-on-Chips (SoCs) that perform most of the processing in a single chip rather than
in discrete components. Embedded vision is bounded by stringent requirements, namely
real-time performance, limited energy, and adaptivity to cope with evolving standards.
Additionally, to design such complex SoCs, particularly on the Zynq All Programmable
SoC, the traditional hardware/software codesign approaches, which rely
on software profiling to perform the hardware/software partitioning, have fallen short
of achieving this task, because profiling cannot predict the performance of an application
on hardware; thus, a model that relates application characteristics to platform
performance is indispensable. Delivering real-time performance for fast-growing video
resolutions while maintaining architectural flexibility is not viable on processors,
Graphics Processing Units, Digital Signal Processors, or Application-Specific Integrated
Circuits. Furthermore, with semiconductor technology scaling, increased power dissipation
is expected, whereas battery capacity is not expected to increase significantly.
A performance model for Zynq is developed using an analytical method and used
in hardware/software codesign to facilitate mapping algorithms to hardware. Afterwards,
an SoC for real-time video analytics is realized on Zynq using the Harris corner
detection algorithm. A careful analysis of the algorithm and efficient utilization of
Zynq resources result in a highly parallelized and pipelined architecture that outperforms
the state-of-the-art. Running on the developed energy-aware adaptive SoC and utilizing
dynamic partial reconfiguration, a context-aware configuration scheduler adheres to the
operating context and trades off video resolution against energy consumption to
sustain the longest possible operation time while delivering real-time performance. Real-time
corner detection at 79.8, 176.9, and 504.2 frames per second for HD1080, HD720,
and VGA, respectively, is achieved, which outperforms the state-of-the-art for HD720
by 31 times and for VGA by 3.5 times. The scheduler configures, at run-time, the
appropriate hardware that satisfies the operating context and user-defined constraints,
choosing among the accelerators developed for the HD1080, HD720, and VGA video standards.
The self-adaptive method achieves 1.77 times longer operation time than a
parametrized IP core for the same battery capacity, with negligible reconfiguration energy
overhead. A marginal effect of reconfiguration time overhead is observed; for
instance, only two video frames are dropped for HD1080p60 during reconfiguration.
Facilitating the design process with analytical modeling, and the efficient
utilization of Zynq resources along with self-adaptivity, result in an efficient energy-aware
SoC that provides real-time performance for video analytics
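The Harris corner response that the accelerator above parallelizes and pipelines can be sketched in plain software. The NumPy version below is an illustration only, not the thesis's hardware pipeline; the window size and the constant k are assumptions. It computes the classic per-pixel response R = det(M) - k*trace(M)^2 from the windowed gradient products:

```python
import numpy as np

def harris_response(img, k=0.04, win=3):
    """Harris corner response R = det(M) - k*trace(M)^2 per pixel.

    Software sketch of the classic algorithm; the thesis maps an
    equivalent streaming pipeline to Zynq programmable logic, which
    this plain NumPy version does not model.
    """
    img = img.astype(np.float64)
    # Image gradients via central differences (axis 0 = rows = y).
    Iy, Ix = np.gradient(img)
    Ixx, Iyy, Ixy = Ix * Ix, Iy * Iy, Ix * Iy

    # Windowed sum (box filter) of the gradient products; borders
    # are left at zero for simplicity.
    def box(a):
        out = np.zeros_like(a)
        r = win // 2
        h, w = a.shape
        for y in range(r, h - r):
            for x in range(r, w - r):
                out[y, x] = a[y - r:y + r + 1, x - r:x + r + 1].sum()
        return out

    Sxx, Syy, Sxy = box(Ixx), box(Iyy), box(Ixy)
    det = Sxx * Syy - Sxy * Sxy
    trace = Sxx + Syy
    return det - k * trace * trace
```

On a synthetic image, the response peaks at corner pixels and stays near zero on flat regions, which is the property the hardware exploits to emit corner coordinates per frame.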
Performance analysis of a hardware accelerator of dependence management for task-based dataflow programming models
Along with the popularity of multicore and manycore, task-based dataflow programming models have attracted great attention for being able to extract high parallelism from applications without exposing the complexity to programmers. One of these pioneers is OpenMP Superscalar (OmpSs). By implementing dynamic task dependence analysis, dataflow scheduling and out-of-order execution in the runtime, OmpSs achieves high performance using coarse- and
medium-granularity tasks. In theory, for the same application, the more parallel tasks that can be exposed, the higher the achievable speedup. Yet this factor is limited by task granularity, up to a point where the runtime overhead outweighs the performance increase and slows down the application. To overcome this handicap, Picos
was proposed to support task-based dataflow programming models like OmpSs as a fast hardware accelerator for fine-grained task and dependence management, and a simulator was developed to perform design space exploration. This paper presents the very first functional hardware prototype inspired by Picos. An embedded system based on a Zynq 7000 All-Programmable SoC is developed to study its capabilities and possible bottlenecks. Initial scalability and hardware consumption studies of different Picos designs are performed to find the one with the highest performance and lowest hardware cost. A further thorough performance study is carried out on both the prototype with the most balanced configuration and the OmpSs software-only alternative. Results show that our hardware support for the OmpSs runtime significantly outperforms the software-only implementation currently available in the runtime system for fine-grained tasks.
This work is supported by the Spanish Government through Programa Severo Ochoa (SEV-2015-0493), by the Spanish Ministry of Science and Technology through the TIN2015-65316-P project, by the Generalitat de Catalunya (contracts 2014-SGR-1051 and 2014-SGR-1272), and by the European Research Council RoMoL Grant Agreement number 321253. We also thank the Xilinx University Program for its hardware and software donations.
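The dynamic dependence analysis that Picos moves into hardware can be illustrated with a minimal software model. The sketch below tracks only read-after-write dependences through a last-writer table (a deliberate simplification; the actual Picos/OmpSs mechanism also handles other dependence types and far more tasks per cycle):

```python
from collections import defaultdict

class DependenceManager:
    """Toy model of task-dependence tracking: a task becomes ready
    only after the last writer of each of its inputs has finished.
    A simplified sketch, not the actual Picos/OmpSs protocol."""

    def __init__(self):
        self.last_writer = {}                # address -> pending writer task
        self.waiting = defaultdict(set)      # task -> unmet producer ids
        self.successors = defaultdict(list)  # producer -> dependent tasks
        self.ready = []                      # tasks cleared for execution

    def submit(self, tid, ins=(), outs=()):
        # RAW deps: this task must wait for pending writers of its inputs.
        deps = {self.last_writer[a] for a in ins if a in self.last_writer}
        for a in outs:
            self.last_writer[a] = tid
        if deps:
            self.waiting[tid] = deps
            for p in deps:
                self.successors[p].append(tid)
        else:
            self.ready.append(tid)

    def finish(self, tid):
        # Retire this task's writes if it is still the last writer.
        for a, w in list(self.last_writer.items()):
            if w == tid:
                del self.last_writer[a]
        # Wake successors whose producers have all finished.
        for s in self.successors.pop(tid, []):
            self.waiting[s].discard(tid)
            if not self.waiting[s]:
                del self.waiting[s]
                self.ready.append(s)
```

A chain such as task 1 writing `x`, task 2 reading `x` and writing `y`, and task 3 reading `y` then unblocks strictly in order, which is the bookkeeping the hardware prototype performs per task-submission and task-finish event.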
LEGaTO: first steps towards energy-efficient toolset for heterogeneous computing
LEGaTO is a three-year EU H2020 project which started in December 2017. The LEGaTO project will leverage task-based programming models to provide a software ecosystem for Made-in-Europe heterogeneous hardware composed of CPUs, GPUs, FPGAs and dataflow engines. The aim is to attain one order of magnitude energy savings from the edge to the converged cloud/HPC.
Predictable migration and communication in the Quest-V multikernel
Quest-V is a system we have been developing from the ground up, with objectives focusing on safety, predictability and efficiency. It is designed to work on emerging multicore processors with hardware virtualization support. Quest-V is implemented as a ``distributed system on a chip'' and comprises multiple sandbox kernels. Sandbox kernels are isolated from one another in separate regions of physical memory, having access to a subset of processing cores and I/O devices. This partitioning prevents system failures in one sandbox affecting the operation of other sandboxes. Shared memory channels managed by system monitors enable inter-sandbox communication.
The distributed nature of Quest-V means each sandbox has a separate physical clock, with all event timings being managed by per-core local timers. Each sandbox is responsible for its own scheduling and I/O management, without requiring intervention of a hypervisor. In this paper, we formulate bounds on inter-sandbox communication in the absence of a global scheduler or global system clock. We also describe how address space migration between sandboxes can be guaranteed without violating service constraints. Experimental results on a working system show the conditions under which Quest-V performs real-time communication and migration.
National Science Foundation (1117025
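A shared-memory channel of the kind described above can be modelled as a single-producer/single-consumer ring buffer: each endpoint writes only its own index, so no lock is needed. The sketch below is illustrative only and does not reflect Quest-V's actual channel layout or its monitor-managed setup:

```python
class SPSCChannel:
    """Illustrative single-producer/single-consumer ring buffer over a
    fixed-size region, modelling a shared-memory channel between two
    sandboxes. A sketch, not Quest-V's actual channel implementation."""

    def __init__(self, capacity=8):
        self.buf = [None] * capacity
        self.head = 0  # advanced only by the consumer
        self.tail = 0  # advanced only by the producer

    def send(self, msg):
        nxt = (self.tail + 1) % len(self.buf)
        if nxt == self.head:
            return False           # channel full: sender must retry later
        self.buf[self.tail] = msg
        self.tail = nxt            # publish only after the slot is filled
        return True

    def recv(self):
        if self.head == self.tail:
            return None            # channel empty
        msg = self.buf[self.head]
        self.head = (self.head + 1) % len(self.buf)
        return msg
```

Because a full channel makes `send` fail rather than block, the time a sender waits is bounded by how often the receiving sandbox drains the buffer, which is the kind of quantity the paper's communication bounds capture.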
Performance evaluation over HW/SW co-design SoC memory transfers for a CNN accelerator
Many FPGAs vendors have recently included embedded
processors in their devices, like Xilinx with ARM-Cortex
A cores, together with programmable logic cells. These devices
are known as Programmable System on Chip (PSoC). Their ARM
cores (embedded in the processing system or PS) communicates
with the programmable logic cells (PL) using ARM-standard AXI
buses. In this paper we analyses the performance of exhaustive
data transfers between PS and PL for a Xilinx Zynq FPGA
in a co-design real scenario for Convolutional Neural Networks
(CNN) accelerator, which processes, in dedicated hardware, a
stream of visual information from a neuromorphic visual sensor
for classification. In the PS side, a Linux operating system is
running, which recollects visual events from the neuromorphic
sensor into a normalized frame, and then it transfers these
frames to the accelerator of multi-layered CNNs, and read results,
using an AXI-DMA bus in a per-layer way. As these kind of
accelerators try to process information as quick as possible, data
bandwidth becomes critical and maintaining a good balanced
data throughput rate requires some considerations. We present
and evaluate several data partitioning techniques to improve the
balance between RX and TX transfer and two different ways
of transfers management: through a polling routine at the userlevel
of the OS, and through a dedicated interrupt-based kernellevel
driver. We demonstrate that for longer enough packets,
the kernel-level driver solution gets better timing in computing a
CNN classification example. Main advantage of using kernel-level
driver is to have safer solutions and to have tasks scheduling in
the OS to manage other important processes for our application,
like frames collection from sensors and their normalization.Ministerio de Economía y Competitividad TEC2016-77785-
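The per-layer transfer scheme in the abstract can be sketched as a simple loop. The `MockDMA` class and its `send`/`recv` methods below are hypothetical stand-ins for the AXI-DMA driver (whether polling or interrupt-based), used only to make the control flow concrete; each "layer" is just a function so the loop runs in plain software:

```python
class MockDMA:
    """Hypothetical stand-in for an AXI-DMA channel. A real driver
    would program descriptors and wait for completion; here a 'layer'
    is an ordinary function applied to the TX data."""

    def __init__(self):
        self.pending = None

    def send(self, layer, data):
        # TX: stream this layer's input into the accelerator.
        self.pending = [layer(x) for x in data]

    def recv(self, layer):
        # RX: read this layer's output back into PS memory.
        out, self.pending = self.pending, None
        return out

def run_cnn(dma, layers, frame):
    """Per-layer TX-then-RX loop, as in the abstract's AXI-DMA scheme:
    each CNN layer's input is sent and its output read back before the
    next layer starts."""
    data = frame
    for layer in layers:
        dma.send(layer, data)   # TX transfer
        data = dma.recv(layer)  # RX transfer
    return data
```

Balancing the RX and TX sides of this loop (for example by splitting each layer's data into packets of comparable size) is exactly where the paper's partitioning techniques and the polling-versus-interrupt driver choice come into play.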