3,237 research outputs found
Formal and Informal Methods for Multi-Core Design Space Exploration
We propose a tool-supported methodology for design-space exploration for
embedded systems. It provides means to define high-level models of applications
and multi-processor architectures and evaluate the performance of different
deployment (mapping, scheduling) strategies while taking uncertainty into
account. We argue that this extension of the scope of formal verification is
important for the viability of the domain.
Comment: In Proceedings QAPL 2014, arXiv:1406.156
An Efficient Energy Aware Adaptive System-On-Chip Architecture For Real-Time Video Analytics
Video analytics applications, which mostly run on embedded devices, have
become prevalent in today's life. This proliferation has necessitated the development
of System-on-Chips (SoCs) that perform most of the processing in a single chip rather than
in discrete components. Embedded vision is bounded by stringent requirements, namely
real-time performance, limited energy, and adaptivity to cope with evolving standards.
Additionally, to design such complex SoCs, particularly on the Zynq All Programmable
SoC, the traditional hardware/software codesign approaches, which rely
on software profiling to perform the hardware/software partitioning, have fallen short
of achieving this task, because profiling cannot predict the performance of an application
on hardware; thus, a model that relates application characteristics to platform
performance is indispensable. Delivering real-time performance for fast-growing video
resolutions while maintaining architectural flexibility is not viable on processors,
Graphics Processing Units, Digital Signal Processors, or Application-Specific Integrated
Circuits. Furthermore, with semiconductor technology scaling, increased power dissipation
is expected, whereas battery capacity is not expected to increase significantly.
A performance model for Zynq is developed using an analytical method and used
in hardware/software codesign to facilitate mapping algorithms to hardware. Afterwards,
an SoC for real-time video analytics is realized on Zynq using the Harris corner
detection algorithm. A careful analysis of the algorithm and efficient utilization of
Zynq resources result in a highly parallelized and pipelined architecture that outperforms
the state-of-the-art. Running on the developed energy-aware adaptive SoC and utilizing
dynamic partial reconfiguration, a context-aware configuration scheduler adheres to the
operating context and trades off video resolution against energy consumption to
sustain the longest possible operation time while delivering real-time performance. Real-time
corner detection at 79.8, 176.9, and 504.2 frames per second for HD1080, HD720,
and VGA, respectively, is achieved, which outperforms the state-of-the-art for HD720
by 31 times and for VGA by 3.5 times. The scheduler configures, at run-time, the
appropriate hardware that satisfies the operating context and user-defined constraints,
choosing among the accelerators developed for the HD1080, HD720, and VGA video standards.
The self-adaptive method achieves 1.77 times longer operation time than a
parametrized IP core for the same battery capacity, with negligible reconfiguration energy
overhead. A marginal effect of reconfiguration time overhead is observed; for
instance, only two video frames are dropped for HD1080p60 during reconfiguration.
Facilitating the design process with analytical modeling, and the efficient
utilization of Zynq resources along with self-adaptivity, result in an efficient energy-aware
SoC that provides real-time performance for video analytics
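The Harris corner response that the accelerator above parallelizes and pipelines can be sketched in plain software. The NumPy version below is an illustration only, not the thesis's hardware pipeline; the window size and the constant k are assumptions. It computes the classic per-pixel response R = det(M) - k*trace(M)^2 from the windowed gradient products:

```python
import numpy as np

def harris_response(img, k=0.04, win=3):
    """Harris corner response R = det(M) - k*trace(M)^2 per pixel.

    Software sketch of the classic algorithm; the thesis maps an
    equivalent streaming pipeline to Zynq programmable logic, which
    this plain NumPy version does not model.
    """
    img = img.astype(np.float64)
    # Image gradients via central differences (axis 0 = rows = y).
    Iy, Ix = np.gradient(img)
    Ixx, Iyy, Ixy = Ix * Ix, Iy * Iy, Ix * Iy

    # Windowed sum (box filter) of the gradient products; borders
    # are left at zero for simplicity.
    def box(a):
        out = np.zeros_like(a)
        r = win // 2
        h, w = a.shape
        for y in range(r, h - r):
            for x in range(r, w - r):
                out[y, x] = a[y - r:y + r + 1, x - r:x + r + 1].sum()
        return out

    Sxx, Syy, Sxy = box(Ixx), box(Iyy), box(Ixy)
    det = Sxx * Syy - Sxy * Sxy
    trace = Sxx + Syy
    return det - k * trace * trace
```

On a synthetic image, the response peaks at corner pixels and stays near zero on flat regions, which is the property the hardware exploits to emit corner coordinates per frame.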
Performance analysis of a hardware accelerator of dependence management for task-based dataflow programming models
Along with the popularity of multicore and manycore, task-based dataflow programming models have attracted great attention for being able to extract high parallelism from applications without exposing the complexity to programmers. One of these pioneers is OpenMP Superscalar (OmpSs). By implementing dynamic task dependence analysis, dataflow scheduling and out-of-order execution in the runtime, OmpSs achieves high performance using coarse- and
medium-granularity tasks. In theory, for the same application, the more parallel tasks that can be exposed, the higher the achievable speedup. Yet this factor is limited by task granularity, up to a point where the runtime overhead outweighs the performance increase and slows down the application. To overcome this handicap, Picos
was proposed to support task-based dataflow programming models like OmpSs as a fast hardware accelerator for fine-grained task and dependence management, and a simulator was developed to perform design space exploration. This paper presents the very first functional hardware prototype inspired by Picos. An embedded system based on a Zynq 7000 All-Programmable SoC is developed to study its capabilities and possible bottlenecks. Initial scalability and hardware consumption studies of different Picos designs are performed to find the one with the highest performance and lowest hardware cost. A further thorough performance study is carried out on both the prototype with the most balanced configuration and the OmpSs software-only alternative. Results show that our hardware support for the OmpSs runtime significantly outperforms the software-only implementation currently available in the runtime system for fine-grained tasks.
This work is supported by the Spanish Government through Programa Severo Ochoa (SEV-2015-0493), by the Spanish Ministry of Science and Technology through the TIN2015-65316-P project, by the Generalitat de Catalunya (contracts 2014-SGR-1051 and 2014-SGR-1272), and by the European Research Council RoMoL Grant Agreement number 321253. We also thank the Xilinx University Program for its hardware and software donations.
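The dynamic dependence analysis that Picos moves into hardware can be illustrated with a minimal software model. The sketch below tracks only read-after-write dependences through a last-writer table (a deliberate simplification; the actual Picos/OmpSs mechanism also handles other dependence types and far more tasks per cycle):

```python
from collections import defaultdict

class DependenceManager:
    """Toy model of task-dependence tracking: a task becomes ready
    only after the last writer of each of its inputs has finished.
    A simplified sketch, not the actual Picos/OmpSs protocol."""

    def __init__(self):
        self.last_writer = {}                # address -> pending writer task
        self.waiting = defaultdict(set)      # task -> unmet producer ids
        self.successors = defaultdict(list)  # producer -> dependent tasks
        self.ready = []                      # tasks cleared for execution

    def submit(self, tid, ins=(), outs=()):
        # RAW deps: this task must wait for pending writers of its inputs.
        deps = {self.last_writer[a] for a in ins if a in self.last_writer}
        for a in outs:
            self.last_writer[a] = tid
        if deps:
            self.waiting[tid] = deps
            for p in deps:
                self.successors[p].append(tid)
        else:
            self.ready.append(tid)

    def finish(self, tid):
        # Retire this task's writes if it is still the last writer.
        for a, w in list(self.last_writer.items()):
            if w == tid:
                del self.last_writer[a]
        # Wake successors whose producers have all finished.
        for s in self.successors.pop(tid, []):
            self.waiting[s].discard(tid)
            if not self.waiting[s]:
                del self.waiting[s]
                self.ready.append(s)
```

A chain such as task 1 writing `x`, task 2 reading `x` and writing `y`, and task 3 reading `y` then unblocks strictly in order, which is the bookkeeping the hardware prototype performs per task-submission and task-finish event.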
LEGaTO: first steps towards energy-efficient toolset for heterogeneous computing
LEGaTO is a three-year EU H2020 project which started in December 2017. The LEGaTO project will leverage task-based programming models to provide a software ecosystem for Made-in-Europe heterogeneous hardware composed of CPUs, GPUs, FPGAs and dataflow engines. The aim is to attain one order of magnitude energy savings from the edge to the converged cloud/HPC.
Predictable migration and communication in the Quest-V multikernel
Quest-V is a system we have been developing from the ground up, with objectives focusing on safety, predictability and efficiency. It is designed to work on emerging multicore processors with hardware virtualization support. Quest-V is implemented as a ``distributed system on a chip'' and comprises multiple sandbox kernels. Sandbox kernels are isolated from one another in separate regions of physical memory, having access to a subset of processing cores and I/O devices. This partitioning prevents system failures in one sandbox affecting the operation of other sandboxes. Shared memory channels managed by system monitors enable inter-sandbox communication.
The distributed nature of Quest-V means each sandbox has a separate physical clock, with all event timings being managed by per-core local timers. Each sandbox is responsible for its own scheduling and I/O management, without requiring intervention of a hypervisor. In this paper, we formulate bounds on inter-sandbox communication in the absence of a global scheduler or global system clock. We also describe how address space migration between sandboxes can be guaranteed without violating service constraints. Experimental results on a working system show the conditions under which Quest-V performs real-time communication and migration.
National Science Foundation (1117025
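A shared-memory channel of the kind described above can be modelled as a single-producer/single-consumer ring buffer: each endpoint writes only its own index, so no lock is needed. The sketch below is illustrative only and does not reflect Quest-V's actual channel layout or its monitor-managed setup:

```python
class SPSCChannel:
    """Illustrative single-producer/single-consumer ring buffer over a
    fixed-size region, modelling a shared-memory channel between two
    sandboxes. A sketch, not Quest-V's actual channel implementation."""

    def __init__(self, capacity=8):
        self.buf = [None] * capacity
        self.head = 0  # advanced only by the consumer
        self.tail = 0  # advanced only by the producer

    def send(self, msg):
        nxt = (self.tail + 1) % len(self.buf)
        if nxt == self.head:
            return False           # channel full: sender must retry later
        self.buf[self.tail] = msg
        self.tail = nxt            # publish only after the slot is filled
        return True

    def recv(self):
        if self.head == self.tail:
            return None            # channel empty
        msg = self.buf[self.head]
        self.head = (self.head + 1) % len(self.buf)
        return msg
```

Because a full channel makes `send` fail rather than block, the time a sender waits is bounded by how often the receiving sandbox drains the buffer, which is the kind of quantity the paper's communication bounds capture.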
Performance evaluation over HW/SW co-design SoC memory transfers for a CNN accelerator
Many FPGAs vendors have recently included embedded
processors in their devices, like Xilinx with ARM-Cortex
A cores, together with programmable logic cells. These devices
are known as Programmable System on Chip (PSoC). Their ARM
cores (embedded in the processing system or PS) communicates
with the programmable logic cells (PL) using ARM-standard AXI
buses. In this paper we analyses the performance of exhaustive
data transfers between PS and PL for a Xilinx Zynq FPGA
in a co-design real scenario for Convolutional Neural Networks
(CNN) accelerator, which processes, in dedicated hardware, a
stream of visual information from a neuromorphic visual sensor
for classification. In the PS side, a Linux operating system is
running, which recollects visual events from the neuromorphic
sensor into a normalized frame, and then it transfers these
frames to the accelerator of multi-layered CNNs, and read results,
using an AXI-DMA bus in a per-layer way. As these kind of
accelerators try to process information as quick as possible, data
bandwidth becomes critical and maintaining a good balanced
data throughput rate requires some considerations. We present
and evaluate several data partitioning techniques to improve the
balance between RX and TX transfer and two different ways
of transfers management: through a polling routine at the userlevel
of the OS, and through a dedicated interrupt-based kernellevel
driver. We demonstrate that for longer enough packets,
the kernel-level driver solution gets better timing in computing a
CNN classification example. Main advantage of using kernel-level
driver is to have safer solutions and to have tasks scheduling in
the OS to manage other important processes for our application,
like frames collection from sensors and their normalization.Ministerio de Economía y Competitividad TEC2016-77785-
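The per-layer transfer scheme in the abstract can be sketched as a simple loop. The `MockDMA` class and its `send`/`recv` methods below are hypothetical stand-ins for the AXI-DMA driver (whether polling or interrupt-based), used only to make the control flow concrete; each "layer" is just a function so the loop runs in plain software:

```python
class MockDMA:
    """Hypothetical stand-in for an AXI-DMA channel. A real driver
    would program descriptors and wait for completion; here a 'layer'
    is an ordinary function applied to the TX data."""

    def __init__(self):
        self.pending = None

    def send(self, layer, data):
        # TX: stream this layer's input into the accelerator.
        self.pending = [layer(x) for x in data]

    def recv(self, layer):
        # RX: read this layer's output back into PS memory.
        out, self.pending = self.pending, None
        return out

def run_cnn(dma, layers, frame):
    """Per-layer TX-then-RX loop, as in the abstract's AXI-DMA scheme:
    each CNN layer's input is sent and its output read back before the
    next layer starts."""
    data = frame
    for layer in layers:
        dma.send(layer, data)   # TX transfer
        data = dma.recv(layer)  # RX transfer
    return data
```

Balancing the RX and TX sides of this loop (for example by splitting each layer's data into packets of comparable size) is exactly where the paper's partitioning techniques and the polling-versus-interrupt driver choice come into play.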