24,672 research outputs found
Multi-core Architectures and Streaming Applications
In this paper we focus on algorithms and reconfigurable multi-core architectures for streaming digital signal processing (DSP) applications. The multi-core concept has a number of advantages: (1) depending on the requirements more or fewer cores can be switched on/off, (2) the multi-core structure fits well to future process technologies, more cores will be available in advanced process technologies, but the complexity per core does not increase, (3) the multi-core concept is fault tolerant, faulty cores can be discarded and (4) multiple cores can be configured fast in parallel. Because in our approach processing and memory are combined in the cores, tasks can be executed efficiently on cores (locality of reference). There are a number of application domains that can be considered as streaming DSP applications: for example wireless baseband processing (for HiperLAN/2, WiMax, DAB, DRM, and DVB), multimedia processing (e.g. MPEG, MP3 coding/decoding), medical image processing, colour image processing, sensor processing (e.g. remote surveillance cameras) and phased array radar systems. In this paper the key characteristics of streaming DSP applications are highlighted, and the characteristics of the processing architectures to efficiently support these types of applications are addressed. We present the initial results of the Annabelle chip that we designed with our approach
FastFlow: Efficient Parallel Streaming Applications on Multi-core
Shared memory multiprocessors come back to popularity thanks to rapid
spreading of commodity multi-core architectures. As ever, shared memory
programs are fairly easy to write and quite hard to optimise; providing
multi-core programmers with optimising tools and programming frameworks is a
nowadays challenge. Few efforts have been done to support effective streaming
applications on these architectures. In this paper we introduce FastFlow, a
low-level programming framework based on lock-free queues explicitly designed
to support high-level languages for streaming applications. We compare FastFlow
with state-of-the-art programming frameworks such as Cilk, OpenMP, and Intel
TBB. We experimentally demonstrate that FastFlow is always more efficient than
all of them in a set of micro-benchmarks and on a real world application; the
speedup edge of FastFlow over other solutions might be bold for fine grain
tasks, as an example +35% on OpenMP, +226% on Cilk, +96% on TBB for the
alignment of protein P01111 against UniProt DB using Smith-Waterman algorithm.Comment: 23 pages + cove
Cache-Conscious Offline Real-Time Task Scheduling for Multi-Core Processors
Most schedulability analysis techniques for multi-core architectures assume a single Worst-Case Execution Time (WCET) per task, which is valid in all execution conditions. This assumption is too pessimistic for parallel applications running on multi-core architectures with local instruction or data caches, for which the WCET of a task depends on the cache contents at the beginning of its execution, itself depending on the task that was executed before the task under study.
In this paper, we propose two scheduling techniques for multi-core architectures equipped with local instruction and data caches. The two techniques schedule a parallel application modeled as a task graph, and generate a static partitioned non-preemptive schedule. We propose an optimal method, using an Integer Linear Programming (ILP) formulation, as well as a heuristic method based on list scheduling. Experimental results show that by taking into account the effect of private caches on tasks\u27 WCETs, the length of generated schedules is significantly reduced as compared to schedules generated by cache-unaware scheduling methods. The observed schedule length reduction on streaming applications is 11% on average for the optimal method and 9% on average for the heuristic method
Design and optimization of a portable LQCD Monte Carlo code using OpenACC
The present panorama of HPC architectures is extremely heterogeneous, ranging
from traditional multi-core CPU processors, supporting a wide class of
applications but delivering moderate computing performance, to many-core GPUs,
exploiting aggressive data-parallelism and delivering higher performances for
streaming computing applications. In this scenario, code portability (and
performance portability) become necessary for easy maintainability of
applications; this is very relevant in scientific computing where code changes
are very frequent, making it tedious and prone to error to keep different code
versions aligned. In this work we present the design and optimization of a
state-of-the-art production-level LQCD Monte Carlo application, using the
directive-based OpenACC programming model. OpenACC abstracts parallel
programming to a descriptive level, relieving programmers from specifying how
codes should be mapped onto the target architecture. We describe the
implementation of a code fully written in OpenACC, and show that we are able to
target several different architectures, including state-of-the-art traditional
CPUs and GPUs, with the same code. We also measure performance, evaluating the
computing efficiency of our OpenACC code on several architectures, comparing
with GPU-specific implementations and showing that a good level of
performance-portability can be reached.Comment: 26 pages, 2 png figures, preprint of an article submitted for
consideration in International Journal of Modern Physics
HSTREAM: A directive-based language extension for heterogeneous stream computing
Big data streaming applications require utilization of heterogeneous parallel
computing systems, which may comprise multiple multi-core CPUs and many-core
accelerating devices such as NVIDIA GPUs and Intel Xeon Phis. Programming such
systems require advanced knowledge of several hardware architectures and
device-specific programming models, including OpenMP and CUDA. In this paper,
we present HSTREAM, a compiler directive-based language extension to support
programming stream computing applications for heterogeneous parallel computing
systems. HSTREAM source-to-source compiler aims to increase the programming
productivity by enabling programmers to annotate the parallel regions for
heterogeneous execution and generate target specific code. The HSTREAM runtime
automatically distributes the workload across CPUs and accelerating devices. We
demonstrate the usefulness of HSTREAM language extension with various
applications from the STREAM benchmark. Experimental evaluation results show
that HSTREAM can keep the same programming simplicity as OpenMP, and the
generated code can deliver performance beyond what CPUs-only and GPUs-only
executions can deliver.Comment: Preprint, 21st IEEE International Conference on Computational Science
and Engineering (CSE 2018
A Power-Aware Framework for Executing Streaming Programs on Networks-on-Chip
Nilesh Karavadara, Simon Folie, Michael Zolda, Vu Thien Nga Nguyen, Raimund Kirner, 'A Power-Aware Framework for Executing Streaming Programs on Networks-on-Chip'. Paper presented at the Int'l Workshop on Performance, Power and Predictability of Many-Core Embedded Systems (3PMCES'14), Dresden, Germany, 24-28 March 2014.Software developers are discovering that practices which have successfully served single-core platforms for decades do no longer work for multi-cores. Stream processing is a parallel execution model that is well-suited for architectures with multiple computational elements that are connected by a network. We propose a power-aware streaming execution layer for network-on-chip architectures that addresses the energy constraints of embedded devices. Our proof-of-concept implementation targets the Intel SCC processor, which connects 48 cores via a network-on- chip. We motivate our design decisions and describe the status of our implementation
Worst-Case Communication Overhead in a Many-Core based Shared-Memory Model
National audienceWith emerging many-core architectures, using on-chip shared memories is an interesting approach because it provides high bandwidth and high throughput data exchange. Such a feature is usually implemented as a multi-bus multi-banked memory. Since predicting timing behavior is key to efficient design and verification of embedded real-time systems, the question that arises is how to evaluate the access time for one memory access of a given task while others may concurrently access the same memory-bank at t the same time. In this paper, we give the answers for a subset of streaming applications modeled like CSDF Model of Computation and implemented in Kalray’s MPPA chip
- …