64,819 research outputs found
Communication costs in a multi-tiered MPSoC
The amount of digital processing required for phased array beamformers is very large. It requires many parallel processors, which can be organized in a multi-tiered structure. Communication costs differ for each of the stages in such an architecture. For example, communication costs from the antenna front-end to the first processing stages is costly because of the amount of connections and data rate. Furthermore there is a trade-off between sequential processing exploiting locality of reference versus exploiting parallelism but adding communication costs. Thus, the optimal architecture depends on the importance that is given to the different measures.\ud
\ud
A model is presented to determine the partitioning of a (beamforming) system based on communication costs. It is shown that different solutions can be explored based on the cost model and the incorporated quantitative and qualitative measures. Determining the importance of each measure is subjective to the situation and application. In this work a simple beamforming application is used optimised for energy efficiency
Multi-core Architectures and Streaming Applications
In this paper we focus on algorithms and reconfigurable multi-core architectures for streaming digital signal processing (DSP) applications. The multi-core concept has a number of advantages: (1) depending on the requirements more or fewer cores can be switched on/off, (2) the multi-core structure fits well to future process technologies, more cores will be available in advanced process technologies, but the complexity per core does not increase, (3) the multi-core concept is fault tolerant, faulty cores can be discarded and (4) multiple cores can be configured fast in parallel. Because in our approach processing and memory are combined in the cores, tasks can be executed efficiently on cores (locality of reference). There are a number of application domains that can be considered as streaming DSP applications: for example wireless baseband processing (for HiperLAN/2, WiMax, DAB, DRM, and DVB), multimedia processing (e.g. MPEG, MP3 coding/decoding), medical image processing, colour image processing, sensor processing (e.g. remote surveillance cameras) and phased array radar systems. In this paper the key characteristics of streaming DSP applications are highlighted, and the characteristics of the processing architectures to efficiently support these types of applications are addressed. We present the initial results of the Annabelle chip that we designed with our approach
Analysing Astronomy Algorithms for GPUs and Beyond
Astronomy depends on ever increasing computing power. Processor clock-rates
have plateaued, and increased performance is now appearing in the form of
additional processor cores on a single chip. This poses significant challenges
to the astronomy software community. Graphics Processing Units (GPUs), now
capable of general-purpose computation, exemplify both the difficult
learning-curve and the significant speedups exhibited by massively-parallel
hardware architectures. We present a generalised approach to tackling this
paradigm shift, based on the analysis of algorithms. We describe a small
collection of foundation algorithms relevant to astronomy and explain how they
may be used to ease the transition to massively-parallel computing
architectures. We demonstrate the effectiveness of our approach by applying it
to four well-known astronomy problems: Hogbom CLEAN, inverse ray-shooting for
gravitational lensing, pulsar dedispersion and volume rendering. Algorithms
with well-defined memory access patterns and high arithmetic intensity stand to
receive the greatest performance boost from massively-parallel architectures,
while those that involve a significant amount of decision-making may struggle
to take advantage of the available processing power.Comment: 10 pages, 3 figures, accepted for publication in MNRA
Mapping and Scheduling of Directed Acyclic Graphs on An FPFA Tile
An architecture for a hand-held multimedia device requires components that are energy-efficient, flexible, and provide high performance. In the CHAMELEON [4] project we develop a coarse grained reconfigurable device for DSP-like algorithms, the so-called Field Programmable Function Array (FPFA). The FPFA devices are reminiscent to FPGAs, but with a matrix of Processing Parts (PP) instead of CLBs. The design of the FPFA focuses on: (1) Keeping each PP small to maximize the number of PPs that can fit on a chip; (2) providing sufficient flexibility; (3) Low energy consumption; (4) Exploiting the maximum amount of parallelism; (5) A strong support tool for FPFA-based applications. The challenge in providing compiler support for the FPFA-based design stems from the flexibility of the FPFA structure. If we do not use the characteristics of the FPFA structure properly, the advantages of an FPFA may become its disadvantages. The GECKO1project focuses on this problem. In this paper, we present a mapping and scheduling scheme for applications running on one FPFA tile. Applications are written in C and C code is translated to a Directed Acyclic Graphs (DAG) [4]. This scheme can map a DAG directly onto the reconfigurable PPs of an FPFA tile. It tries to achieve low power consumption by exploiting locality of reference and high performance by exploiting maximum parallelism
Cache-conscious Splitting of MapReduce Tasks and its Application to Stencil Computations
Modern cluster systems are typically composed by nodes with multiple processing
units and memory hierarchies comprising multiple cache levels of various sizes. To leverage
the full potential of these architectures it is necessary to explore concepts such as
parallel programming and the layout of data onto the memory hierarchy. However, the
inherent complexity of these concepts and the heterogeneity of the target architectures
raises several challenges at application development and performance portability levels,
respectively. In what concerns parallel programming, several model and frameworks
are available, of which MapReduce [16] is one of the most popular. It was developed
at Google [16] for the parallel and distributed processing of large amounts of data in
large clusters of commodity machines. Although being very powerful tools, the reference
MapReduce frameworks, such as Hadoop and Spark, do not leverage the characteristics
of the underlying memory hierarchy. This shortcoming is particularly noticeable in
computations that benefit from temporal locality, such as stencil computations.
In this context, the goal of this thesis is to improve the performance of MapReduce
computations that benefit from temporal locality. To that end we optimize the mapping
of MapReduce computations in a machine’s cache memory hierarchy by applying cacheaware
tiling techniques. We prototyped our solution on top of the framework Hadoop
MapReduce, incorporating a cache-awareness in the splitting stage.
To validate our solution and assess its benefits, we developed an API for expressing
stencil computations on top the developed framework. The experimental results show
that, for a typical stencil computation, our solution delivers an average speed-up of 1.77
while reaching a peek speed-up of 3.2. These findings allows us to conclude that cacheaware
decomposition of MapReduce computations considerably boosts the execution of
this class of MapReduce computations
Near-optimal loop tiling by means of cache miss equations and genetic algorithms
The effectiveness of the memory hierarchy is critical for the performance of current processors. The performance of the memory hierarchy can be improved by means of program transformations such as loop tiling, which is a code transformation targeted to reduce capacity misses. This paper presents a novel systematic approach to perform near-optimal loop tiling based on an accurate data locality analysis (cache miss equations) and a powerful technique to search the solution space that is based on a genetic algorithm. The results show that this approach can remove practically all capacity misses for all considered benchmarks. The reduction of replacement misses results in a decrease of the miss ratio that can be as significant as a factor of 7 for the matrix multiply kernel.Peer ReviewedPostprint (published version
Parallel Construction of Wavelet Trees on Multicore Architectures
The wavelet tree has become a very useful data structure to efficiently
represent and query large volumes of data in many different domains, from
bioinformatics to geographic information systems. One problem with wavelet
trees is their construction time. In this paper, we introduce two algorithms
that reduce the time complexity of a wavelet tree's construction by taking
advantage of nowadays ubiquitous multicore machines.
Our first algorithm constructs all the levels of the wavelet in parallel in
time and bits of working space, where
is the size of the input sequence and is the size of the alphabet. Our
second algorithm constructs the wavelet tree in a domain-decomposition fashion,
using our first algorithm in each segment, reaching time and
bits of extra space, where is the
number of available cores. Both algorithms are practical and report good
speedup for large real datasets.Comment: This research has received funding from the European Union's Horizon
2020 research and innovation programme under the Marie Sk{\l}odowska-Curie
Actions H2020-MSCA-RISE-2015 BIRDS GA No. 69094
- …